Dani Drywa






Line ranges in Swift - Redux


Last week I published a post about finding the fastest way to retrieve line ranges from a large string. The way to determine which method was the fastest was very crude and I had this nagging feeling that it was also wrong. To save you the time of reading the last post again here is a quick summary. I processed a file of about 24k lines of assembly code that contained about 15 characters per line on average. I wrote about 6 different ways of finding the half-open ranges of each line, measured their runtime, and came to the following results:

  1. Using String.Index with a subscript: 24 milliseconds
  2. Using a Character iterator and calculating the next index: 19 milliseconds
  3. Zipping the Character and Indices iterator together: 15 milliseconds
  4. Using enumerateLines: 18 milliseconds
  5. Using lineRange: 14 milliseconds
  6. Using split: 14 milliseconds

These results meant that using lineRange or split were the fastest options but I can now say that those results were indeed wrong. The reasons for this were the very crude nature of my tests and some glaring errors regarding Xcode build optimisations which can be boiled down to my inexperience with Xcode. I ran all the tests again with a benchmarking framework and build the project from the command line with optimisations enabled. And this time I got the following results:

Benchmark results for finding line ranges in a String with 24k lines of Assembly Code
Benchmark results for finding line ranges in a String with 24k lines of Assembly Code. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

Before talking about the results I have to mention that the code has not been modified since last week but I have switched to a MacMini with an M2 Pro Processor.

These new results confirm that using lineRange is still the fastest way of finding all line ranges in a given string. But I was quite surprised to see that using direct indices via a subscript is not the slowest method. That would be enumeratedLines. Using indices is actually the second fastest method to retrieve line ranges in a string. It goes to show: Do not trust crude benchmarks within Xcode. As a second exercise I also increased the amount of data to 1M lines of assembly code.

Benchmark results for finding line ranges in a String with 1M lines of Assembly Code
Benchmark results for finding line ranges in a String with 1M lines of Assembly Code. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

Nothing surprising to see here. Just a linear continuation of the times measured with 24k lines all the way up to 1M lines.

Now with a proper benchmark in place I was curious to see if there was any difference between using String and Substring when trying to find line ranges, so I changed all the test code to process a Substring as input data instead and ran the benchmarks again.

Benchmark results for finding line ranges in a Substring with 1M lines of Assembly Code
Benchmark results for finding line ranges in a Substring with 1M lines of Assembly Code. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

What the heck is happening? All methods were sort of in the same ballpark in terms of runtime as the version that was using String. But lineRange was just exploding in run time when used on a Substring. This was quite a shock! My first instinct was that there had to be some kind of mistake in my code but after looking at it countless times I could not find any obvious mistake.

extension StringProtocol {
    func lineRangeBenchmark() -> [Range<Self.Index>] {
        var output: [Range<Self.Index>] = []

        var startIndex = self.startIndex

        repeat {
            let range = self.lineRange(for: startIndex..<startIndex)
            output.append(range)
            startIndex = range.upperBound
        } while startIndex < self.endIndex
        
        return output
    }
}
The lineRange Benchmark code used by String and Substring

I ran the lineRange code in isolated builds as well and was getting the same results. Clearly something horrible was going on in the lineRange implementation for substrings in Swift and I was the wrong person to figure out what.

Because lineRange was skewing the results I had to exclude it form the graph of the Substring versions to get a better picture of the run times.

Benchmark results for finding line ranges in a Substring with 1M lines of Assembly Code - excluding the lineRange method
Benchmark results for finding line ranges in a Substring with 1M lines of Assembly Code - excluding the lineRange method. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

For substrings I could see that using direct indices was the fastest and enumeratedLines was no longer the worst option. That would be both iterator versions instead. This was quite interesting to me but I also wanted a direct one to one comparison between the String and Substring version of each implementation. In the following graphs the line labelled before is for the String version, and the line labelled after is for the Substring version.


Enumerate Lines - String vs. Substring


Benchmark results for finding line ranges in a String/Substring with enumerateLines
Benchmark results for finding line ranges in a String/Substring with enumerateLines. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

There was not much difference between using enumerateLines on String or Substring but in general I would avoid using this method as it is quite slow compared to other solutions.


Split - String vs. Substring


Benchmark results for finding line ranges in a String/Substring with split
Benchmark results for finding line ranges in a String/Substring with split. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

Using split was almost 20 milliseconds worse with a Substring compared to a String. However, for 1M lines of assembly code this difference is negligible.


Character Iterator - String vs. Substring


Benchmark results for finding line ranges in a String/Substring with an iterator over characters
Benchmark results for finding line ranges in a String/Substring with an iterator over characters. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

Using iterators on a Substring come with performance implications. Not to mention that iterators are already slower than using plain indices on a String.


Zipped Character and Indices Iterator - String vs. Substring


Benchmark results for finding line ranges in a String/Substring with a zipped iterator of characters and indices
Benchmark results for finding line ranges in a String/Substring with a zipped iterator of characters and indices. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

Similar characteristics as a plain Character iterator with an added penalty of using a second iterator.


Indices with Subscript - String vs. Substring


Benchmark results for finding line ranges in a String/Substring with subscript
Benchmark results for finding line ranges in a String/Substring with subscript. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

Similar to split, using indices with a subscript on a Substring comes with slight performance implications. When dealing with StringProtocol that could either be a String or a Substring I would recommend using subscripts with a String.Index. This version is fast and predictable.


LineRange - String vs. Substring


Benchmark results for finding line ranges in a String/Substring with lineRange
Benchmark results for finding line ranges in a String/Substring with lineRange. Picture generated by DANIEL DRYWA with swift-collection-benchmark by Apple Inc. Picture License: Creative Commons Attribution-ShareAlike 4.0 International License.

What is there left to say. I would not hesitate to use lineRange when dealing with a String. But if working with a StringProtocol that could either be a String or Substring it would be best to avoid using lineRange altogether.


Conclusion


These results have helped to change the approach for my current project. Based on my first crude benchmarks I was content to use split to get the line ranges and the substrings for each line in a string. I would then use the substring to parse the assembly code within. But now that I learned that Substring comes with performance implications, I am no longer considering this approach. Instead I would use subscript with indices and leave the compiler to do the heavy lifting of optimising it.