Line ranges in Swift
What is the best way to enumerate over a String on a line by line basis in Swift? I asked myself that question and tried to come up with a solution. I had an assembly source file with about 24k lines of assembly instructions (about 15 characters per line on average) that I was using for my tests on a Mac Mini from 2018 with a 3 GHz Intel i5 (no multithreading used). The goal was to find the start and end index of each line of assembly code in the given string (Ideally storing this information as an half-open range per line).
My first attempt was to directly enumerate over each single Character of the string and adding the range to the output array whenever I encountered a newline character with isNewline.
This piece of code, with optimisations enabled, took about 24 milliseconds on my machine. That is not too bad... Until I realised that a modern video game in 4K resolution running at 60 frames per second takes about 16.67 milliseconds to render a single frame. And there is a lot more happening on the screen than what I am trying to do here with a bunch of lines of text.
But why is the above code so slow? To answer this I tried to look into the Swift and Foundation source code but I got quickly lost because GitHub is a terrible website to browse and search source code, and the instructions to open the project with Xcode made me want to throw myself off a cliff. I also did not want to start up a profiler for a tiny micro benchmark. Instead I had a few guesses of what might be happening based on what I saw when trying to browse through the swift code. Using an index to retrieve a character in the string is costly due to validating and bounds checking the index every time
input[currentIndex] is used. Calculating the next index with
input.formIndex(after: ¤tIndex) is potentially costly as String.Index is not just a simple integer into an array (see source file comment on line 17).
There was no way for me to utilise unsafe Swift to enumerate over the characters with pointers and avoid the index bounds checking unless I went down the UTF8 byte level route. I would have to give up the work that String has already done for me which includes validating and combining Unicode scalar values into single characters. And that is not something I wanted to do at the time. However, another way to enumerate over the string characters one by one is to use a for loop.
This version took about 19 milliseconds which is already a lot better than the index variant. The next thing I wanted to attempt was to remove the manual index advancement. If I could avoid manually calculating indices I could probably shave off a few more milliseconds. And luckily string provides a property that returns a DefaultIndices collection which contains all valid indices for the string in ascending order. I used zip to combine the indices and character iterators together and looped over them in a for loop.
This version took 15 milliseconds to run which was nice, but this solution didn't quite feel right to me because of these two separate iterators that might not even align properly. They probably do but in the back of my head I had this constant voice saying "what if they don't?". So I went on to look for other solutions.
String also offers a method called enumerateLines which will enumerate over the string on a line by line basis and invoke a closure. The closure has two arguments: the string that represents the current line and a boolean reference which allows for early termination of the loop.
This version is by far the nicest to read but to my surprise it took about 18 milliseconds to execute. I think it is slower because for each line it creates a new string instance as well as calling a closure that takes a copy of the output to modify it. Nice to read but not something I would want to use given the faster two iterator solution. But there was another solution I wanted to try, which was using lineRange. This method takes a range as an argument and returns a new range that makes up the start and end of the line (or lines) that the given range is in. The range it returns is an half-open range with the upperBound being the character after the newline character, which means the
upperBound can then be used for the next
This version took about 14 milliseconds to execute which so far was the winner. Plus
lineRange is a useful method to have when it comes to editing text and figuring out which lines have have been touched by a given range.
There was one more method I wanted to try, and that was split. I tend to avoid splitting strings in most programming languages because it is a costly operation that creates a lot of copies. But to my surprise in Swift
split is returning substrings instead of creating a new copy, which is excellent.
Split also took about 14 milliseconds to execute. This means either
lineRange do the best job for the task and I would use either one of those depending on what it is I am trying to achieve. With some crude back of the envelope calculation we could assume that both of these methods would be able to retrieve ranges for roughly 1.7 million lines of assembly code in about 1 second. That said, I do not know if this is the fastest way of retrieving line ranges in Swift but for now this is adequate enough for me.
I would like to add that these micro benchmarks have been conducted with Xcode unit tests by using the measure method. The times stated above are the average runtime of each test and it is by no means supposed to be a highly accurate measurement. I have not used any profiling tools to actually figure out what the bottlenecks for each solution were and that was also not the point of this exercise. All I wanted was a quick and dirty way of figuring out which method of retrieving line ranges is faster so I could have a good starting point for the project I am currently working on.