2023-02-04

Line ranges in Swift

What is the best way to enumerate over a String on a line by line basis in Swift? I asked myself that question and tried to come up with a solution. I had an assembly source file with about 24k lines of assembly instructions (about 15 characters per line on average) that I was using for my tests on a Mac Mini from 2018 with a 3 GHz Intel i5 (no multithreading used). The goal was to find the start and end index of each line of assembly code in the given string (Ideally storing this information as an half-open range per line).

My first attempt was to directly enumerate over each single Character of the string and adding the range to the output array whenever I encountered a newline character with isNewline.

var input = ... // The input string from the source file
var output: [Range<String.Index>] = [] // The output ranges

var startIndex = input.startIndex // The start of the line
var currentIndex = startIndex

// Loop until the end of the input string has been reached
while currentIndex < input.endIndex {
    // Retrieve the current character at the index
    let character = input[currentIndex]
    
    // Advance the index to the next character
    input.formIndex(after: &currentIndex)
    
    // If the character is a newline, add it to the output
    if character.isNewline {
        output.append(startIndex..<currentIndex)
        
        // Set the beginning of the next line to the current index
        // which is past the newline character
        startIndex = currentIndex
    }
}

Retrieving ranges (start and end index) of each line in the source file with the help of an index

This piece of code, with optimisations enabled, took about 24 milliseconds on my machine. That is not too bad... Until I realised that a modern video game in 4K resolution running at 60 frames per second takes about 16.67 milliseconds to render a single frame. And there is a lot more happening on the screen than what I am trying to do here with a bunch of lines of text.

A screenshot of God of War Ragnarök taken on the PlayStation 5. Screenshot taken by DANIEL DRYWA. God of War Ragnarök is a trademark of Santa Monica Studio/Sony Interactive Entertainment. PlayStation 5 is a trademark of Sony Interactive Entertainment.

But why is the above code so slow? To answer this I tried to look into the Swift and Foundation source code but I got quickly lost because GitHub is a terrible website to browse and search source code, and the instructions to open the project with Xcode made me want to throw myself off a cliff. I also did not want to start up a profiler for a tiny micro benchmark. Instead I had a few guesses of what might be happening based on what I saw when trying to browse through the swift code. Using an index to retrieve a character in the string is costly due to validating and bounds checking the index every time input[currentIndex] is used. Calculating the next index with input.formIndex(after: &currentIndex) is potentially costly as String.Index is not just a simple integer into an array (see source file comment on line 17).

There was no way for me to utilise unsafe Swift to enumerate over the characters with pointers and avoid the index bounds checking unless I went down the UTF8 byte level route. I would have to give up the work that String has already done for me which includes validating and combining Unicode scalar values into single characters. And that is not something I wanted to do at the time. However, another way to enumerate over the string characters one by one is to use a for loop.

var input = ... // The input string from the source file
var output: [Range<String.Index>] = []

var startIndex = input.startIndex
var currentIndex = startIndex

for character in input {
    input.formIndex(after: &currentIndex)
    
    if character.isNewline {
        output.append(startIndex..<currentIndex)
        startIndex = currentIndex
    }
}

Retrieving ranges (start and end index) of each line in the source file with an iterator

This version took about 19 milliseconds which is already a lot better than the index variant. The next thing I wanted to attempt was to remove the manual index advancement. If I could avoid manually calculating indices I could probably shave off a few more milliseconds. And luckily string provides a property that returns a DefaultIndices collection which contains all valid indices for the string in ascending order. I used zip to combine the indices and character iterators together and looped over them in a for loop.

var input = ... // The input string from the source file
var output: [Range<String.Index>] = []

var startIndex = input.startIndex

var newLine = false
for (currentIndex, character) in zip(input.indices, input) {
    if newLine {
        newLine = false
        output.append(startIndex..<currentIndex)
        startIndex = currentIndex
    }
    
    // Avoid processing of newline until the next character.
    // This way the half-open range can be constructed without
    // calling formIndex
    newLine = character.isNewline
}

// This makes sure we have added the last line
output.append(startIndex..<input.endIndex)

Retrieving ranges (start and end index) of each line in the source file with two iterators zipped together

This version took 15 milliseconds to run which was nice, but this solution didn't quite feel right to me because of these two separate iterators that might not even align properly. They probably do but in the back of my head I had this constant voice saying "what if they don't?". So I went on to look for other solutions.

String also offers a method called enumerateLines which will enumerate over the string on a line by line basis and invoke a closure. The closure has two arguments: the string that represents the current line and a boolean reference which allows for early termination of the loop.

var input = ... // The input string from the source file
var output: [Range<String.Index>] = []

input.enumerateLines() { (line, _) in // I do not care about the early termination boolean
    output.append(line.startIndex..<line.endIndex)
}

Retrieving ranges (start and end index) of each line in the source file with enumerateLines closure

This version is by far the nicest to read but to my surprise it took about 18 milliseconds to execute. I think it is slower because for each line it creates a new string instance as well as calling a closure that takes a copy of the output to modify it. Nice to read but not something I would want to use given the faster two iterator solution. But there was another solution I wanted to try, which was using lineRange. This method takes a range as an argument and returns a new range that makes up the start and end of the line (or lines) that the given range is in. The range it returns is an half-open range with the upperBound being the character after the newline character, which means the upperBound can then be used for the next lineRange call.

var input = ... // The input string from the source file
var output: [Range<String.Index>] = []

var startIndex = input.startIndex

repeat {
    let range = input.lineRange(for: startIndex..<startIndex)
    output.append(range)
    startIndex = range.upperBound
} while startIndex < input.endIndex

Retrieving ranges (start and end index) of each line in the source file with lineRange

This version took about 14 milliseconds to execute which so far was the winner. Plus lineRange is a useful method to have when it comes to editing text and figuring out which lines have have been touched by a given range.

There was one more method I wanted to try, and that was split. I tend to avoid splitting strings in most programming languages because it is a costly operation that creates a lot of copies. But to my surprise in Swift split is returning substrings instead of creating a new copy, which is excellent.

var input = ... // The input string from the source file

let output = input.split(omittingEmptySubsequences: false, whereSeparator: \.isNewline).map { $0.startIndex..<$0.endIndex }

Retrieving ranges (start and end index) of each line in the source file with split

Split also took about 14 milliseconds to execute. This means either split or lineRange do the best job for the task and I would use either one of those depending on what it is I am trying to achieve. With some crude back of the envelope calculation we could assume that both of these methods would be able to retrieve ranges for roughly 1.7 million lines of assembly code in about 1 second. That said, I do not know if this is the fastest way of retrieving line ranges in Swift but for now this is adequate enough for me.

I would like to add that these micro benchmarks have been conducted with Xcode unit tests by using the measure method. The times stated above are the average runtime of each test and it is by no means supposed to be a highly accurate measurement. I have not used any profiling tools to actually figure out what the bottlenecks for each solution were and that was also not the point of this exercise. All I wanted was a quick and dirty way of figuring out which method of retrieving line ranges is faster so I could have a good starting point for the project I am currently working on.