Line ranges in Swift - Redux
Last week I published a post about finding the fastest way to retrieve line ranges from a large string. The way to determine which method was the fastest was very crude and I had this nagging feeling that it was also wrong. To save you the time of reading the last post again here is a quick summary. I processed a file of about 24k lines of assembly code that contained about 15 characters per line on average. I wrote about 6 different ways of finding the half-open ranges of each line, measured their runtime, and came to the following results:
String.Indexwith a subscript: 24 milliseconds
- Using a
Characteriterator and calculating the next index: 19 milliseconds
- Zipping the
Indicesiterator together: 15 milliseconds
enumerateLines: 18 milliseconds
lineRange: 14 milliseconds
split: 14 milliseconds
These results meant that using
split were the fastest options but I can now say that those results were indeed wrong. The reasons for this were the very crude nature of my tests and some glaring errors regarding Xcode build optimisations which can be boiled down to my inexperience with Xcode. I ran all the tests again with a benchmarking framework and build the project from the command line with optimisations enabled. And this time I got the following results:
Before talking about the results I have to mention that the code has not been modified since last week but I have switched to a MacMini with an M2 Pro Processor.
These new results confirm that using
lineRange is still the fastest way of finding all line ranges in a given string. But I was quite surprised to see that using direct indices via a subscript is not the slowest method. That would be
enumeratedLines. Using indices is actually the second fastest method to retrieve line ranges in a string. It goes to show: Do not trust crude benchmarks within Xcode. As a second exercise I also increased the amount of data to 1M lines of assembly code.
Nothing surprising to see here. Just a linear continuation of the times measured with 24k lines all the way up to 1M lines.
Now with a proper benchmark in place I was curious to see if there was any difference between using String and Substring when trying to find line ranges, so I changed all the test code to process a
Substring as input data instead and ran the benchmarks again.
What the heck is happening? All methods were sort of in the same ballpark in terms of runtime as the version that was using
lineRange was just exploding in run time when used on a
Substring. This was quite a shock! My first instinct was that there had to be some kind of mistake in my code but after looking at it countless times I could not find any obvious mistake.
I ran the
lineRange code in isolated builds as well and was getting the same results. Clearly something horrible was going on in the
lineRange implementation for substrings in Swift and I was the wrong person to figure out what.
lineRange was skewing the results I had to exclude it form the graph of the
Substring versions to get a better picture of the run times.
For substrings I could see that using direct indices was the fastest and
enumeratedLines was no longer the worst option. That would be both iterator versions instead. This was quite interesting to me but I also wanted a direct one to one comparison between the
Substring version of each implementation. In the following graphs the line labelled
before is for the
String version, and the line labelled
after is for the
Enumerate Lines - String vs. Substring
There was not much difference between using
Substring but in general I would avoid using this method as it is quite slow compared to other solutions.
Split - String vs. Substring
split was almost 20 milliseconds worse with a
Substring compared to a
String. However, for 1M lines of assembly code this difference is negligible.
Character Iterator - String vs. Substring
Using iterators on a
Substring come with performance implications. Not to mention that iterators are already slower than using plain indices on a
Zipped Character and Indices Iterator - String vs. Substring
Similar characteristics as a plain
Character iterator with an added penalty of using a second iterator.
Indices with Subscript - String vs. Substring
split, using indices with a subscript on a
Substring comes with slight performance implications. When dealing with
StringProtocol that could either be a
String or a
Substring I would recommend using subscripts with a
String.Index. This version is fast and predictable.
LineRange - String vs. Substring
What is there left to say. I would not hesitate to use
lineRange when dealing with a
String. But if working with a
StringProtocol that could either be a
Substring it would be best to avoid using
These results have helped to change the approach for my current project. Based on my first crude benchmarks I was content to use
split to get the line ranges and the substrings for each line in a string. I would then use the substring to parse the assembly code within. But now that I learned that
Substring comes with performance implications, I am no longer considering this approach. Instead I would use subscript with indices and leave the compiler to do the heavy lifting of optimising it.