Line ranges in Swift - Redux
Last week I published a post about finding the fastest way to retrieve line ranges from a large string. The way to determine which method was the fastest was very crude and I had this nagging feeling that it was also wrong. To save you the time of reading the last post again here is a quick summary. I processed a file of about 24k lines of assembly code that contained about 15 characters per line on average. I wrote about 6 different ways of finding the half-open ranges of each line, measured their runtime, and came to the following results:
- Using
String.Index
with a subscript: 24 milliseconds - Using a
Character
iterator and calculating the next index: 19 milliseconds - Zipping the
Character
andIndices
iterator together: 15 milliseconds - Using
enumerateLines
: 18 milliseconds - Using
lineRange
: 14 milliseconds - Using
split
: 14 milliseconds
These results meant that using lineRange
or split
were the fastest options but I can now say that those results were indeed wrong. The reasons for this were the very crude nature of my tests and some glaring errors regarding Xcode build optimisations which can be boiled down to my inexperience with Xcode. I ran all the tests again with a benchmarking framework and build the project from the command line with optimisations enabled. And this time I got the following results:
Before talking about the results I have to mention that the code has not been modified since last week but I have switched to a MacMini with an M2 Pro Processor.
These new results confirm that using lineRange
is still the fastest way of finding all line ranges in a given string. But I was quite surprised to see that using direct indices via a subscript is not the slowest method. That would be enumeratedLines
. Using indices is actually the second fastest method to retrieve line ranges in a string. It goes to show: Do not trust crude benchmarks within Xcode. As a second exercise I also increased the amount of data to 1M lines of assembly code.
Nothing surprising to see here. Just a linear continuation of the times measured with 24k lines all the way up to 1M lines.
Now with a proper benchmark in place I was curious to see if there was any difference between using String and Substring when trying to find line ranges, so I changed all the test code to process a Substring
as input data instead and ran the benchmarks again.
What the heck is happening? All methods were sort of in the same ballpark in terms of runtime as the version that was using String
. But lineRange
was just exploding in run time when used on a Substring
. This was quite a shock! My first instinct was that there had to be some kind of mistake in my code but after looking at it countless times I could not find any obvious mistake.
I ran the lineRange
code in isolated builds as well and was getting the same results. Clearly something horrible was going on in the lineRange
implementation for substrings in Swift and I was the wrong person to figure out what.
Because lineRange
was skewing the results I had to exclude it form the graph of the Substring
versions to get a better picture of the run times.
For substrings I could see that using direct indices was the fastest and enumeratedLines
was no longer the worst option. That would be both iterator versions instead. This was quite interesting to me but I also wanted a direct one to one comparison between the String
and Substring
version of each implementation. In the following graphs the line labelled before
is for the String
version, and the line labelled after
is for the Substring
version.
Enumerate Lines - String vs. Substring
There was not much difference between using enumerateLines
on String
or Substring
but in general I would avoid using this method as it is quite slow compared to other solutions.
Split - String vs. Substring
Using split
was almost 20 milliseconds worse with a Substring
compared to a String
. However, for 1M lines of assembly code this difference is negligible.
Character Iterator - String vs. Substring
Using iterators on a Substring
come with performance implications. Not to mention that iterators are already slower than using plain indices on a String
.
Zipped Character and Indices Iterator - String vs. Substring
Similar characteristics as a plain Character
iterator with an added penalty of using a second iterator.
Indices with Subscript - String vs. Substring
Similar to split
, using indices with a subscript on a Substring
comes with slight performance implications. When dealing with StringProtocol
that could either be a String
or a Substring
I would recommend using subscripts with a String.Index
. This version is fast and predictable.
LineRange - String vs. Substring
What is there left to say. I would not hesitate to use lineRange
when dealing with a String
. But if working with a StringProtocol
that could either be a String
or Substring
it would be best to avoid using lineRange
altogether.
Conclusion
These results have helped to change the approach for my current project. Based on my first crude benchmarks I was content to use split
to get the line ranges and the substrings for each line in a string. I would then use the substring to parse the assembly code within. But now that I learned that Substring
comes with performance implications, I am no longer considering this approach. Instead I would use subscript with indices and leave the compiler to do the heavy lifting of optimising it.