1BRC - Generating the test data
I have decided to try this challenge with the C Programming Language, since that is my language of choice these days. I won't be using any of the C standard library functions and instead will be using my own functions that provide me with similar functionality. There is no specific reason for this. I just don't like using the C standard library. This project is written and compiled for Windows only. Specifically Windows 11 Home 64-bit - 23H2 - 22631.3880
. My compiler of choice will be Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x64
.
The first step for this challenge is to generate the test data. A single text file named measurements.txt
that contains multiple lines of text. 1 Billion lines to be exact. Each line contains the name of a location, followed by a semicolon as a separator, which is followed by a temperature value. The name can contain any UTF-8 characters. The temperature can either be positive or negative, and there is only one fractional digit after the decimal point.
The generation of the test data file happens in a single function named RunGenerator
. This is being called by the projects main
function (or ProgramStart
in this project) if the right command line argument is passed to the executable. At the moment there are three valid command line arguments: generate
, benchmark
, and run
. I will talk about these in more detail in the upcoming posts. But if generate
is passed to the executable, it will call the RunGenerator
function that generates the test data file.
First, I create an empty measurements.txt
file by calling my own CreateFileToWrite
function. This function calls CreateFileA
from Windows.h
with the GENERIC_WRITE
and CREATE_ALWAYS
flags set, and returns the resulting file handle. It will also exit the program with an error code if the call to CreateFileA
fails for some reason.
StringLiteral
is one of my macros that will create a string
struct for any given string literal. Almost all of my functions that work with strings do so with a string
type instead of a raw char *
. This is because I like to have the length of a string available to me instead of having to call strlen
every time.
As a quick side note: u8
is a typedef to unsigned char
and u32
to unsigned long
.
I do not really care about the performance of the generation code as it will be only run very few times to generate the test data. But, I do know that having multiple calls to write into a file in very small chunks, say one write call for each line of text, is very slow. It is better to collect multiple lines of text into a buffer first and then write the whole buffer into the file with a single call. So after creating the file I am allocating a new file buffer of about 128 KiB. The size is arbitrarily chosen. It might be better to just allocate a buffer with the same size as the hard drive sector size. But as I said before, I don't really care about performance of the generator so I just leave it at 128 KiB for now.
The file buffer just holds the base pointer to a memory location, the overall capacity of the buffer, and the current position of the cursor as an offset. The position basically indicates how much has been written into the buffer so far.
Creating the test data is just a matter of looping for 1 billion times, and writing a new line of measurement into the file buffer on each iteration. The file buffer needs to be flushed once it is full (the contents need to be written to file), and then it can be reused for more lines. All of this happens within my helper functions so the loop in itself will look something like this:
Each line of text that I am writing into the file needs have a random location name, as well as a random temperature value. The same name and temperature can appear more than once. To solve this I have created a weather_station
struct that holds the name and temperature.
All possible weather stations are then stored in a global array. Each station also received a base temperature. I have used the same data as Danny van Kooten which you can find here. I have not looked at any of their solutions to the challenge and only used their create-sample.c
file to get the weather station data out of it. Everything else is my own. Unfortunately, you can only take my word for it as there is no way for me to prove this.
To get a station name within the loop I am getting a random value between 0 and the number of items within the g_weather_station_data
array. This number is then used as the index into the array to retrieve the name and base temperature.
RandomZeroRange64
returns a number between 0 and a given inclusive maximum. Because I am not using the C standard library I had to implement my own random function. For this I have been using a 64-bit Xorshift by Marsaglia. The implementation looks something like this:
The temperature value also has to be randomised. For this I have a little function that takes in the base temperature of a weather station from the g_weather_station_data
array. It then sets the range of the temperature to be within +/-5 degrees of the base temperature and picks a random value within that range.
With the station name and temperature I can now create a string that represents the line of text to be written into the file buffer. Since I am not using the C standard library I can not call fprintf to format the string. Instead I have created a function called StringFormat
that does the same thing. Internally it is using stb_sprintf from Sean Barrett (Originally written by Jeff Roberts). This is also the only external dependency used in this project (apart from the Windows dependencies of course).
And that is all there is to it. I have added timestamps before and after the loop so I can calculate the time it takes to generate the output file. And because writing 1 billion lines of text takes a couple of minutes on my machine, I have also added a status output to the terminal for every 1 millions lines written.
When running the executable with the generate
argument it will create a roughly 13 GiB large text file in about 3 minutes. I am running this on a AMD Ryzen 7 7700 8Core @ 3.8GHz
. I am sure this could be sped up a lot as well but the focus of this challenge is the reading of the data, and not the writing. However, I might come back to the generation after I am done with the reading part of the challenge. I am sure there is a lot to learn here in tying to make this as fast as possible as well.
Download
- Project Source Code (50.9 KiB)
SHA256 Checksum: 353bee6a7132b1957ca04ccc11354153de70cb6207991e0465f0a202cee11350