2024-08-15

1BRC - Generating the test data

I have decided to try this challenge with the C Programming Language, since that is my language of choice these days. I won't be using any of the C standard library functions and instead will be using my own functions that provide me with similar functionality. There is no specific reason for this. I just don't like using the C standard library. This project is written and compiled for Windows only. Specifically Windows 11 Home 64-bit - 23H2 - 22631.3880. My compiler of choice will be Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33135 for x64.

The first step for this challenge is to generate the test data. A single text file named measurements.txt that contains multiple lines of text. 1 Billion lines to be exact. Each line contains the name of a location, followed by a semicolon as a separator, which is followed by a temperature value. The name can contain any UTF-8 characters. The temperature can either be positive or negative, and there is only one fractional digit after the decimal point.

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6

And example of the data within measurements.txt.

The generation of the test data file happens in a single function named RunGenerator. This is being called by the projects main function (or ProgramStart in this project) if the right command line argument is passed to the executable. At the moment there are three valid command line arguments: generate, benchmark, and run. I will talk about these in more detail in the upcoming posts. But if generate is passed to the executable, it will call the RunGenerator function that generates the test data file.

First, I create an empty measurements.txt file by calling my own CreateFileToWrite function. This function calls CreateFileA from Windows.h with the GENERIC_WRITE and CREATE_ALWAYS flags set, and returns the resulting file handle. It will also exit the program with an error code if the call to CreateFileA fails for some reason.

void *file = CreateFileToWrite(StringLiteral("measurements.txt"));

Creating a new file for writing.

StringLiteral is one of my macros that will create a string struct for any given string literal. Almost all of my functions that work with strings do so with a string type instead of a raw char *. This is because I like to have the length of a string available to me instead of having to call strlen every time.

typedef struct STRING string;
struct STRING {
    u8 *base;
    u32 length;
};

My string type.

As a quick side note: u8 is a typedef to unsigned char and u32 to unsigned long.

I do not really care about the performance of the generation code as it will be only run very few times to generate the test data. But, I do know that having multiple calls to write into a file in very small chunks, say one write call for each line of text, is very slow. It is better to collect multiple lines of text into a buffer first and then write the whole buffer into the file with a single call. So after creating the file I am allocating a new file buffer of about 128 KiB. The size is arbitrarily chosen. It might be better to just allocate a buffer with the same size as the hard drive sector size. But as I said before, I don't really care about performance of the generator so I just leave it at 128 KiB for now.

file_buffer buffer = FileBuffer(arena, KiB(128));

Allocating a file buffer of 128 KiB.

The file buffer just holds the base pointer to a memory location, the overall capacity of the buffer, and the current position of the cursor as an offset. The position basically indicates how much has been written into the buffer so far.

Creating the test data is just a matter of looping for 1 billion times, and writing a new line of measurement into the file buffer on each iteration. The file buffer needs to be flushed once it is full (the contents need to be written to file), and then it can be reused for more lines. All of this happens within my helper functions so the loop in itself will look something like this:

u64 count = Billion(1);
for (u64 i = 0; i < count; i++) {
    string line = ...
    WriteStringToFileBuffer(file, &buffer, line);
}

The test data generation loop.

Each line of text that I am writing into the file needs have a random location name, as well as a random temperature value. The same name and temperature can appear more than once. To solve this I have created a weather_station struct that holds the name and temperature.

typedef struct WEATHER_STATION weather_station;
struct WEATHER_STATION {
    string name;
    f64 temperature; // f64 is a double
};

The weather station struct.

All possible weather stations are then stored in a global array. Each station also received a base temperature. I have used the same data as Danny van Kooten which you can find here. I have not looked at any of their solutions to the challenge and only used their create-sample.c file to get the weather station data out of it. Everything else is my own. Unfortunately, you can only take my word for it as there is no way for me to prove this.

static const weather_station g_weather_station_data[] = {
    { StringLiteralConst("Abha"), 18.0 },
    { StringLiteralConst("Abidjan"), 26.0 },
    ...
    { StringLiteralConst("Zanzibar City"), 26.0 },
    { StringLiteralConst("Zürich"), 9.3 },
};

A slice of the static weather station data.

To get a station name within the loop I am getting a random value between 0 and the number of items within the g_weather_station_data array. This number is then used as the index into the array to retrieve the name and base temperature.

u64 station_count = ArrayCount(g_weather_station_data);
u64 slot = RandomZeroRange64(station_count - 1);

Getting the random index into the g_weather_station_data array.

RandomZeroRange64 returns a number between 0 and a given inclusive maximum. Because I am not using the C standard library I had to implement my own random function. For this I have been using a 64-bit Xorshift by Marsaglia. The implementation looks something like this:

static u64 g_random_state_u64;

u64 Random64(void) {
    u64 result = g_random_state_u64;
    
    result ^= result << 13;
    result ^= result >> 7;
    result ^= result << 17;
    
    g_random_state_u64 = result;

    return (result);
}

u64 RandomZeroRange64(u64 max) {
    u64 result = Random64() % (max + 1);
    return (result);
}

Implementation of Xorshift.

The temperature value also has to be randomised. For this I have a little function that takes in the base temperature of a weather station from the g_weather_station_data array. It then sets the range of the temperature to be within +/-5 degrees of the base temperature and picks a random value within that range.

f64 CalculateTemperature(f64 temperature) {
    f64 scale = RandomFloat64(); // Returns a value between 0.0 and 1.0
    f64 min = temperature - 5.0;
    f64 max = temperature + 5.0;

    f64 result = min + scale * ( max - min );
    return (result);
}

The temperature calculation.

With the station name and temperature I can now create a string that represents the line of text to be written into the file buffer. Since I am not using the C standard library I can not call fprintf to format the string. Instead I have created a function called StringFormat that does the same thing. Internally it is using stb_sprintf from Sean Barrett (Originally written by Jeff Roberts). This is also the only external dependency used in this project (apart from the Windows dependencies of course).

string line = StringFormat(arena, "%.*s;%.1f\n",
    StringToFormatArgs(g_weather_station_data[slot].name), temperature);

WriteStringToFileBuffer(file, &buffer, line);

Formatting the string and writing it into the file buffer.

And that is all there is to it. I have added timestamps before and after the loop so I can calculate the time it takes to generate the output file. And because writing 1 billion lines of text takes a couple of minutes on my machine, I have also added a status output to the terminal for every 1 millions lines written.

When running the executable with the generate argument it will create a roughly 13 GiB large text file in about 3 minutes. I am running this on a AMD Ryzen 7 7700 8Core @ 3.8GHz. I am sure this could be sped up a lot as well but the focus of this challenge is the reading of the data, and not the writing. However, I might come back to the generation after I am done with the reading part of the challenge. I am sure there is a lot to learn here in tying to make this as fast as possible as well.

Download

Project Source Code (50.9 KiB)
SHA256 Checksum: 353bee6a7132b1957ca04ccc11354153de70cb6207991e0465f0a202cee11350