Something we all have to do at one point or another is parse a file. Usually, it's simple enough. Import a parser for JSON, CSV, TOML, or whatever and run parser.parse(file). The same holds true for writing one of these tried and true formats, but sometimes that doesn't cut it.

In some instances, we need our own special format, whether that's to save disk space, improve read/write times, delay loading of certain data, improve human readability, or because our boss just hates us; the possibilities are truly endless. In other cases, we might not need our own special format per se. Instead, we need to write specific data to disk to be retrieved later on. Think serialization. In this latter case, your best bet is probably an off-the-shelf, tried-and-true format, and the following tip applies just as much to this simpler case.

Every single time you write a reader, you should also be writing a writer and vice versa.

A cautionary tale

When I started my work at Verizon, one of my first tasks involved working on the camera calibration system. Essentially, we had all kinds of fun OpenCV code to find the positions and orientations of 32 cameras (16 stereo pairs) from a few pictures of a structured target. After all that hard work was done, we wrote out the camera data to a file with a little snippet that looked something like this.

std::ofstream fileOut("cameras.json");
for (auto const& camera : cameras) {
  fileOut
    << "{\n"
    << "  Id: " << camera.id << "\n"
    << "  Position: " << camera.position << "\n"
    << "  Rotation: " << camera.rotation << "\n"
    << "}\n";
}

Then, in a completely separate application with a completely separate repo, we would inline a reader for this file, with all the ugly logic that entails. The team had been doing this for years by the time I arrived.

Within weeks of starting the job, I noticed a pattern. Every morning, Engineer A runs a test. Everything is broken. Engineer A debugs and finds that the application breaks as soon as we load the calibration file (running the debugger was actually required, since no errors were printed or logged for this case). Then, Engineer A mentions the problem to Engineer B. Finally, Engineer B looks into it and tells Engineer A they need to be parsing a new item that was added the previous day. This was an everyday occurrence.

A resolution to our story

It's first important to understand what went so horribly wrong in the scenario above. First of all, we had a custom file format (i.e., not JSON, YAML, XML, etc.). It was a format extremely reminiscent of JSON, but it broke a few rules and so was custom nonetheless. That alone isn't a big deal, since, as I said before, sometimes there's a benefit to having your own file format. In this case, though, there was no real reason. It's just how it was done at the time, most likely hacked together during R&D, of which this team did a lot, semi-understandably.

The other, far more severe issue was a lack of synchronization between the reader and writer code. Living in separate codebases, one could easily be updated without the other, introducing a bug the moment they diverged.

Finally, as any good software engineer should already be aware, code needs to be tested! Furthermore, code should be written with testing in mind. A test case could have caught this problem at build time, long before we ever deployed to our testing environment. Of course, given the architecture of the old code, testability wasn't exactly built in.

Use an industry standard format

My answer to the former part of this problem was to grab an off-the-shelf, open source JSON parser. More specifically, I landed on nlohmann-json. Using nlohmann-json takes care of most of the dirty work of JSON formatting. All you have to do is tell it how to represent your specific type. The library even includes a few macros to specify JSON formats for simple structs automatically. My case was a bit more complicated, so I needed to manually define the fields. With that done, what we had essentially boiled down to the following.

// Write
std::ofstream fileOut("cameras.json");
fileOut << nlohmann::json(cameras).dump(2);

// Read
std::ifstream fileIn("cameras.json");
auto const cameras = nlohmann::json::parse(fileIn).get<std::vector<Camera>>();

Notice that once we specify the JSON representation for nlohmann-json, we get the writer and the reader pretty much for free, which is where the final piece of our fix comes in.

Keep reader and writer code together

If our camera calibration library generates a certain kind of file, there's really no reason we should expect someone else to just know how to parse that file. Instead, we exposed a set of public objects in our calibration library's interface and added public methods to read and write those objects. This way, any application that uses this file type just needs to add the calibration library as a dependency. In organizing our code this way, we always keep our reader and writer in sync.

A note on testing

Something to notice about this approach is that it makes your file parsers extremely simple to test. How do we know a reader/writer pair is any good? Simple. The test body looks something like the following pseudocode, which is to say, your object should be able to go from object to string (or raw bytes) and back completely unchanged.

assert(A == ClassA::from_string(A.to_string()));
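In C++ terms, and assuming a ClassA with to_string, from_string, and operator== (hypothetical names carried over from the pseudocode above), the round-trip test looks like this:

```cpp
#include <sstream>
#include <string>

// A stand-in for ClassA from the pseudocode above.
struct ClassA {
  int id{};
  double value{};

  bool operator==(ClassA const& o) const {
    return id == o.id && value == o.value;
  }

  std::string to_string() const {
    std::ostringstream out;
    out << id << ' ' << value;
    return out.str();
  }

  static ClassA from_string(std::string const& s) {
    std::istringstream in(s);
    ClassA a;
    in >> a.id >> a.value;
    return a;
  }
};
```

One caveat for floating-point fields: raise the stream precision (e.g. std::setprecision(std::numeric_limits<double>::max_digits10)) or the round trip will fail for values that don't print exactly at the default precision.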

TL;DR

  • If you don't need a custom file format, don't create one.

  • If you make a file reader, make the writer too!

    • This keeps all of the serialization and/or formatting details in one place.

    • This also makes testing extremely simple.