DSM-9: The Best Tooling for Physical Flash Cards

DSM-9: Release Notes

Current version: v2.0

v2.0

To start, here’s the sum total of the (meaningful) changes since v1.1:

      v2: redesign API, experiment with implementation

      v2: implement due card tracking

      v2: implement logging

      v2: track new and due cards

      v2: implement card front and back extraction

      fully migrate to the v2 API and eradicate all traces of v1

      due: add -c and -o for count and offset control

      due: inform the user when there are no cards

      fixup exiting

      tmp: log: ignore cards marked due that aren’t

      Remove CPU pinning interface

      library: refactor and simplify code

      library: clean up comments

      sb: minor simplification

      minor formatting tweak for consistency

      Remove timing logic

      cli: minor formatting changes

      add TODO notes

      library: minor code cleanup

      due: remove unused header

      reformatting and switch to sysfatal

      library: reduce use of stretchy buffers

      Removes the usage of stretchy buffers and removes the header

      library: open logs on startup

      cli: fix formatting

      library: fix initial due list construction

      library: update comments

      library: remove unneeded parts of deck structure

      library: reduce size of two length variables to u16

      Update TODO notes

      library: implement simplistic due list sorting

      library: fail if an error occurs opening logs

      due: fix sysfatal output

      library: fix memory leaks

      fixup logging

      log,cli: remove unnecessary checks

      logging: fix in-memory easiness storage

      logging: minor simplification

      log: support overriding max card count

      posix: correctly print new line after fatal message

      cli: remove unused config value

      cli: make parameters consistent with other clients

      clients: improve argument parsing consistency

      library: expose dsm9find

      POSIX: link statically

The biggest change by far is the design and implementation of a revised API for the library. This major change reduces code size, simplifies the API (thus both making the existing clients simpler and making it easier to write new ones), and provides a massive speed boost on weaker hardware. Of course, the various clients (due,log,cli) were migrated to the new API as well.

Additionally, the clients have been changed a little bit to improve consistency. Lastly, some unwanted misfeatures have been removed. While there are a few bug fixes, most of them fix bugs which were introduced during the switch to the new API (but which, of course, never made it to a release!).

The v2 API

DSM-9 v1 had one major user-facing flaw: it was slow. It took a few seconds to load a reasonably large deck (eight seconds initially on a Cortex A55 with DDR2 and 512K of L2, 2.3 seconds on a Cortex A72 with 1M of L2 and DDR4), and even after being optimized, it took a solid chunk of time. While runtime speed was fine, startup latency was unacceptably low.

As such, one of the key considerations of the v2 design was improved performance, through a number of mechanisms.

Card Streaming

DSM-9 v1 loaded every card in the deck immediately upon startup. It would find the delimiters, extract the fields, and construct self-contained cards. DSM-9 v2 uses a streaming mechanism instead: on startup, it finds the card borders - i.e. line indices - but doesn’t process them at all. When a card is actually used, the card’s fields are found and extracted. This replaces a lot of work at startup with a tiny amount of work per card throughout usage.

Data Locality

tl;dr: moves from an array-of-struct approach to a struct-of-arrays approach, and optimizes card storage.

DSM-9 v1 viewed cards as atomic entities - that is, a card was self-contained, and included every piece of information. DSM-9 v2 instead views cards as a group of smaller pieces, and stores those pieces much differently. In v1, every part of a card was stored sequentially in memory, and a card took roughly 57 bytes (plus the string data for the various fields). That’s nearly a full cache line!

On a system with 32K of L1 data cache - such as both of my main computers - that allows a mere 512 cards to fit into cache at once, even before getting into the massive problems with strings. Since one of my decks is over 10M in size, the cache would be highly contended. Initialization would repeatedly use the entire cache, evicting all prior data very quickly.

v2 redesigns cards such that a single card takes a mere 23 bytes - less than half of the static size needed before, and significantly less once the strings are taken into account. Furthermore, the data isn’t stored as a single unit. Instead of storing one card, then another card, then another card, etc, it now splits the card into distinct arrays. For instance, the due timestamp of any given card is stored immediately after the due timestamp of the previous card. This allows 16 such timestamps to fit into a single cache line - when sorting cards by due date, for instance, a whopping 8K cards now fit into L1 cache, an improved factor of sixteen.

Current Bottlenecks

With a large log file, the savings from card streaming are largely nullified, as every card in the log currently needs to be streamed at startup in order to correctly match the log entry with the card. A solution to this is intended for v3, and will require changing the log file format. For 2.1, a hotfix replacing card streaming with field streaming is intended - that is, it will once again find every relevant index at startup, but postpone reading the actual contents until they are needed. If that ends up being sufficient, it may be decided that the complexity isn’t worthwhile for v3.

Loading is still easily the bottleneck for any short-lived client, such as dsm9/due. For due, loading comprises roughly 99.9% of runtime. For my test, with a large deck and small logs, processing the deck is the biggest part. However, with a large log file which uses more of the deck, this will cease to be the case, as the data streaming ends up becoming more harmful than helpful, as mentioned above.

Either way, the loading time is still easily the biggest bottleneck. Fixing the log file format, in addition to restoring the benefits of card streaming, will make this moot, as it will reduce the log loading time from O(log entries) to O(logged cards), and the deck loading time is already nearly as fast as it possibly can be without resorting to an assembly version.

Any client which reads the fields (name, front, or back) of every card (or many cards) on startup will suffer from the same costs to card streaming that are mentioned above in the context of logging. Streaming is actually slightly more expensive per card, but the benefit is that it makes it possible to only stream a single card at a time, which means that for a typical client it ends up being worthwhile.

Client improvements

In addition to migrating the various clients to the new API, a few minor changes have been made. The most important user-facing difference is that the command line arguments have been made consistent - for instance, to skip the first N cards, all three clients now recognize the -o N 

option.

Misfeature removal

The v1 clients used bespoke timing logic to track how long various tasks (deck loading, log loading, etc) took. To facilitate this, CPU pinning was required as the rpi3 9front kernel at the time did not correctly sync the cycle counter when processes were migrated between cores. While that issue has long since been fixed, I have not updated my kernel, so I kept the logic around. That logic broke anyways as the libc was updated to expect the newer kernel, causing permissions issues. At that point, I realized that I’d been using a real profiler for measurements anyways, the timing logic was platform-specific by design, and there was no real need to keep it around. As such, the timing logic (and the CPU pinning which underpinned [heh] it) is now gone.

Additionally, DSM-9 "instances" have been replaced with decks. To handle multiple decks, a client should load each deck separately and handle the relevant logic. (A later 2.x version may include such multiplexing logic.) The complexity that was previously required made every single card more complex, had performance issues, and was not even used.