To start, here’s the sum total of the (meaningful) changes since v1.1:
v2: redesign API, experiment with implementation
v2: implement due card tracking
v2: implement logging
v2: track new and due cards
v2: implement card front and back extraction
fully migrate to the v2 API and eradicate all traces of v1
due: add -c and -o for count and offset control
due: inform the user when there are no cards
tmp: log: ignore cards marked due that aren’t
Remove CPU pinning interface
library: refactor and simplify code
library: clean up comments
sb: minor simplification
minor formatting tweak for consistency
Remove timing logic
cli: minor formatting changes
add TODO notes
library: minor code cleanup
due: remove unused header
reformatting and switch to sysfatal
library: reduce use of stretchy buffers
Removes the usage of stretchy buffers and removes the header
library: open logs on startup
cli: fix formatting
library: fix initial due list construction
library: update comments
library: remove unneeded parts of deck structure
library: reduce size of two length variables to u16
Update TODO notes
library: implement simplistic due list sorting
library: fail if an error occurs opening logs
due: fix sysfatal output
library: fix memory leaks
log,cli: remove unnecessary checks
logging: fix in-memory easiness storage
logging: minor simplification
log: support overriding max card count
posix: correctly print new line after fatal message
cli: remove unused config value
cli: make parameters consistent with other clients
clients: improve argument parsing consistency
library: expose dsm9find
POSIX: link statically
The biggest change by far is the design and implementation of a revised API for the library. This major change reduces code size, simplifies the API (thus both making the existing clients simpler and making it easier to write new ones), and provides a massive speed boost on weaker hardware. Of course, the various clients (due,log,cli) were migrated to the new API as well.
Additionally, the clients have been changed a little bit to improve consistency. Lastly, some unwanted misfeatures have been removed. While there are a few bug fixes, most of them fix bugs which were introduced during the switch to the new API (but which, of course, never made it to a release!).
The v2 API
DSM-9 v1 had one major user-facing flaw: it was slow. It took a few seconds to load a reasonably large deck (eight seconds initially on a Cortex A55 with DDR2 and 512K of L2, 2.3 seconds on a Cortex A72 with 1M of L2 and DDR4), and even after being optimized, it took a solid chunk of time. While runtime speed was fine, startup latency was unacceptably low.
As such, one of the key considerations of the v2 design was improved performance, through a number of mechanisms.
DSM-9 v1 loaded every card in the deck immediately upon startup. It would find the delimiters, extract the fields, and construct self-contained cards. DSM-9 v2 uses a streaming mechanism instead: on startup, it finds the card borders - i.e. line indices - but doesn’t process them at all. When a card is actually used, the card’s fields are found and extracted. This replaces a lot of work at startup with a tiny amount of work per card throughout usage.
tl;dr: moves from an array-of-struct approach to a struct-of-arrays approach, and optimizes card storage.
DSM-9 v1 viewed cards as atomic entities - that is, a card was self-contained, and included every piece of information. DSM-9 v2 instead views cards as a group of smaller pieces, and stores those pieces much differently. In v1, every part of a card was stored sequentially in memory, and a card took roughly 57 bytes (plus the string data for the various fields). That’s nearly a full cache line!
On a system with 32K of L1 data cache - such as both of my main computers - that allows a mere 512 cards to fit into cache at once, even before getting into the massive problems with strings. Since one of my decks is over 10M in size, the cache would be highly contended. Initialization would repeatedly use the entire cache, evicting all prior data very quickly.
v2 redesigns cards such that a single card takes a mere 23 bytes - less than half of the static size needed before, and significantly less once the strings are taken into account. Furthermore, the data isn’t stored as a single unit. Instead of storing one card, then another card, then another card, etc, it now splits the card into distinct arrays. For instance, the due timestamp of any given card is stored immediately after the due timestamp of the previous card. This allows 16 such timestamps to fit into a single cache line - when sorting cards by due date, for instance, a whopping 8K cards now fit into L1 cache, an improved factor of sixteen.
With a large log file, the savings from card streaming are largely nullified, as every card in the log currently needs to be streamed at startup in order to correctly match the log entry with the card. A solution to this is intended for v3, and will require changing the log file format. For 2.1, a hotfix replacing card streaming with field streaming is intended - that is, it will once again find every relevant index at startup, but postpone reading the actual contents until they are needed. If that ends up being sufficient, it may be decided that the complexity isn’t worthwhile for v3.
Loading is still easily the bottleneck for any short-lived client, such as dsm9/due. For due, loading comprises roughly 99.9% of runtime. For my test, with a large deck and small logs, processing the deck is the biggest part. However, with a large log file which uses more of the deck, this will cease to be the case, as the data streaming ends up becoming more harmful than helpful, as mentioned above.
Either way, the loading time is still easily the biggest bottleneck. Fixing the log file format, in addition to restoring the benefits of card streaming, will make this moot, as it will reduce the log loading time from O(log entries) to O(logged cards), and the deck loading time is already nearly as fast as it possibly can be without resorting to an assembly version.
Any client which reads the fields (name, front, or back) of every card (or many cards) on startup will suffer from the same costs to card streaming that are mentioned above in the context of logging. Streaming is actually slightly more expensive per card, but the benefit is that it makes it possible to only stream a single card at a time, which means that for a typical client it ends up being worthwhile.
In addition to migrating the various clients to the new API, a few minor changes have been made. The most important user-facing difference is that the command line arguments have been made consistent - for instance, to skip the first N cards, all three clients now recognize the -o N
The v1 clients used bespoke timing logic to track how long various tasks (deck loading, log loading, etc) took. To facilitate this, CPU pinning was required as the rpi3 9front kernel at the time did not correctly sync the cycle counter when processes were migrated between cores. While that issue has long since been fixed, I have not updated my kernel, so I kept the logic around. That logic broke anyways as the libc was updated to expect the newer kernel, causing permissions issues. At that point, I realized that I’d been using a real profiler for measurements anyways, the timing logic was platform-specific by design, and there was no real need to keep it around. As such, the timing logic (and the CPU pinning which underpinned [heh] it) is now gone.
Additionally, DSM-9 "instances" have been replaced with decks. To handle multiple decks, a client should load each deck separately and handle the relevant logic. (A later 2.x version may include such multiplexing logic.) The complexity that was previously required made every single card more complex, had performance issues, and was not even used.