Earlier this month, I replaced OpenTally's previous naive string-manipulation-based BLT parser with one using pest. A new parser was necessary to support extensions to the BLT format such as comments, and format variations generated by some other software (such as optional newlines).
Under the pest-based parser, the mean time (±95%CI) to parse and count the 4,492,197-vote 2016 New South Wales Senate election increased from 7.90 (±0.01) seconds to 10.8 (±0.1) seconds. This is not insubstantial, but probably acceptable given the increased functionality of the new parser.
What is shocking, however, is that the memory requirement appeared to be significantly larger under the pest-based parser, with the mean peak memory usage (±95%CI) increased from 1523.4 (±0.8) MiB to 4892.58 (±0.02) MiB, over 3 times as large. Despite continual advances in computer memory technology, a requirement of nearly 5 GiB of RAM to count an election is probably too much at this time.
Pest is not a streaming parser, meaning that it requires the entire payload to be loaded into memory before it can be parsed. However, given that the raw BLT file is only 223.1 MiB large, this is probably unlikely to account for much of the nearly 5 GiB of RAM used. Nevertheless, I did try to create an alternative implementation which mapped (rather than copied) the file into memory using memmap, but this did not affect the memory usage, and if anything only worsened the performance.
The most popular parser framework for Rust is nom. Nom does claim to support streaming input, but there is extremely limited documentation and implementing this appears to be nontrivial. Instead, I turned to another parser framework, combine, which has out-of-the-box constructs for streaming from files.
Whereas pest is a parser generator, taking a grammar defined in its own syntax from a separate file, combine is a parser combinator, and each component of the grammar is expressed in Rust logic. I was afraid that this would complicate writing the grammar, but happily, after a few false starts figuring out how to express grammars in combine, it took only less than 200 lines of Rust to express.
In a testing implementation responsible only for parsing the abovementioned 4,492,197-vote BLT file, the pest-based parser had a mean execution time (±95%CI) of 4.90 (±0.01) seconds, and a mean peak memory usage (±95%CI) of 4697.27 (±0.03) MiB.
By comparison, the combine-based parser had a mean execution time (±95%CI) of 22.7 (±0.2) seconds, and a mean peak memory usage (±95%CI) of 913.09 (±0.03) MiB.
While the combine-based parser did, as hoped, have a much smaller memory footprint, the performance was unacceptably poor, with a nearly 80% reduction in speed. Now I can't rule out that I might have made a mistake in specifying the grammar in the combine-based parser leading to poor performance, but this result was sufficiently disheartening to discourage me from further pursuing combine.
Now, by inspection, it is known that the grammar of a BLT file is LL(1), which means that it can be parsed by a simple handwritten predictive recursive descent parser. This would also allow the greatest degree of control over streaming the input data, which is ideal for our use case.
The handwritten BLT parser can be found here. Not having any previous experience writing parsers by hand, I was afraid this would be a complex endeavour, but happily the parser is not that much longer than the combine-based parser, clocking in at just over 300 lines of code.
In the testing implementation, looking only at parsing performance, the handwritten parser had a mean execution time (±95%CI) of 5.99 (±0.04) seconds, and a mean peak memory usage (±95%CI) of 913.09 (±0.08) MiB. This result is highly respectable, with equivalent memory consumption to the combine-based parser, but significantly better performance only a little slower than the pest-based parser.
In a complete benchmark involving both parsing and counting the election, the handwritten parser had a mean execution time (±95%CI) of 11.7 (±0.02) seconds, and a mean peak memory usage of 1523.44 (±0.06) MiB. Again, the result is respectable, with equivalent memory consumption to the original string-manipulation-based parser.
The handwritten parser is slower than the original string-manipulation-based parser, but this is acceptable given the increased complexity of the BLT specification. The handwritten parser is also slightly slower than the pest-based parser, but the significant improvement in memory footprint well justifies this tradeoff.
The handwritten parser also has the advantage of easily producing meaningful error messages, with line/column numbers and useful information about the specific syntax error, which would have been somewhat more complicated to customise in pest or combine.