Efficiently reading a CSV of floats in Rust
In hpstat, we often need to read large CSV files made up of a header row of string column names followed by data consisting entirely of floating-point numbers. Profiling reveals that a naïve approach based on a generic CSV parser is inefficient at this task.
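For concreteness, such a file might look like the following excerpt (a purely hypothetical example; the column names and values are illustrative only, not taken from a real hpstat dataset):

    time,voltage
    0.0,1.25
    0.1,1.31
    0.2,inf
    0.3,0.97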
Consider the following naïve approach using the Rust csv library:
use csv::Reader;

fn main() {
    for _ in 0..100 {
        let mut csv_reader = Reader::from_path("/path/to/file.csv").unwrap();
        let headers = csv_reader.headers().unwrap().clone();

        // For reasons beyond the scope of this article, it is acceptable to
        // read the data into a 1-dimensional structure
        let mut records = Vec::new();
        for record in csv_reader.records() {
            for field in &record.unwrap() {
                records.push(parse_float(field));
            }
        }

        println!("{:?}", headers);
        println!("{:?}", records);
    }
}

fn parse_float(s: &str) -> f64 {
    let value = match s {
        "inf" => f64::INFINITY,
        _ => s.parse().expect("Malformed float")
    };
    return value;
}
For illustration, the above code is run on a real-world 371 KiB CSV dataset with 2 columns and 16639 data rows. Across 5 trials on an Intel Core i5-7500, the mean (±SE) execution time is 0.665 (±0.004) seconds.
Profiling this code reveals that 6.56% of cycles, a not insignificant overhead, are spent in StringRecord::clone_truncated, and of these a substantial proportion is spent in memory allocations. This is because iterating over csv_reader.records() yields a freshly allocated, owned StringRecord for every row. This work is superfluous, since each field is immediately parsed as a float and the record is then discarded.
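As an aside, much of this per-row allocation can be amortised while staying within the csv crate, by reading into a single reused StringRecord via Reader::read_record rather than iterating over records(). The following is a minimal sketch of that pattern (it is not the approach benchmarked in this article), reusing the parse_float helper from the listing above:

    use csv::{Reader, StringRecord};

    fn main() {
        let mut csv_reader = Reader::from_path("/path/to/file.csv").unwrap();
        let headers = csv_reader.headers().unwrap().clone();

        // A single StringRecord is reused for every row, so its internal
        // buffers are allocated once and then recycled
        let mut record = StringRecord::new();
        let mut records = Vec::new();
        while csv_reader.read_record(&mut record).unwrap() {
            for field in record.iter() {
                records.push(parse_float(field));
            }
        }

        println!("{:?}", headers);
        println!("{:?}", records);
    }

    // Same helper as in the listing above
    fn parse_float(s: &str) -> f64 {
        let value = match s {
            "inf" => f64::INFINITY,
            _ => s.parse().expect("Malformed float")
        };
        return value;
    }

This removes the per-row record allocation, but the cost of the generic parsing machinery remains.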
As in a previous discussion of the integer case, we can avoid this overhead with a custom parser. Floats are slightly more complicated than integers. Every substring of an integer is itself a valid integer, so an integer can be accumulated digit by digit without ever materialising an intermediate string; the analogous accumulation for floats introduces floating-point imprecision and rounding error, so each float field must still be handed to the parser as a complete string. Nevertheless, we can avoid unnecessary allocations by maintaining a single String buffer for the entire process and reusing it for every line read from the file.
The relevant code is presented below:
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    for _ in 0..100 {
        let mut reader = BufReader::new(File::open("/path/to/file.csv").expect("IO error"));

        // Reuse a single buffer to avoid unnecessary allocations: owned
        // copies are needed only for the headers, while the data are
        // parsed directly to floats
        let mut buffer = String::new();

        // Read header
        let headers = read_row_as_strings(&mut reader, &mut buffer);

        // Read data
        let mut records = Vec::new();
        let mut row = Vec::new();
        loop {
            if read_row_as_floats(&mut reader, &mut buffer, &mut row) {
                if row.len() != headers.len() { /* ... */ }
                records.append(&mut row);
            } else {
                // EOF
                break;
            }
        }

        println!("{:?}", headers);
        println!("{:?}", records);
    }
}

fn read_row_as_strings<R: BufRead>(reader: &mut R, buffer: &mut String) -> Vec<String> {
    buffer.clear();
    let bytes_read = reader.read_line(buffer).expect("IO error");
    if bytes_read == 0 { /* ... */ }

    let mut result = Vec::new();
    let mut entries_iter = buffer.trim().split(',');
    loop {
        if let Some(entry) = entries_iter.next() {
            if entry.starts_with('"') {
                /* ... */
            } else {
                result.push(String::from(entry));
            }
        } else {
            // EOL
            break;
        }
    }
    return result;
}

fn read_row_as_floats<R: BufRead>(reader: &mut R, buffer: &mut String, row: &mut Vec<f64>) -> bool {
    buffer.clear();
    let bytes_read = reader.read_line(buffer).expect("IO error");
    if bytes_read == 0 {
        // EOF
        return false;
    }

    // str::split yields &str slices rather than owned Strings, avoiding
    // unnecessary allocations
    let mut entries_iter = buffer.trim().split(',');
    loop {
        if let Some(entry) = entries_iter.next() {
            if entry.starts_with('"') {
                /* ... */
            } else {
                row.push(parse_float(entry));
            }
        } else {
            // EOL
            break;
        }
    }
    return true;
}

fn parse_float(s: &str) -> f64 {
    let value = match s {
        "inf" => f64::INFINITY,
        _ => s.parse().expect("Malformed float")
    };
    return value;
}
On the same dataset, the mean (±SE) execution time is 0.565 (±0.002) seconds: a reduction of roughly 15% in execution time, or equivalently an increase in throughput of roughly 18% (0.665 / 0.565 ≈ 1.18).
The full CSV reader implementation for hpstat is available here.