Foreground segmentation with JBIG2 for improved PDF compression: pdf-segmented
Motivation
JBIG2 is an efficient image compression format for bi-level (bi-tonal) images, which is supported by the PDF file format and common PDF viewers. To date, tooling for producing documents using JBIG2 (particularly open source tooling) has been limited; I have previously presented file-jbig2pdf, a GIMP plugin for generating PDF files using JBIG2. This suffices for simple black-and-white documents, but does not assist when documents contain both black-and-white text and colour graphics.
In the DjVu format, designed specifically for archival, scanned documents can be separated into a black-and-white foreground layer, which is losslessly compressed, and a colour background layer, which is lossily compressed. This enables more efficient compression than treating the whole page as a single image. PDF readily supports a similar structure (a JBIG2 black-and-white foreground layer on top of a colour background), and proprietary PDF software (particularly in scanners) can generate analogously structured PDF documents, but I am unaware of any open source software that does so.1
pdf-segmented
Here I present pdf-segmented, a tool which separates the black-and-white foreground from the colour background in a (potentially multi-page) document, and produces a PDF (or DjVu) file in which the foreground and background are compressed separately, analogously to DjVu.
The foreground is detected as all pixels in the original document which are pure black (#000000). This suffices for the vast majority of scanned text-based documents, but the input must first be prepared manually so that all text is pure black. This is most easily accomplished in GIMP by selecting all colour graphics, inverting the selection (Ctrl+I), then applying the Threshold tool.
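The detection rule above can be illustrated with a minimal sketch. This is not pdf-segmented's actual code: the function name `split_layers`, the nested-list image representation and the choice to fill removed foreground pixels with white in the background layer are all assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels

BLACK: Pixel = (0, 0, 0)
WHITE: Pixel = (255, 255, 255)  # assumed fill colour, not taken from the source


def split_layers(image: Image) -> Tuple[List[List[int]], Image]:
    """Split an RGB image into a bi-level foreground mask and a colour background.

    Hypothetical helper: a pixel belongs to the foreground only if it is
    exactly pure black (#000000).
    """
    mask: List[List[int]] = []
    background: Image = []
    for row in image:
        # 1 marks a foreground (pure black) pixel, 0 marks everything else.
        mask.append([1 if px == BLACK else 0 for px in row])
        # Foreground pixels are blanked out of the background layer.
        background.append([WHITE if px == BLACK else px for px in row])
    return mask, background


# A tiny 2x3 "page": pure black text pixels, white paper, one red pixel,
# and one near-black pixel (1, 1, 1) that has not been thresholded.
page: Image = [
    [(0, 0, 0), (255, 255, 255), (200, 30, 30)],
    [(1, 1, 1), (0, 0, 0), (255, 255, 255)],
]
mask, background = split_layers(page)
```

In this sketch the near-black pixel `(1, 1, 1)` stays in the background layer, which is why the text needs to be thresholded to pure black before segmentation.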
Example
Below is an example of a scanned letter (courtesy of Seattle Municipal Archives, via Wikimedia Commons, licensed under the CC BY 2.0 Generic licence). The image on the left shows the original scan. The image on the right shows the result after manual cleaning, including setting all text to pure black.2
Using a naive approach, we could save the cleaned image as PNG using maximum compression, and use img2pdf to directly embed it into a PDF. This is lossless, but results in a large file size of 675.8 KiB. Alternatively, we could save as JPEG and use img2pdf. Using 75% quality, this yields a smaller file size of 557.9 KiB, but is lossy and introduces JPEG artifacts, which might be acceptable for the images but is particularly undesirable in the text:3
Using pdf-segmented, the text (foreground) can be compressed losslessly using JBIG2, while JPEG compression is used for the colour graphics (background). This yields a much smaller file size of 226.4 KiB, while the text retains its crisp edges. The result (PDF file here) is shown below:
pdf-segmented also supports other compression methods. Below is a comparison of various combinations:
Output | Foreground | Background | Size |
---|---|---|---|
PDF | Naive (single PNG) | | 675.8 KiB |
PDF | Naive (single JPEG, 75% quality) | | 557.9 KiB |
PDF | PNG | PNG | 648.0 KiB |
PDF | PNG | JPEG (75% quality) | 258.6 KiB |
PDF | JBIG2 | PNG | 615.8 KiB |
PDF | JBIG2 | JPEG (75% quality) | 226.4 KiB |
PDF | JBIG2 | JPEG 2000 (lossless) | 695.3 KiB |
PDF | JBIG2 | JPEG 2000 (ratio 128:1) | 217.3 KiB |
DjVu | JB2 | IW44 (defaults) | 104.8 KiB |
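To make the comparison easier to read, the sizes can be expressed relative to the naive single-PNG baseline. The short sketch below does only that arithmetic; the figures are copied from the table, while the dictionary keys and the helper name `relative_size` are illustrative labels rather than anything defined by pdf-segmented.

```python
# File sizes in KiB, copied from the comparison table.
sizes_kib = {
    "Naive (single PNG)": 675.8,
    "Naive (single JPEG, 75% quality)": 557.9,
    "PNG + JPEG (75% quality)": 258.6,
    "JBIG2 + JPEG (75% quality)": 226.4,
    "JBIG2 + JPEG 2000 (ratio 128:1)": 217.3,
    "DjVu (JB2 + IW44 defaults)": 104.8,
}

baseline = sizes_kib["Naive (single PNG)"]


def relative_size(size_kib: float, baseline_kib: float) -> float:
    """Return size as a percentage of the baseline, rounded to one decimal."""
    return round(100 * size_kib / baseline_kib, 1)


# Percentage of the naive PNG size for every variant.
relative = {name: relative_size(size, baseline) for name, size in sizes_kib.items()}

for name, pct in relative.items():
    print(f"{name}: {pct}% of naive PNG")
```

On these figures, the JBIG2 plus JPEG output is about a third of the size of the naive PNG, and the DjVu output is smaller still.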
The images below compare the compression artifacts in JPEG, JPEG 2000 and IW44:
Input file
JPEG (75% quality)
JPEG 2000 (ratio 128:1)
IW44 (encoder defaults)
Footnotes
1. Thanks to maskros on Hacker News, whose comment, which I stumbled upon at some point in the last 3 years, sketches out an analogy between DjVu and PDF in this way.
2. Images have been downscaled to 96 dpi for online publication. File sizes stated are based on 300 dpi images.
3. Contrast has been adjusted to enhance the JPEG artifacts for illustration purposes.