Foreground segmentation with JBIG2 for improved PDF compression: pdf-segmented

Motivation

JBIG2 is an efficient image compression format for bi-level (bi-tonal) images, which is supported by the PDF file format and common PDF viewers. To date, tooling for producing documents using JBIG2 (particularly open source tooling) has been limited; I have previously presented file-jbig2pdf, a GIMP plugin for generating PDF files using JBIG2. This suffices for simple black-and-white documents, but does not assist when documents contain both black-and-white text and colour graphics.

In the DjVu format, which was designed specifically for archival, scanned documents can be separated into a black-and-white foreground layer, which is losslessly compressed, and a colour background layer, which is lossily compressed. This compresses more efficiently than applying a single method to the whole page. PDF readily supports a similar structure (a JBIG2 black-and-white foreground layer on top of a colour background), and proprietary PDF software (particularly in scanners) can generate analogously structured PDF documents, but I am unaware of any open source software to do so.¹

pdf-segmented

Here I present pdf-segmented, a tool which separates the black-and-white foreground from the colour background in a (potentially multi-page) document, and produces a PDF (or DjVu) file where the foreground and background are compressed separately, analogous to the layered approach of DjVu.

The foreground is detected as all pixels in the original document which are pure black (#000000). This suffices for the vast majority of scanned text-based documents, but does require some manual preparation so that the text is pure black while the colour graphics are left untouched. This is most easily accomplished by selecting all colour graphics in GIMP, inverting the selection (Ctrl+I), then applying the Threshold tool.
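
The pure-black rule can be illustrated with a minimal sketch in plain Python. This is not the pdf-segmented implementation: the `segment` function, the list-of-rows pixel representation, and the choice to fill removed foreground pixels with white are all assumptions made for illustration.

```python
# Minimal sketch of pure-black foreground detection.
# Assumptions (not taken from pdf-segmented): pixels are (r, g, b) tuples
# stored as a list of rows, and foreground pixels are replaced with white
# in the background layer.

BLACK = (0, 0, 0)
WHITE = (255, 255, 255)  # assumed fill colour for the background layer


def segment(pixels):
    """Split an RGB image into a bi-level mask and a colour background.

    Returns (mask, background), where mask holds 1 for every pure-black
    pixel and 0 otherwise, and background is a copy of the image with
    the pure-black pixels replaced by WHITE.
    """
    mask = [[1 if px == BLACK else 0 for px in row] for row in pixels]
    background = [[WHITE if px == BLACK else px for px in row] for row in pixels]
    return mask, background


# A tiny 3x2 test image: two pure-black pixels, one near-black pixel,
# one coloured pixel and two white pixels.
image = [
    [(0, 0, 0), (255, 255, 255), (200, 30, 30)],
    [(1, 1, 1), (0, 0, 0), (255, 255, 255)],
]
mask, background = segment(image)
```

Note that the near-black pixel (1, 1, 1) stays in the background, which is why the input needs thresholding beforehand: only exact #000000 pixels count as foreground under this rule.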

Example

Below is an example of a scanned letter (courtesy of Seattle Municipal Archives, via Wikimedia Commons, licensed under the CC BY 2.0 Generic licence). The image on the left shows the original scan. The image on the right shows the result after manual cleaning, including setting all text to pure black.²

Original letter

Letter after cleanup

Using a naive approach, we could save the cleaned image as PNG using maximum compression, and use img2pdf to directly embed it into a PDF. This is lossless, but results in a large file size of 675.8 KiB. Alternatively, we could save as JPEG and use img2pdf. Using 75% quality, this yields a smaller file size of 557.9 KiB, but is lossy and introduces JPEG artifacts, which might be acceptable for the images but is particularly undesirable in the text:³

JPEG artifacts

Using pdf-segmented, the text (foreground) can be compressed losslessly using JBIG2, while JPEG compression is used for the colour graphics (background). This yields a much smaller file size of 226.4 KiB, while the text retains its crisp edges. The result (PDF file here) is shown below:

Letter compressed using pdf-segmented

pdf-segmented also supports other compression methods. Below is a comparison of various combinations:

Output	Foreground	Background	Size
PDF	Naive (single PNG)		675.8 KiB
PDF	Naive (single JPEG, 75% quality)		557.9 KiB
PDF	PNG	PNG	648.0 KiB
PDF	PNG	JPEG (75% quality)	258.6 KiB
PDF	JBIG2	PNG	615.8 KiB
PDF	JBIG2	JPEG (75% quality)	226.4 KiB
PDF	JBIG2	JPEG 2000 (lossless)	695.3 KiB
PDF	JBIG2	JPEG 2000 (ratio 128:1)	217.3 KiB
DjVu	JB2	IW44 (defaults)	104.8 KiB
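
To make the table easier to compare, the following sketch computes how much smaller selected outputs are relative to the naive single-PNG PDF. The sizes are copied from the table above; the dictionary labels and the `reduction_percent` helper are hypothetical names introduced only for this illustration.

```python
# File sizes in KiB, copied from the comparison table.
sizes_kib = {
    "Naive PNG": 675.8,
    "Naive JPEG (75%)": 557.9,
    "JBIG2 + JPEG (75%)": 226.4,
    "JBIG2 + JPEG 2000 (128:1)": 217.3,
    "DjVu (JB2 + IW44)": 104.8,
}


def reduction_percent(size, reference):
    """Return how much smaller `size` is than `reference`, as a percentage."""
    return 100.0 * (1.0 - size / reference)


baseline = sizes_kib["Naive PNG"]
for name, size in sizes_kib.items():
    saved = reduction_percent(size, baseline)
    print(f"{name}: {size:.1f} KiB ({saved:.1f}% smaller than naive PNG)")
```

By this measure the JBIG2 + JPEG output is about a third of the size of the naive PNG output.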

The images below compare the compression artifacts in JPEG, JPEG 2000 and IW44:

Input file

JPEG (75% quality)

JPEG 2000 (ratio 128:1)

IW44 (encoder defaults)

Footnotes

¹ Thanks to maskros on Hacker News, whose comment, which I stumbled upon at some point in the last 3 years, sketches out an analogy between DjVu and PDF in this way.
² Images have been downscaled to 96 dpi for online publication. File sizes stated are based on 300 dpi images.
³ Contrast adjusted to enhance the JPEG artifacts for illustration purposes.