FeRDs WeRDs: Hyper-Optimization of PDF Scans

Mind: BLOWN.

The other night Robert threw a bunch of PDFs my way, scans from a Xerox office copier that needed to be converted to JPEG. Child's play for ImageMagick, even with dozens of files to convert. So, off we go...

The only sticking point was figuring out what DPI to rasterize them at. The nominal resolution of the PDFs appeared to be impossibly low. With default settings, ImageMagick produced a near-screen-res image that lost a ton of detail, detail I could see in the PDF. Plus, the resulting JPEG file sizes were already larger than the (tiny) PDFs. Confusing, since you'd expect a PDF of a scan to contain nothing but a bitmap, so the converted JPEG should come out within a few KB of the same size. But at a resolution that visually matched the PDF's detail level (300 dpi), the output file was almost 10x the size of the input! Something didn't add up.
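To put rough numbers on why the DPI choice matters so much, here's a quick sketch of the pixel arithmetic. The page size is an assumption (US Letter, 8.5 x 11 inches; the actual scans may have been a different size), and 72 dpi is ImageMagick's default density. Pixel count grows with the square of the DPI, which is why the rasterized output balloons so quickly.

```python
def raster_size(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page rasterized at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_ratio(low_dpi, high_dpi):
    """How many times more pixels the higher DPI produces for the same page."""
    return (high_dpi / low_dpi) ** 2


# Assumed page size: US Letter, 8.5 x 11 inches (not stated in the post).
PAGE_W_IN, PAGE_H_IN = 8.5, 11

for dpi in (72, 150, 300):
    w, h = raster_size(PAGE_W_IN, PAGE_H_IN, dpi)
    print(f"{dpi:>3} dpi: {w} x {h} px = {w * h / 1e6:.1f} megapixels")

print(f"300 dpi has {pixel_ratio(72, 300):.1f}x the pixels of 72 dpi")
```

JPEG compression means file size won't track pixel count exactly, but the trend is the same: more DPI, a lot more data.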

I decided to pull one of the PDFs into Inkscape (think: open-source Adobe Illustrator), to see if I could at least find the native resolution of the embedded bitmap (the one I thought should be the only thing in the file). That's when things started to get weird.

Turned out, the file didn't contain just one embedded bitmap image object. It contained NINETEEN distinct component objects. The base image was that expected embedded bitmap, which was indeed stored at the near-screen-resolution nominal dimensions. Yet, as I'd proved to myself by zooming in on the PDF, there was far too much detail for that resolution!
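For anyone who wants to check their own scans without opening Inkscape, one option (not what I did here) is the `pdfimages` tool from poppler-utils, whose `-list` flag prints a table of the embedded image objects with their dimensions and type. A minimal sketch, assuming `pdfimages` is installed and on the PATH; the filename `scan.pdf` is just a placeholder:

```python
import subprocess


def list_images_command(pdf_path):
    """Build the poppler command that lists embedded images in a PDF."""
    return ["pdfimages", "-list", str(pdf_path)]


def list_images(pdf_path):
    """Run pdfimages and return its table of embedded images as text.

    Assumes poppler-utils is installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        list_images_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example usage (placeholder filename):
#     print(list_images("scan.pdf"))
```

A scan stored as one plain bitmap should show a single row per page; a file built like this one would be expected to show many.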

The trick was in those other 18 objects. I don't know if this was Xerox's code, or the latest Adobe PDF creator optimizations, but whatever the source... the image had been deconstructed, and any areas that required extra detail (mostly, the white lettering printed against the dark background art) had been stored in masked layers placed on top of the low-res image. Those layers are all much higher resolution than the background, but each one contains only a single color, so they're stored as one-bit masks that take a few kilobytes at most.
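Here's a toy sketch of how that kind of layered structure reassembles into a full-detail image. It is not the copier's actual algorithm (which I haven't seen), just the general idea: blow up the low-res background, then paint each single-color layer wherever its one-bit mask is set. For simplicity the masks here cover the whole page, whereas the layers in the real file are small boxes positioned over the background.

```python
def composite(background, scale, layers):
    """Upsample a low-res background and paint one-bit mask layers on top.

    background: 2D list of pixel values at low resolution
    scale:      integer upsampling factor from background to mask resolution
    layers:     list of (color, mask) pairs; each mask is a 2D list of 0/1
                at the full (upsampled) resolution
    """
    height = len(background) * scale
    width = len(background[0]) * scale

    # Nearest-neighbour upsample: each background pixel becomes a
    # scale x scale block on the canvas.
    canvas = [
        [background[y // scale][x // scale] for x in range(width)]
        for y in range(height)
    ]

    # Each layer has one color; the mask bit says where it applies.
    for color, mask in layers:
        for y, row in enumerate(mask):
            for x, bit in enumerate(row):
                if bit:
                    canvas[y][x] = color
    return canvas


# A 2x2 "dark" background and one high-res white-text layer at 2x scale.
background = [[".", "."], [".", "."]]
text_mask = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
for row in composite(background, 2, [("W", text_mask)]):
    print("".join(row))
```

The storage win comes from the mask: before any compression, one bit per pixel is 1/24 the size of 24-bit RGB at the same resolution, and the background only has to be stored at the low resolution.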

Keep in mind, this optimization was all done 100% automatically by the copier's scan-to-PDF function! So, upshot is that even though the file is tiny in size, it contains all the detail of a much higher-res image. And any conversion to JPEG or other fixed-resolution format would necessarily result in a much larger file than the PDF. I ended up pegging the rasterization at 150 dpi and outputting 85%-quality JPEGs, which tended to be roughly 2-3x the file size of the source.
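For reference, a batch conversion with those settings could look something like the sketch below. This is a reconstruction, not the exact command I ran: it assumes ImageMagick's `convert` is on the PATH (newer ImageMagick 7 installs call it `magick`) and that Ghostscript is available for PDF rendering. The function names and folder layout are made up for illustration. Note that `-density` goes before the input file so it controls the rasterization resolution.

```python
import subprocess
from pathlib import Path


def convert_command(pdf_path, dpi=150, quality=85):
    """Build an ImageMagick command to rasterize a PDF to JPEG."""
    pdf = Path(pdf_path)
    out = pdf.with_suffix(".jpg")
    return [
        "convert",
        "-density", str(dpi),      # rasterization DPI, set before reading the PDF
        str(pdf),
        "-quality", str(quality),  # JPEG quality setting
        str(out),
    ]


def convert_all(folder):
    """Convert every PDF in a folder (hypothetical layout), one JPEG each."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(convert_command(pdf), check=True)


# Example usage (placeholder folder name):
#     convert_all("scans")
```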

Craziness.

The first image here is one of the output files. The second I created in Inkscape, showing the structure of the input PDF: each box is a non-background layer, green for white text and blue for detail sections in the background art, and the background image is dimmed so the detail elements stand out more clearly.


The scanned image

Structure of the scanned PDF

Detail

Structure detail

2014-04-23
