Size Matters (and Smaller Isn’t Always Better) When it Comes to Document Compression for Archiving

February 22, 2011


I wouldn’t be surprised if you’ve felt intimidated by all the compression options available when converting your documents to PDF. Your questions may include:

  • Should the images be downsampled? (That is, should their resolution be reduced to what the document actually needs at the size they are placed?)
  • Should I use Mixed Raster Content (MRC) or JPEG compression?
  • Do I compress content? If so, what are the implications?
  • What is linearization?

It’s easy to think that smaller is better… but attempting to create the smallest possible file size is rarely the right choice.

Image Downsampling

Image downsampling is a good idea, unless you intend to send the resulting PDF to a very high-resolution printer. It matters more today than in the past, because cameras and other image sources now produce high-resolution, high-quality images. When you place one in a document and resize it to fit, it usually carries far more data than is needed to render a good-quality image at that size.
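To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The 150 dpi target and the example photo dimensions are assumptions chosen for illustration; the right target depends on where the PDF will be viewed or printed.

```python
def downsample_plan(src_width_px, src_height_px, placed_width_in, target_dpi=150):
    """Estimate how many pixels an image needs at its placed size.

    target_dpi=150 is an illustrative assumption, not a recommendation.
    """
    # Resolution the image effectively has once it is squeezed into the page.
    effective_dpi = src_width_px / placed_width_in

    # Pixels actually needed to hit the target resolution at that size.
    target_width = round(placed_width_in * target_dpi)
    if target_width >= src_width_px:
        # The image is already at or below the target: leave it alone.
        target_width, target_height = src_width_px, src_height_px
    else:
        # Keep the aspect ratio when scaling the height.
        target_height = round(src_height_px * target_width / src_width_px)

    return {
        "effective_dpi": effective_dpi,
        "target_width": target_width,
        "target_height": target_height,
        "pixel_reduction": (src_width_px * src_height_px)
        / (target_width * target_height),
    }


# Hypothetical example: a 4000 x 3000 photo placed 4 inches wide.
plan = downsample_plan(4000, 3000, placed_width_in=4)
print(plan)
```

In this example the photo sits on the page at an effective 1000 dpi, while 600 x 450 pixels would be enough for the assumed target, which is well over forty times fewer pixels to store.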

Authoring software such as Microsoft Word and OpenOffice doesn’t readily provide tools for reducing the size of these images, so documents and presentations swell to many megabytes. Image downsampling is clearly the solution here.

MRC vs. JPEG vs. Others

Although reducing the image resolution will go a long way towards decreasing your file size, additional compression techniques can shrink your documents even further. This is where knowing the intent of the document (i.e. “archiving” vs. “used often”) is important.

Most compression methods, such as LZW and JPEG, force a trade-off between file size and quality. MRC, however, segments the image into regions with similar characteristics so that each region can be compressed with the algorithm best suited to that area of the bitmap. Sound complex? That’s because it is!
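As a rough illustration of the segmentation idea, the toy Python sketch below splits a tiny grayscale "page" into a 1-bit mask (where the dark, text-like pixels are) plus a foreground and a background layer. It is heavily simplified and every name in it is hypothetical: each layer is collapsed to a single gray value to keep the example short, whereas a real MRC encoder keeps each layer as an image and compresses it with a codec suited to that layer.

```python
def mrc_split(pixels, threshold=128):
    """Split a grayscale image (rows of 0-255 values) into MRC-style layers.

    Toy model: the foreground and background layers are reduced to one
    average gray value each. The threshold of 128 is an arbitrary assumption.
    """
    # 1 marks dark (text-like) pixels, 0 marks light (paper-like) pixels.
    mask = [[1 if p < threshold else 0 for p in row] for row in pixels]

    dark = [p for row in pixels for p in row if p < threshold]
    light = [p for row in pixels for p in row if p >= threshold]

    foreground = round(sum(dark) / len(dark)) if dark else 0
    background = round(sum(light) / len(light)) if light else 255
    return mask, foreground, background


def mrc_merge(mask, foreground, background):
    """Rebuild an approximation of the page from the three layers."""
    return [[foreground if bit else background for bit in row] for row in mask]


# Hypothetical 2 x 3 scan: a dark vertical stroke on light paper.
page = [
    [250, 20, 250],
    [248, 22, 252],
]
mask, fg, bg = mrc_split(page)
rebuilt = mrc_merge(mask, fg, bg)
print(mask, fg, bg)
print(rebuilt)
```

The point of the split is that the sharp-edged mask and the smooth background have very different characteristics, so each can be handed to a different compressor instead of forcing one algorithm to handle both.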

With MRC, you can get extremely small file sizes with very good image quality. However, the document can become difficult to navigate if you go too far. Because aggressive MRC makes a document more complex, a viewing engine such as the free Adobe Reader or Foxit Reader can struggle to display the pages while you’re scrolling through them.

Long-Term Archiving

This makes MRC an excellent choice for long-term archiving, where the cost is paid not in document quality but in document usability. It’s also a valid approach to compression according to the PDF/A specification, which is the ISO standard for long-term archiving.

For day-to-day consumption, JPEG or LZW compression is more appropriate. Version 1.5 of the PDF specification also introduced JPEG2000, which offers a modest increase in efficiency and a significant increase in image quality.

These formats strike a good balance of quality vs. file size without any negative effect on the usability of the document. Viewers such as Adobe Reader and Foxit Reader will have no trouble displaying the pages as you navigate through the document.

Content Compression

A PDF file is really a set of instructions that tell a viewer or a printer how to draw the pages that represent the original file. By default, these instructions are uncompressed: all of the text, and all of the instructions for drawing vector graphics such as boxes and circles, are stored much like a plain text file. That is not very efficient, but it saves a step when drawing the information.

With content compression, all of these instructions are compressed like a ZIP file. The result is a smaller file, with the trade-off that the viewer needs to decompress the data before displaying it. (Given the power of today’s machines, the impact is negligible.)
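The effect is easy to see with a small Python sketch. It builds a made-up page of drawing instructions (the invoice text and coordinates are invented for illustration) and compresses it with Python's zlib module, which implements the same Deflate family of compression used by ZIP files and by PDF's Flate filter. A real PDF generator would also wrap the result in a stream object, which is omitted here.

```python
import zlib


def build_content_stream(rows):
    """Build a hypothetical, repetitive page of PDF drawing instructions."""
    lines = []
    for i in range(rows):
        y = 750 - 12 * i
        # Draw one line of text, then a thin rule underneath it.
        lines.append(f"BT /F1 10 Tf 72 {y} Td (Invoice line {i}) Tj ET")
        lines.append(f"72 {y - 2} 468 0.5 re f")
    return "\n".join(lines).encode("ascii")


raw = build_content_stream(50)
packed = zlib.compress(raw, 9)       # what the generator would store
restored = zlib.decompress(packed)   # what the viewer does before drawing

print(len(raw), "bytes uncompressed")
print(len(packed), "bytes compressed")
print("round trip intact:", restored == raw)
```

Because drawing instructions repeat the same operators over and over, they tend to compress very well, and the extra decompression step on the viewer's side is a single cheap call.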

Web Optimized

Linearization (or ‘web optimized’) simply instructs the PDF generator to put all of the document’s objects in order, starting with the content of page one. This may mean a slight increase in file size. However, when a user accesses the file from the web, the viewer can display the first several pages while it downloads the rest. Otherwise, common objects, such as an image that is part of the footer, may be placed anywhere in the file, meaning the viewer may need to load the entire file before it can render the first page.
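The footer example can be sketched as a toy model in Python. This is not how a real PDF generator lays out a file (actual linearization also adds lookup data that this ignores), and the object names and sizes are invented. It only shows why the position of shared objects matters when a file arrives front to back.

```python
def bytes_before_first_page(objects, first_page=1):
    """Bytes that must arrive, reading front to back, before page one can draw.

    objects: list of (name, size_in_bytes, set_of_pages_using_it) in file order.
    """
    offset = 0
    needed_until = 0
    for _name, size, pages in objects:
        offset += size
        if first_page in pages:
            # Everything up to the end of this object must be downloaded.
            needed_until = offset
    return needed_until


def linearize(objects, first_page=1):
    """Toy reordering: page-one objects first, then the rest in page order."""
    first = [o for o in objects if first_page in o[2]]
    rest = [o for o in objects if first_page not in o[2]]
    rest.sort(key=lambda o: min(o[2]))
    return first + rest


# Hypothetical file layout; the footer logo is shared by every page
# but happens to have been written last.
unordered = [
    ("page2_text", 4_000, {2}),
    ("page1_text", 3_000, {1}),
    ("page3_image", 90_000, {3}),
    ("footer_logo", 15_000, {1, 2, 3}),
]

print(bytes_before_first_page(unordered))
print(bytes_before_first_page(linearize(unordered)))
```

In the unordered layout the whole file has to arrive before page one is complete, because the shared footer sits at the end. After reordering, only the page-one text and the footer are needed up front.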

In most cases, enabling linearization is a good idea: The increase in file size is negligible, but the ability to immediately view the content is invaluable in today’s no-patience society.

Now that you have an understanding of the various options when it comes to compressing PDFs, you can create documents that are more appropriate for their intended use.

Whether it’s a document you need to distribute to customers to help them better understand your products – or you need to archive an invoice to meet regulatory requirements – be sure that your PDF rendering technology has the options to achieve the results you need.
