When is a Page Not a Page? When it’s HTML

November 5, 2010


I have occasionally come across the argument that PDF is not suitable for Web use. (Can you argue with 200 million PDF documents posted on the web?) I’ve read comments such as, “PDF is useless for presenting quality images,” or “PDF is good for printing, but that’s it.” Neither of these claims could be further from the truth.

I’d like to set the record straight on why PDF is a good format for the web (not to mention an ideal format for the electronic sharing, annotating, securing, digitally signing and archiving of documents, activities that make up the bulk of business processes).

HyperText Markup Language (HTML), created in 1991, is text-based and all about hyperlinks, hence its usefulness for web content. Two years later, in 1993, the Portable Document Format (PDF) was introduced. As a binary file format, PDF is a way to reliably view, print, and share information with other people, so that it displays the same way on every platform. Granted, HTML is better suited for the web in two respects: it reflows content based on the window or page size, and its text-based file format is easier to edit.

However, PDF and HTML have more in common than you might think:

  • Platform-independent
  • Open ISO standards
  • Support tagging: As the W3C Web Accessibility Initiative states, “tagged PDF is a stylized use of PDF that allows reliable recovery of text, graphics, and images in PDF documents, with no ambiguity about the contents or the ordering of the contents.” HTML, for its part, is constructed using tags that convey its structure
  • Supported by browsers
  • Can be indexed by search engines

What PDF Has That HTML Lacks

I’m not going to provide an exhaustive list here, but here are some of the PDF attributes (which HTML lacks) that make it well suited for business:

1. One of the key attributes of PDF is that it displays and prints the same on all platforms. Many business processes involve documents, and for these processes it is important that documents render consistently, so that people can refer to a specific location (e.g. the bottom right corner of a specific page) and all participants can locate the same content easily. Having a page structure also allows the grouping of multiple pages onto a single sheet (a process known as N-up imposition). And the page structure inherent in PDF means the business user can display pages in various ways, such as a two-page book view.

2. Another key attribute of PDF is that it displays the page as laid out by the publisher in the source document. This includes all aspects of the document, such as fonts and character effects, tables, and border styles. In addition, PDF supports the CMYK color model for color printing, which is important for published materials.

The above two attributes are important in document review processes where documents are annotated and redacted.

3. PDF also supports layers, which makes it well suited to Optical Character Recognition (OCR) of scanned paper documents. The PDF stores the page image with the recognized text behind it, which allows for indexing, searching and word highlighting. In addition, PDF supports JBIG2 compression, which can significantly reduce the file size compared with the scanned TIFF document.

4. Good page layout/design is something the human brain recognizes as a familiar way to organize information – thereby capturing attention and expediting comprehension. The effect of page layout on mental workload is well documented.
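
The N-up imposition mentioned in item 1 comes down to simple page arithmetic, which is only possible because PDF has fixed page dimensions. Below is a minimal sketch in Python of how the placement of each page on a sheet might be computed. The names `Placement` and `n_up_layout` are hypothetical, chosen for illustration, and the sketch assumes all source pages share one size and that coordinates are measured in points from the bottom-left corner of the sheet, as in PDF's default coordinate system. It only computes positions; it does not read or write PDF files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one source page lands on the output sheet (hypothetical type)."""
    page_index: int  # zero-based index of the source page
    x: float         # left edge of the scaled page on the sheet, in points
    y: float         # bottom edge of the scaled page on the sheet, in points
    scale: float     # uniform scale factor applied to the source page


def n_up_layout(page_w, page_h, sheet_w, sheet_h, cols, rows):
    """Compute placements for cols * rows equally sized pages on one sheet.

    Pages fill the grid left to right, top to bottom. Each page is scaled
    uniformly to fit its cell and centered inside it.
    """
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")

    cell_w = sheet_w / cols
    cell_h = sheet_h / rows
    # One scale factor for both axes keeps the page's aspect ratio.
    scale = min(cell_w / page_w, cell_h / page_h)

    placements = []
    for i in range(cols * rows):
        col = i % cols
        row = i // cols
        # Center the scaled page horizontally within its cell.
        x = col * cell_w + (cell_w - page_w * scale) / 2
        # The origin is at the bottom-left, so row 0 sits at the top.
        y = sheet_h - (row + 1) * cell_h + (cell_h - page_h * scale) / 2
        placements.append(Placement(i, x, y, scale))
    return placements


# Example: two portrait US Letter pages (612 x 792 points) side by side
# on one landscape US Letter sheet (792 x 612 points).
for p in n_up_layout(612, 792, 792, 612, cols=2, rows=1):
    print(p)
```

The same function covers 4-up or 6-up layouts by changing `cols` and `rows`. Content that reflows to fit its window, as HTML does, has no fixed page size to feed into this kind of calculation.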

So, the misunderstood PDF is actually capable of giving the user a good web experience – plus a whole lot more business benefits outside of the web world.
