Full of Images Yet Starved for Usable Content? Think OCR

By Scott Mackey, April-28-10

We’re in the business of automating business processes – and one of the most popular is turning a seemingly endless flow of images into searchable PDF output. Scanners, imaging software and legacy content repositories are all loaded with images and starved for usable content – until that content is revealed through the process of Optical Character Recognition (OCR).

Since the type, volume and quality of images vary widely, we are often asked: “How much hardware and software do I need?” It’s difficult to answer because it depends on a variety of factors, including:

  • Image quality
  • Resolution
  • File size
  • Page count
  • Server hardware/network performance

All of these factors can change overall system throughput, by a little or by a lot.

Image Quality

Poor image quality makes the OCR engine work harder, such as when:

  • Characters are unclear
  • Characters merge with adjacent characters
  • Images are warped from the original scan

The lower the quality of the image, the slower the processing speed. Most high-quality software offers configurable settings for image clean-up and de-skewing, which prepare the document before OCR and improve the likelihood of accurate character recognition.

Resolution

If you’ve ever zoomed in on a low-resolution web image, you’ll know that it looks quite pixelated (rough) when viewed up close. Higher-resolution images remain clear as you zoom in. Correspondingly, if an OCR engine has clearer characters to analyze, it will have more success at rendering the output accurately.

The balance that needs to be struck is between file size and OCR-level quality. High-resolution scans have much larger file sizes than low-resolution scans, which adds transmission and storage costs.

Most experts suggest 300 dpi (dots per inch) as the resolution that best balances OCR-level quality and file size.
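To make the trade-off concrete, here is a minimal sketch of how raw image size grows with resolution. The page dimensions (8.5 x 11 inches) and the 8-bit grayscale depth are assumptions chosen for illustration, and the function name is hypothetical. Real scans are normally compressed, so actual files will be smaller, but the proportions hold.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) size of one scanned page, in bytes.

    Ignores file headers and any row padding a real image format may add.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Assumed example: a letter-size page scanned in 8-bit grayscale.
    for dpi in (150, 300, 600):
        size = uncompressed_size_bytes(8.5, 11, dpi, 8)
        print(f"{dpi} dpi: {size / 1_000_000:.1f} MB")
```

Because the pixel count scales with the square of the resolution, doubling the dpi quadruples the raw size. That is why going well beyond 300 dpi gets expensive quickly.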

File Size and Page Count

Large page sizes (think CAD drawings) and documents with hundreds or thousands of pages logically take more time to process. Some systems allow you to segment files with large page counts into multiple files that can be distributed across multiple hardware/software instances for faster overall throughput and reduced bottlenecks. In more advanced environments you may have the option to ‘re-merge’ the newly recognized content if required.
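The segmenting idea can be sketched as follows. This is only an illustration of how a large document might be divided into page ranges for separate OCR instances. The function name and the chunk size are hypothetical and do not reflect any particular product's API.

```python
def split_page_ranges(page_count, max_pages_per_chunk):
    """Return 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if max_pages_per_chunk < 1:
        raise ValueError("max_pages_per_chunk must be at least 1")

    ranges = []
    start = 1
    while start <= page_count:
        end = min(start + max_pages_per_chunk - 1, page_count)
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # Assumed example: a 1,000-page document split into chunks of up to 300 pages.
    for first, last in split_page_ranges(1000, 300):
        print(f"pages {first}-{last}")
```

Keeping the ranges in order also makes any later re-merge step straightforward, since the recognized segments can be concatenated in the same sequence.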

Server Hardware/Network Performance

Faster, more powerful servers will process documents faster than slower ones – no surprise there. For users, the network performance can play a role since files and instructions need to be transferred. This is more of an issue when the server resides in a different geographic region than the user. For enterprise users, look for technology that can be seamlessly scaled out to meet large or dynamic requirements. Our customers often deploy multiple instances of the application to maximize performance while benefitting from the added fault tolerance and redundancy.

PDF/A Requirements

Consider also the overall goals of making the content searchable. PDF/A (PDF for Archive), an ISO standard for long-term accessibility, is a growing requirement, and pairing it with OCR makes documents searchable in an approved archival format.

Though it is more popular in Europe at the moment, there is a growing wave of interest in both PDF/A and the requirement to use OCR, wherever possible, to make the content searchable.

In the U.S., the National Archives and Records Administration (NARA) discusses both PDF/A and OCR on its web site:

“Agencies that embed searchable text in PDF scanned images should use OCR processes that do not alter the original bit-mapped image. For example: agencies should avoid OCR processes that substitute OCR’d text for bitmapped characters, and/or use lossy compression to reduce file size.”

As with any project, start with the end in mind: what is the intended use for the content? Consider the items above as you determine the best setup for your OCR solution. Doing it right will maximize the value you derive from your content.