Automating OCR of Documents in SharePoint

Posted on: Thursday, May 28, 2009 by Paul Dyck

Recent discussions with customers have indicated that there is a need in SharePoint to automate optical character recognition (OCR) of documents once they are already in a SharePoint library.

One case where OCR was needed involved a research organization that deployed SharePoint to manage its research papers, reports, and materials received from external sources. They moved their massive collection of existing content to SharePoint and continue to add new material to their knowledge base. Unfortunately, most of the legacy content is in image-only PDF format, so SharePoint cannot index its text and users cannot find it through search. Their existing solution was to process the material manually: someone would scan the document and add keywords to the document metadata that would be picked up by the search indexer.

There are several solutions, such as Knowledge Lake, for capturing paper documents from hardware devices and storing them in SharePoint, but what about scanned documents that arrive via email or are already in SharePoint, as in the example above? These documents will be very difficult to find unless some text is available to be indexed for search. One customer I spoke with recently told me that he tried to find an answer to this problem at the recent European Best Practices SharePoint Conference in the UK. He posed the question in an open session, and no one could come up with a solution.

If you are looking for a way to make scanned document image files or image-only PDF files searchable within SharePoint, Adlib has an answer for you. Our PDF for SharePoint solution includes SharePoint workflows that automate OCR within SharePoint, enabling you to create searchable PDF files and make your content more valuable.

Paul Dyck

Product Manager

Adlib


Posted in:  SharePoint
