Automating OCR of Documents in SharePoint
By Jeff Brand | May 29, 2014
Microsoft SharePoint is a powerful enterprise-grade solution that facilitates teamwork and collaboration by allowing organizations to store and share documents in one secure, centralized location. Those benefits become much harder to realize, however, the moment an organization begins using SharePoint as a dumping ground for vast volumes of unstructured and unmanaged data.
To harness the full strategic value of SharePoint, organizations must convert the unstructured documents that reside within this platform into data that’s searchable, findable, and usable. This can be achieved by adopting efficient and effective SharePoint OCR (Optical Character Recognition) processes. Here’s why that’s important and how to find the right solution.
The Challenge of Unstructured Data
Most enterprises are sitting on repositories containing millions of historical documents that are largely unstructured and stored in formats that automated tools often cannot read, search, or otherwise use (think emails, TIFFs, CAD files, JPGs, etc.). The legacy data companies have already stored is just the tip of the iceberg. When you factor in the millions of new files that arrive every day, matters become much more complicated.
All of this unsearchable, dark data causes significant problems in a few key areas of business and operations. Knowledge workers spend time searching for the files they need instead of working at their highest level of capability. Searching for the right data also slows downstream workflows and creates delays in serving customers, creating products, and responding to compliance requests. And without access to all the data you need, critical decision-making is clouded.
Solving the Unstructured Data Conundrum
Dealing with the unstructured data problem in SharePoint is no small matter. Converting both legacy and ingested data into content with value is a great start. However, applying manual methods to process that data further will leave your organization with a bigger operational challenge than you started with.
Take, for example, one research organization that deployed Microsoft SharePoint to manage its research papers, reports, and materials received from external sources. The organization moved its massive collection of existing content to SharePoint and continued to add new material to its knowledge base.
Unfortunately, most of the legacy content was in image-only PDF format, making it impossible for SharePoint to index the content so users could find it. The organization’s initial solution was to process the material manually. Someone would review each document and add keywords to the document metadata that would be picked up by the search index.
What the research organization discovered was that its manual process was too resource-intensive and expensive, and that relying on human input also introduced errors that corrupted downstream processes. What organizations facing these challenges need is automated OCR software that can convert unstructured data in SharePoint into searchable, usable, high-fidelity content at scale.
Finding the Right OCR for SharePoint
When looking for the right OCR software solution, consider the following:
A. Operate at Enterprise Scale
An automated OCR solution needs to be able to operate at an enterprise level. Such a solution should have the ability to convert millions of unstructured documents into readable, searchable PDFs without manual intervention. And it should be able to do so without interfering with the computing resources needed for ongoing operations (working overnight to process millions of documents, for example).
B. Handle Many File Formats
Functional automated OCR software must also be able to deal with the dozens (or more) of different file formats that may reside in SharePoint repositories. It should be able to handle scanned paper documents, as well as born-digital data that’s not searchable, like emails and image files. Moreover, it should be able to convert “mixed” format data like text files with embedded images.
C. Deliver High Accuracy
The best SharePoint OCR solutions also achieve very high levels of accuracy in the document conversion process. Some solutions may achieve 90 percent accuracy, but a solution that reaches 98 percent or more gives the organization far more confidence in its results. For a bank analyzing contracts, for instance, a 90 percent accuracy rate opens the organization up to an unacceptable level of risk.
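To put the gap between 90 and 98 percent in concrete terms, here is a rough back-of-the-envelope sketch. It assumes accuracy is measured per character and that a page holds about 2,000 characters. Both are illustrative assumptions, not figures from any particular OCR product.

```python
def expected_errors(total_chars: int, accuracy: float) -> int:
    """Expected number of misrecognized characters at a per-character accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(total_chars * (1.0 - accuracy))


# Assumption for illustration only: roughly 2,000 characters on a page.
CHARS_PER_PAGE = 2000

for accuracy in (0.90, 0.98):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors} wrong characters per page")
```

Under these assumptions, moving from 90 to 98 percent cuts the expected errors on a page from about 200 to about 40, which is the difference a contract reviewer would feel in practice.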
D. Add Value
Ultimately, OCR is one step in the process of adding value to existing content. OCR converts flat image files into readable and searchable data, but a good process goes further—it creates metadata that enables search engines to deliver up the most relevant content to searchers.
Consider a scanned invoice, for example. If the process is able to identify from contextual clues that the document is an invoice—and define the document type as “invoice” in the metadata—then it’s more likely to be found at the top of the search results. If the process can then pull values from the file (such as invoice amount, date, etc.), that piece of content has much more value to the organization.
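The invoice scenario can be sketched in a few lines of Python. This is a minimal illustration, not how any specific OCR product works: the keyword list, the regular expressions, the metadata field names, and the sample text are all made up for the example, and a production system would likely use far more robust classification.

```python
import re

# Hypothetical contextual clues that suggest a document is an invoice.
INVOICE_HINTS = ("invoice", "amount due", "bill to")

# Simple illustrative patterns; real invoices vary far more than this.
AMOUNT_RE = re.compile(
    r"(?:total|amount due)\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
)
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")


def extract_metadata(ocr_text: str) -> dict:
    """Classify OCR output and pull basic invoice values into metadata."""
    lowered = ocr_text.lower()
    hits = sum(1 for hint in INVOICE_HINTS if hint in lowered)
    # Require at least two clues before calling the document an invoice.
    doc_type = "invoice" if hits >= 2 else "unknown"
    metadata = {"document_type": doc_type}
    if doc_type == "invoice":
        amount = AMOUNT_RE.search(ocr_text)
        date = DATE_RE.search(ocr_text)
        metadata["invoice_amount"] = (
            amount.group(1).replace(",", "") if amount else None
        )
        metadata["invoice_date"] = date.group(1) if date else None
    return metadata


# Made-up OCR output for a scanned invoice.
sample = """ACME Supplies
INVOICE #1042
Date: 05/29/2014
Bill To: Example Research Institute
Amount Due: $1,250.00
"""

print(extract_metadata(sample))
```

The resulting dictionary is the kind of metadata that could be written back to the document’s properties so that search can filter on document type, amount, or date.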
Getting the most out of your SharePoint investment means effectively dealing with the problem of legacy and newly ingested unstructured data, and that requires a robust OCR solution. Implementing the right SharePoint OCR solution will save time and resources, focus knowledge workers on the most valuable tasks, and improve decision-making and customer service.