The goal of capturing data is to transform information from all sources into a format that can be automated and analyzed, improving analytics and increasing efficiency. But determining the best data capture methods to implement is an evolving practice.
Before you can consider the processes involved in developing an effective capture strategy, it’s important to understand what’s at stake. If your capture process is not optimal, the business risks are real. For instance, if your capture system is manual, or if the information you’re capturing isn’t accurate, the process will be inefficient and expensive, and your project may fail.
In some cases, the hazards are more serious than inefficient processes and unsuccessful projects. For example, if a pharmaceutical company has a faulty data capture process and H2O is incorrectly translated as HO, the consequences could be deadly. If an insurance company’s capture methods result in dollar signs being interpreted as the number 4, a $42 million expense suddenly turns into a 442-million-dollar payout.
It’s important to get the capture process right. Doing so means first understanding how methods are changing, and then following four key strategies for developing an effective data capture system.
The evolution of data capture
The most common capture method for dealing with unstructured data originated at a time when most data inputs were received in paper form. To handle this, vendors working with data that was not machine-readable devised capture methods that first digitized the information before it could be worked on.
Commonly, this digitization stage involved scanning the document to turn it into a digital file—a picture of the content—then using Optical Character Recognition (OCR) to interpret the pixels in the image and try to make sense of them by turning them into text. Finally, text analysis could be used to make that text meaningful. The resulting file would then be ready for analysis. For a variety of reasons, this process can result in inaccuracies and errors. For instance, if the platen of the scanner is not perfectly clean, the “dirt” can introduce errors into the digital version of the data. Additionally, OCR isn’t 100 percent accurate, which can create even more errors.
The good news is that the capture of data is moving beyond the old paper-first paradigm. More and more data is arriving in digital form and is not subject to those old issues. And, as the type of data we receive evolves, so should our thinking about the data capture process.
Here are four tips for developing an effective data capture system in the new paper-reduced paradigm.
Tip #1: Use a digital-first methodology
The first best practice in capturing unstructured data is to keep digital files digital whenever possible. A digital-first process avoids introducing errors, saves time and effort in processing, and retains the context of the content by preserving name-value pairs (think of a field in an invoice with an amount in it—the value—and a label like “total” which tells you what it means).
When designing your data capture methodology, look for a vendor who focuses on digital first. Most vendors have taken a traditional paper-first approach and have just applied that to their capture methods, which means flattening even digital documents into images and using OCR to reinterpret the data. If the data is already digital, it’s much easier to work with, since there’s zero loss of accuracy, the completeness of the data is preserved, and there’s no loss of context.
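As a minimal sketch of what preserving name-value pairs can look like, assume a digital invoice arrives as JSON. The payload and its field names below are hypothetical, chosen only for illustration. Reading the file directly keeps each value attached to its label, so nothing has to be re-interpreted from pixels.

```python
import json
from decimal import Decimal

# Hypothetical digital invoice; the field names are illustrative assumptions.
raw = '{"invoice_number": "INV-001", "currency": "USD", "total": "42000000.00"}'


def read_invoice(payload: str) -> dict:
    """Parse a digital invoice, keeping each value attached to its label."""
    fields = json.loads(payload)
    # Convert the amount to an exact decimal so no rounding is introduced.
    fields["total"] = Decimal(fields["total"])
    return fields


invoice = read_invoice(raw)
print(invoice["currency"], invoice["total"])
```

Because the label "total" travels with the value, there is no step at which a currency symbol could be misread as a digit.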
Tip #2: Make sure your digitization process is appropriate for later stages
If digitization is happening as part of the data capture process, it’s important to understand the parameters and make sure that the way in which data is digitized accommodates steps that will have to be taken later in the capture process.
If your digitization methods don’t consider work that later needs to be done on the data, problems can arise. For example, one large project involved digitizing a national archive library’s war diaries from World War I. This was done by scanning the diaries to create low-resolution thumbnails meant to give the reader an idea of what each page looked like. Unfortunately, it later became difficult to pull useful information from those files because the resolution was too low for OCR to work without a lot of errors cropping up.
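One way to catch this kind of mismatch early is to check scan resolution against the needs of later stages before digitizing in bulk. The sketch below is a rough illustration only: the function names are hypothetical, and the 300 DPI threshold is a commonly cited rule of thumb for OCR, not a figure from the project described above.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Estimate scan resolution from image width and physical page width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def suitable_for_ocr(pixel_width: int, page_width_inches: float,
                     min_dpi: float = 300.0) -> bool:
    """Return True if the scan meets the assumed minimum resolution for OCR."""
    return effective_dpi(pixel_width, page_width_inches) >= min_dpi


# A thumbnail 200 pixels wide of an 8-inch page is only 25 DPI.
print(suitable_for_ocr(200, 8.0))
# A 2550-pixel-wide scan of an 8.5-inch page is 300 DPI.
print(suitable_for_ocr(2550, 8.5))
```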
Tip #3: Understand all the ways unstructured content is presented
One of the big challenges in data capture is to understand all the ways that unstructured data is presented. For instance, different countries have different conventions when it comes to how they format dates or how they represent decimals or currencies. When designing your data capture process, it’s crucial to make sure you have the means to accommodate those variants.
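To illustrate, here is a small sketch of normalizing amounts and dates that follow different national conventions. The `CONVENTIONS` table and the two convention keys are assumptions made for this example; a real system would need to cover many more variants and decide how the convention of each source is detected.

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical table: (thousands separator, decimal separator, date format).
CONVENTIONS = {
    "en_US": (",", ".", "%m/%d/%Y"),
    "de_DE": (".", ",", "%d.%m.%Y"),
}


def parse_amount(text: str, convention: str) -> Decimal:
    """Convert a locally formatted amount into an exact Decimal."""
    thousands, decimal_sep, _ = CONVENTIONS[convention]
    cleaned = text.strip().replace(thousands, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


def parse_date(text: str, convention: str) -> date:
    """Convert a locally formatted date string into a date object."""
    _, _, fmt = CONVENTIONS[convention]
    return datetime.strptime(text.strip(), fmt).date()


# The same digits mean different things under different conventions.
print(parse_amount("1.234,56", "de_DE"), parse_amount("1,234.56", "en_US"))
print(parse_date("03/04/2021", "en_US"), parse_date("03.04.2021", "de_DE"))
```

Note that "03/04/2021" and "03.04.2021" resolve to different days (March 4 and April 3), which is exactly the kind of ambiguity the capture process has to settle explicitly.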
A simple example of problems created by not anticipating all the ways content can be presented comes from scanned paper invoices. It’s possible for software to specify X and Y coordinates on a page to identify where the “invoice total” field will be. However, people don’t always put their pages in the scanner perfectly straight, and that causes problems in locating data. So, rather than using fixed coordinates, it’s better to create anchors on a page of data so that information can be located relative to the anchors, which keeps extraction reliable even when a page is shifted or slightly skewed.
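The anchor idea can be sketched as follows. This is an assumption-laden illustration, not a description of any particular product: it supposes that an earlier step has already produced words with positions (the `Word` type and `value_right_of` function are hypothetical), and it simply picks the nearest word to the right of the anchor label on roughly the same line.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # vertical centre of the word on the page


def value_right_of(words: List[Word], anchor_text: str,
                   y_tolerance: float = 10.0) -> Optional[str]:
    """Return the nearest word to the right of the anchor, on about the same line."""
    anchor = next((w for w in words if w.text.lower() == anchor_text.lower()), None)
    if anchor is None:
        return None
    candidates = [w for w in words
                  if w.x > anchor.x and abs(w.y - anchor.y) <= y_tolerance]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.x - anchor.x).text


page = [Word("Invoice", 50, 100), Word("Total", 400, 700), Word("1,250.00", 480, 703)]
# Simulate a page that sat off-centre in the scanner: every word moves together.
shifted = [Word(w.text, w.x + 30, w.y + 45) for w in page]

print(value_right_of(page, "Total"), value_right_of(shifted, "Total"))
```

Because the lookup is relative to the anchor, the shifted page yields the same value, whereas a fixed coordinate lookup would now point at the wrong spot. The tolerance only absorbs small skew; heavily rotated pages would still need deskewing first.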
Tip #4: Think carefully about who will use the data
Using naming conventions and labels that will be meaningful to the anticipated users of your data is critical. Sometimes the data capture process breaks down because the organization engages different types of staff throughout the entire process, each with varying needs and expectations.
Consider a field containing a number that represents temperature. If that number is 20, it could be either really cold or quite mild, depending on whether you’re using Fahrenheit or Celsius. So, you need to include the unit of measurement in the name of the field. If you don’t, people tend to make assumptions based on where they live and what their experience has been, and that can lead to errors.
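A brief sketch of this practice, using hypothetical field names: the unit is part of the field name, records are normalized to one unit, and a field with no stated unit is rejected rather than guessed at.

```python
def fahrenheit_to_celsius(value_f: float) -> float:
    """Convert a Fahrenheit reading to Celsius."""
    return (value_f - 32.0) * 5.0 / 9.0


def normalize_temperature(record: dict) -> dict:
    """Return a copy of the record with a single 'temperature_celsius' field."""
    if "temperature_celsius" in record:
        return dict(record)
    if "temperature_fahrenheit" in record:
        out = {k: v for k, v in record.items() if k != "temperature_fahrenheit"}
        out["temperature_celsius"] = fahrenheit_to_celsius(
            record["temperature_fahrenheit"])
        return out
    # A bare "temperature" field is ambiguous, so refuse to guess the unit.
    raise KeyError("no temperature field with an explicit unit")


print(normalize_temperature({"site": "A", "temperature_fahrenheit": 68.0}))
```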
Consistency is key, both in how we describe the information that we’re capturing, and also in the format and naming. Normalization of the presentation of data that’s captured enables automation and analysis, making the data useful for all.
Developing an effective capture process can dramatically increase the amount of data, especially unstructured data, that an organization can automate and analyze. Getting data capture right creates efficiencies, reduces errors, and improves outputs. Following the four tips for designing an effective data capture process—digital first, using the right digitization methods, considering different content structures, and being aware of who will use the data—will help ensure that your organization reaps the benefits of data capture without risking the potential hazards.