Data Extraction Techniques to Mitigate Unstructured Data Challenges

September 4, 2018


You wouldn’t expect to win at poker if you could only see half of the cards in your hand. Unstructured data presents businesses with the same challenge: making decisions without all of the relevant information. The best way for companies to meet this challenge is to implement an effective data extraction process. Doing so gives them access to data that had previously been dormant, creating larger, more complete data sets that increase the quality of business insights.

By applying an automated data extraction process to unstructured data, enterprises can quickly find and prepare all of the data they need for any analytics project.

In a previous post we looked at how data extraction techniques form the backbone of successful data preparation efforts. In this post we’ll drill down and examine the five steps required to run an effective process for extracting data.

Lessons learned from big data analytics

Over the past few decades, big data analytics has shown us that data preparation is the most time-consuming and important part of any analytics project. A recent survey by CrowdFlower reveals that data scientists estimate 80% of the work on a project is just preparing the data for analytics. And keep in mind, that’s when you’re dealing with structured data: the orderly, readable kind that comes from point-of-sale information, loyalty program data and similar sources.

Even when your data is structured, you still have to prepare it for analytics. Part of the prep work is defining the associations and correlations between the information you’re searching for. For example, if you have rows and columns of data, in order to analyze it, you will need to match it with the rows and columns of the other data it's associated with. In order for data sets to be what is called “transcodeable”, they have to align enough that they can be sensibly and clearly compared and contrasted. They don't have to include the exact same number of rows and columns but they do have to at least share a basic common structure.
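To make the idea of a shared common structure concrete, here is a minimal sketch in Python. It assumes two small, row-oriented data sets that share one column; the names `join_on_key`, `pos_rows`, `loyalty_rows` and `customer_id` are illustrative and not taken from any particular product.

```python
def join_on_key(left, right, key):
    """Inner-join two lists of dict rows on a column they both share."""
    # Index the right-hand rows by the shared key for quick lookup.
    index = {row[key]: row for row in right if key in row}
    joined = []
    for row in left:
        if key not in row:
            continue  # a row without the shared key cannot be aligned
        match = index.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Hypothetical sample data: the two sets differ in size and columns,
# but both carry a customer_id, which is enough to line them up.
pos_rows = [
    {"customer_id": 1, "total": 42.5},
    {"customer_id": 2, "total": 10.0},
    {"customer_id": 4, "total": 7.25},
]
loyalty_rows = [
    {"customer_id": 1, "tier": "gold"},
    {"customer_id": 2, "tier": "silver"},
    {"customer_id": 3, "tier": "bronze"},
]

joined = join_on_key(pos_rows, loyalty_rows, "customer_id")
```

Only the customers present in both sets survive the join, which is why a missing or inconsistent key column makes two data sets hard to compare.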

Another complication is that the data sets have to be complete and accurate—for instance, if one data set has every third variable missing, the analytics tend to break down. Your system is trying to read a field that's blank or contains incorrect data, which generates errors. If there are enough of these errors in the data sets you’re working on you quickly get mired in a ‘data swamp.’
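One way to spot this problem before it reaches the analytics engine is to count blank fields up front. The sketch below is an assumption about how such a check might look; `missing_counts` and the sample field names are hypothetical.

```python
def missing_counts(rows, required_fields):
    """Count, per required field, how many rows are missing or blank."""
    counts = {name: 0 for name in required_fields}
    for row in rows:
        for name in required_fields:
            value = row.get(name)
            # Treat absent keys, None, and whitespace-only strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                counts[name] += 1
    return counts


# Hypothetical sample rows with gaps in the email column.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "   "},
    {"id": 3},
]

report = missing_counts(rows, ["id", "email"])
```

A report like this makes it possible to decide whether a data set is complete enough to use, or needs repair first.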

The 5 key data extraction techniques

Now consider the data preparation effort required when enterprises set out to perform analytics on unstructured data. The first challenge is making sure your company’s data sets are intelligible at all—that the files are even readable. That can be difficult when the majority of data is unstructured (emails, nested emails, TIFFs, CAD files, etc.). This “dormant” or “dark” data cannot be read by an analytics engine and so, in order to run a successful data analysis project using unstructured data, the first step is to prepare the content using an automated data extraction process.

To see how that would work let’s look at the five steps involved in a good extraction process.

Step One: Ingest

The first step in the data extraction process is to ingest all of the required data. This means each of the relevant systems must be identified and readied to be digitally crawled. The process can’t be comprehensive if it leaves out data held in file shares, finance systems, enterprise content management (ECM) systems, cloud-based content or ‘unsanctioned’ repositories.

Keep in mind we’re talking about relevant systems here. The company may not always have to find every instance of a particular piece of data. Finding every copy can be crucial when dealing with personally identifiable information (PII), but it is often less important on a smaller research project in which one complete data set is all that’s needed and copies of the data don’t need to be retained.
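As a rough illustration of the ingest step, the sketch below crawls one or more directories and records basic facts about each file. It assumes the repositories are reachable as ordinary file system paths, which will not be true of every finance system or ECM; the function names `crawl` and `one_copy_per_content` are made up for this example.

```python
import hashlib
import os
from pathlib import Path


def crawl(roots):
    """Walk each root directory and yield one record per file found."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = Path(dirpath) / name
                yield {
                    "path": str(path),
                    "suffix": path.suffix.lower(),
                    "size": path.stat().st_size,
                    # The content hash lets identical copies be recognized later.
                    "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
                }


def one_copy_per_content(records):
    """Keep the first record seen for each distinct content hash."""
    seen = {}
    for record in records:
        seen.setdefault(record["sha256"], record)
    return list(seen.values())
```

A PII project would keep every record that `crawl` returns, since each copy matters. A smaller research project could pass the records through `one_copy_per_content` and work with a single copy of each file.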

Step Two: Convert

A project won’t have access to all of its required content if the 20% that’s sitting in TIFFs or other unreadable formats is ignored or forgotten about. So, once all the necessary data has been ingested, the next step is to assess the content by applying intelligent recognition. This means examining each piece of data to see if it is already discoverable. If it is not, it needs to be converted to a readable and searchable format.

To do that, companies use intelligent recognition to examine the data and apply optical character recognition (OCR) to it. OCR reads the text out of an image of the content, which makes the content text searchable.
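The assessment part of this step can be sketched as a simple triage. The example below decides, from the file extension alone, whether a file can be indexed as is, needs OCR, or needs a closer look. The extension lists and the name `conversion_plan` are assumptions made for illustration, and the OCR engine itself is deliberately left out.

```python
from pathlib import Path

# Hypothetical triage lists; a real deployment would tune these.
TEXT_SEARCHABLE = {".txt", ".csv", ".html", ".xml", ".json"}
IMAGE_ONLY = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp"}


def conversion_plan(path):
    """Return 'index', 'ocr', or 'inspect' for a single file path."""
    suffix = Path(path).suffix.lower()
    if suffix in TEXT_SEARCHABLE:
        return "index"    # already discoverable, no conversion needed
    if suffix in IMAGE_ONLY:
        return "ocr"      # image content, text must be recognized first
    return "inspect"      # e.g. a PDF, which may or may not hold a text layer


plans = {p: conversion_plan(p) for p in ["notes.txt", "scan.TIFF", "report.pdf"]}
```

Extensions are only a first guess. Files in the "inspect" group would need their contents examined before deciding whether OCR is required.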

Step Three: Classify

Once the data has been converted, the next step is to classify it. Each piece of data needs to fit into a logical, accurate category. Once the categories are set up, the identifier for a document’s class can be written into the document’s metadata. This classification means that, going forward, it will always be clear what the document is and where it fits into the company content structure, no matter where the document actually resides.
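Here is one simple way the classify step might look in code, assuming keyword matching is good enough to tell document classes apart. Real classifiers are usually more sophisticated; the class names, keywords and the `doc_class` metadata field are all hypothetical.

```python
# Hypothetical classes and the keywords that hint at each one.
CLASS_KEYWORDS = {
    "invoice": ("invoice", "amount due", "bill to"),
    "claim": ("claim number", "claimant", "policy"),
    "contract": ("agreement", "party", "hereby"),
}


def classify(text, class_keywords=CLASS_KEYWORDS):
    """Pick the class whose keywords appear most often in the text."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in class_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"


def tag_document(doc):
    """Write the class into the document's metadata so it travels with it."""
    doc.setdefault("metadata", {})["doc_class"] = classify(doc["text"])
    return doc


doc = tag_document({
    "path": "/share/a.txt",
    "text": "Invoice 1001. Amount due: $50. Bill to: ACME",
})
```

Because the class lives in the metadata rather than in a folder name, the document stays identifiable wherever it is later moved.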

Step Four: Identify

The next step in the data extraction process is to apply regular expressions (regex) to look for one or more specific entities within the content at the same time. A regex search is essentially an advanced search technique in which the desired search entities are defined in advance as patterns. These entities could be a particular term or a particular string, such as a social insurance number or a credit card number. Regular expressions usually cover all of the advanced searches companies need to perform on their data.

Once all of the required regular expressions have been created, the entire data set can be searched for those entities.
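A short sketch of the identify step, using Python's standard `re` module. The two patterns are simplified assumptions: a social insurance number is treated as nine digits in groups of three, and a credit card number as 13 to 19 digits. The Luhn checksum is an extra filter added here to cut down false positives on card numbers; it is not something the steps above require.

```python
import re

# Simplified, illustrative patterns; real rules would be stricter.
SIN_RE = re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{3}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_entities(text):
    """Return the SIN-like and card-like strings found in the text."""
    return {
        "sin": SIN_RE.findall(text),
        "credit_card": [m for m in CARD_RE.findall(text) if luhn_valid(m)],
    }


sample = "Customer SIN 046 454 286 paid with card 4111 1111 1111 1111."
found = find_entities(sample)
```

Both patterns run over the same text in one call, which is what makes it possible to look for several entities at once.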

Step Five: Extract

Completing the first four steps creates a smart system that identifies all the necessary data. The final step is extraction. Extraction takes place within a rules-based framework—meaning that rules are set up for each piece of data so that, once it is found, an action can be performed on it.

For instance, a search for the words ‘claim number’, ‘claim *’ or ‘claim#’ could be programmed to cause any matching document to be automatically migrated to a specialized repository that holds all of the claims information. Or, if a document contains a social insurance number and sits outside the corporate group that needs to see social insurance numbers, it will be assigned to a workflow that redacts that number. The goal is to remediate sensitive content in place, before moving the document anywhere else.
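The rules-based framework can be pictured as a list of condition and action pairs. The sketch below is one possible shape for it, not a description of any specific product; the `Document` class, the group names and the repository name are all hypothetical. Redaction is listed first so that sensitive content is handled before the document is routed anywhere.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

SIN_RE = re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{3}\b")
CLAIM_RE = re.compile(r"\bclaim(?:\s+number|\s*#)", re.IGNORECASE)

# Hypothetical: the only group allowed to see social insurance numbers.
SIN_AUTHORIZED_GROUPS = {"payroll"}


@dataclass
class Document:
    text: str
    owner_group: str
    destination: Optional[str] = None
    actions: list = field(default_factory=list)


def redact_sin(doc):
    """Replace every SIN-like string in place."""
    doc.text = SIN_RE.sub("[REDACTED]", doc.text)
    doc.actions.append("redacted_sin")


def route_to_claims(doc):
    """Mark the document for migration to the claims repository."""
    doc.destination = "claims-repository"
    doc.actions.append("routed_to_claims")


# Each rule pairs a condition with the action to run when it holds.
RULES = [
    (lambda d: bool(SIN_RE.search(d.text))
     and d.owner_group not in SIN_AUTHORIZED_GROUPS, redact_sin),
    (lambda d: bool(CLAIM_RE.search(d.text)), route_to_claims),
]


def apply_rules(doc, rules=RULES):
    """Run every rule whose condition matches, in order."""
    for condition, action in rules:
        if condition(doc):
            action(doc)
    return doc


doc = apply_rules(Document("Claim# 8841 filed by holder of SIN 046-454-286", "marketing"))
```

New behavior is added by appending another condition and action pair, so the rule list can grow without changing `apply_rules`.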

Wrap Up

A data extraction solution that includes ingestion, conversion, classification, identification and extraction gives you a known set of identified values that you can use for a variety of outcomes. Using data extraction techniques within the process, you can automatically prepare all the unstructured data you require for any analytics project. Being able to search for, read and analyze that wealth of previously unstructured data will generate higher-value insights into your business and your customers.
