How Unstructured Data Fuels Big Data Analytics
By Scott Mackey | February 5, 2018
Traditionally, big data analytics has relied on structured data. Data analytics, however, doesn’t start and stop with the tidy data that’s locked in the rows and columns of your databases. Organizations can gain a lot of value by harnessing “dark” or unstructured data (think nested and threaded emails, image files, outdated file formats, and paper documents), which makes up as much as 90 percent of the data available to a company.
However, utilizing this colossal corpus of data means first getting a handle on unstructured data analytics—the process by which unstructured data is collected, analyzed, cleaned, categorized, and enhanced for use by automated analytics tools. Keep reading for the nuts and bolts of how this works, and how unstructured content is being used to fuel big data analysis.
Unstructured data management
The technology now exists to effectively (and automatically) process vast volumes of unstructured data and extract meaningful business value from this information through big data analytics. If you think of your business like a refinery and your data like crude oil, data analytics engines allow you to refine that raw material and turn it into the fuel that drives real-world business improvements. In the energy sector, for example, a company may have been purchasing lots of land for test drilling over the course of years. Each of those tests likely generated a lot of data, much of it unstructured (think of all the paperwork around land purchases, surveys, legal documents, and then all the testing procedures and results). All of this data is stored somewhere, but accessing it would require a lot of time, resources, and manual processing. In practice, attempting to access this data would result in an operational nightmare.
When advances in drilling and processing technology make previously undesirable sites suitable for work, the organization faces a challenge. They need to determine which old lots of land would now be potentially profitable drill sites. However, manually searching decades-old records to figure that out would be time-consuming, expensive, and, depending on the company’s record-keeping efficacy, potentially fruitless. In this scenario, what’s needed is a way to automatically conduct a search and convert the historical content into a format that can be processed by an automated analytics engine.
How unstructured data fuels big data analytics
Consider, for example, a global reinsurance company that processes half a billion pages of contracts annually. Because it can automatically process this unstructured content into a format that is usable by its analytics tools, it can feed the contract data into IBM’s Watson and quickly assess risks and trends.
By refining and analyzing unstructured contract data, the company was able to discover which areas generate more claims from natural disasters and combine that with the coverage levels of policyholders in those areas, allowing it to optimize coverage around predicted risks.
Once unstructured data analysis methods are in place, the dark data can be fed into big data analytics tools to find ways to improve the client experience. For instance, a large Scottish bank has a huge unstructured information load. To make matters worse, that content is housed in different divisions of the bank, which manage the data separately. There is no easy way to get a sense of what might be duplicated across business lines. But by applying an unstructured content process, which feeds the newly structured data into its big data analytics tool, the bank can see when a customer has purchased insurance on an account and has also purchased similar insurance on a line of credit on another occasion. As a result, the bank can suggest that the customer consolidate their insurance, saving the client money and increasing satisfaction.
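To make the bank example concrete, here is a minimal sketch of the kind of check that becomes possible once each division's records share one structure. Everything in it is an assumption for illustration: the field names (`customer_id`, `division`, `product`, `cover_type`), the sample values, and the `find_overlapping_cover` helper are hypothetical and do not describe any real bank's systems.

```python
from collections import defaultdict


def find_overlapping_cover(policies):
    """Return customers who hold the same type of cover in more than one division.

    `policies` is a list of dicts using hypothetical field names.
    """
    grouped = defaultdict(list)
    for policy in policies:
        # Group by customer and cover type so holdings from different
        # divisions land in the same bucket.
        grouped[(policy["customer_id"], policy["cover_type"])].append(policy)

    # Keep only the buckets that span more than one division.
    return {
        key: held
        for key, held in grouped.items()
        if len({p["division"] for p in held}) > 1
    }


# Hypothetical records, as they might look after processing.
policies = [
    {"customer_id": "C-100", "division": "retail_banking",
     "product": "current account", "cover_type": "payment_protection"},
    {"customer_id": "C-100", "division": "lending",
     "product": "line of credit", "cover_type": "payment_protection"},
    {"customer_id": "C-200", "division": "lending",
     "product": "line of credit", "cover_type": "payment_protection"},
]

overlaps = find_overlapping_cover(policies)
for (customer_id, cover_type), held in overlaps.items():
    products = ", ".join(p["product"] for p in held)
    print(f"{customer_id}: {cover_type} held on {products}")
```

In this sample only customer C-100 is flagged, because that customer holds the same cover type in two divisions. A real implementation would also need to decide what counts as "similar" insurance, which this sketch reduces to an exact match on `cover_type`.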
The challenges of using unstructured data
Given these benefits, why don’t more organizations, such as the energy company described above (and others that operate within highly regulated industries), implement unstructured data analysis methods to address these critical business issues? Therein lies the challenge.
Because unstructured content represents hundreds of formats spanning generations of applications—often in non-searchable formats and even multiple languages—it can be difficult to see how to process this content into a usable format without throwing dozens of people and millions of dollars at the problem (something few companies have an appetite for when there is no guarantee of success).
Take the case of a large pharma company. The organization has over 5 TB of data in its email system alone, and they know that this content poses a risk since it contains sensitive information. The organization knows it should address the issue, but the challenge of manually looking at all of that data is just too daunting. They don’t have the budget or resources to address the problem, and the issue seems too “nebulous” to tackle—so nothing happens, and the documents keep piling up.
Transforming unstructured data into a format that can be used by big data analytics
The best way for a company to overcome the inertia that huge and complicated volumes of dark data can create is to implement the right unstructured data processing strategy, starting with a few straightforward steps:
Step #1: Take a phased approach
Overcome inertia by reducing the process to bite-sized, achievable milestones. Start with a Proof of Concept focused on a well-defined business process with clear requirements, and then plan for a phased project that tackles separate lines of business one at a time. Focus on the low-hanging fruit—areas that offer maximum value with minimum technical risk—and build on early wins to create momentum in future phases.
Step #2: Source an enterprise-grade solution
The scale of most companies’ dark data challenges requires enterprise-grade tools designed to operate in high-volume situations. The tools need comprehensive capabilities to deal with the broadest collection of content sources and formats, and the platform must be highly configurable to address changing business needs over time.
Step #3: Design and implement the right unstructured data processing method
When it comes to processing unstructured content, the final step is for a company to define the right methodology. The process starts with the automatic removal of all duplicate content and preparing what remains for processing. Next, the content must be standardized to a common, searchable format. Finally, processed content is ready for enhancement and the extraction of values that can be fed into analytics engines.
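The three stages described above (remove duplicates, standardize, extract values) can be sketched in a few lines. This is an illustrative outline only, not a description of any particular product: it assumes the text has already been pulled out of the original files, treats whitespace and case normalization as the "common format," and uses two simple regular expressions as stand-ins for real extraction rules. The `Document` class, the function names, and the sample data are all hypothetical.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Document:
    source: str  # hypothetical label for where the content came from
    text: str    # raw text already pulled out of the original file


def normalize(text):
    """Collapse whitespace and lowercase so equivalent content compares equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs):
    """Keep the first document seen for each distinct normalized body."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc.text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Stand-in extraction rules: dollar amounts and ISO-style dates.
AMOUNT = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)")
DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def extract_values(doc):
    """Turn one document into a flat record an analytics tool could ingest."""
    text = normalize(doc.text)
    return {
        "source": doc.source,
        "amounts": [float(a.replace(",", "")) for a in AMOUNT.findall(text)],
        "dates": DATE.findall(text),
    }


def process(docs):
    """Run the stages in order: deduplicate, then standardize and extract."""
    return [extract_values(doc) for doc in deduplicate(docs)]


sample = [
    Document("email", "Premium of $1,200.50 due 2018-03-01."),
    Document("scan", "premium of   $1,200.50 due 2018-03-01."),
    Document("contract", "Coverage limit $500,000 effective 2017-11-15"),
]
records = process(sample)
```

Here the scanned copy is dropped as a duplicate of the email, leaving two records. Production-scale processing would also have to handle format conversion, OCR, and near-duplicates, none of which this sketch attempts.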
Increasing the volume of quality content being fed into big data analytics tools dramatically increases the value of the output, whether that means improved decision-making, better product design, reduced risk, or an enhanced customer experience. To realize these benefits, however, organizations must develop the capability to process massive storehouses of unstructured data into a format that big data analytics tools can work with.
Although the challenges associated with unstructured data management are not small by any means, the technology exists today to make automated processing possible. Enterprises that implement effective unstructured data analysis methods to feed more and better content into their big data analytics engines are the ones that will see significant competitive advantages.
About the Author
As a senior executive, Scott has spent the last 20 years building Adlib into the thriving organization it is today. Scott has held customer-focused leadership roles spanning success, professional services, marketing, and support. He is passionate about business growth, the human impact of technology, and the pursuit of an ideal customer experience measured in the customers’ terms.