How File Analytics Overcomes Legacy Challenges of Document Classification

Posted 22 October 2018 1:00 PM by Chris Hibbits

For as long as businesses have been using computers, they’ve been trying to determine the best way to classify the massive amounts of data stored within their machines and networks, in order to more efficiently and effectively drive operations and produce actionable business insights. Document classification is also crucial for reducing compliance risks and other issues, helping companies to spot risky documents – such as those that contain personally identifiable information – so that they can be handled appropriately.

A look at legacy methods of document classification reveals some of the key challenges businesses have faced in the past, and how newer methods overcome these problems.

Natural language processing: The earliest days of classification

Though still in use today, the earliest automatic data classification technique dates back to a simpler era, when I Love Lucy graced TVs every Monday night and Buddy Holly songs wafted through the airwaves. Called natural language processing, this automated document classification technique, in its classic form, uses a pre-programmed set of dictionary terms to extract information from the content of text documents. Early systems relied on hard-coded “if-then” rules to sort content into categories and clusters by theme; starting with the introduction of machine learning in the 1980s, data scientists began pairing natural language processing with machine learning algorithms.

While natural language processing can effectively read syntax and assign meaning based on combinations of words, it lacks the ability to integrate other important classification signals, such as metadata, file type, author and title, that can yield a more sophisticated and useful understanding of content. Natural language processing is also limited by the inputs or instructions it is given, for example, the dictionaries and sorting rules it was programmed to follow.
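
The dictionary-and-rules approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of any particular product: the category names, the term lists in `CATEGORY_TERMS` and the `classify` function are all hypothetical.

```python
import re

# Hypothetical category dictionaries; a real system would use far larger term lists.
CATEGORY_TERMS = {
    "finance": {"invoice", "payment", "budget"},
    "legal": {"contract", "plaintiff", "lawsuit"},
}


def classify(text):
    """Return the category whose dictionary terms appear most often, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        category: sum(1 for word in words if word in terms)
        for category, terms in CATEGORY_TERMS.items()
    }
    best = max(scores, key=scores.get)
    # Hard "if-then" rule: no dictionary hit means no category at all.
    return best if scores[best] > 0 else None
```

A document about a topic that is missing from the dictionaries comes back as `None`, which is the limitation in miniature: the classifier can only find what it was told to look for.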

Fast forward to the 1990s

As computing went mainstream in the 1990s, new ways of classifying and managing data emerged. Unlike natural language processing, which used clues that existed within documents, these new methods used signals that described the files themselves to help organize information.

E-discovery platforms, used within the legal sector to help organize documents related to lawsuits, use a method called predictive coding, which combines searches and filtering based on metadata, to sort documents and weed out the ones that aren’t relevant. Because the system has to be taught by example, this type of classification first requires the manual sorting of a sample volume of documents (potentially in the thousands) in order to teach the computer what relevant documents (called responsive) and irrelevant documents (called unresponsive) look like. Another classification technique, called storage-centric file analysis, uses a set of file-level descriptors (including author, file type or format, title and date) to identify duplicate information, which was very important back then, when digital storage was at such a premium.
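
As a rough sketch of the storage-centric idea, the Python below groups files whose file-level descriptors match and reports each group as a set of likely duplicates. The `FileRecord` fields and the `find_duplicates` function are assumptions made for illustration; the choice of which descriptors count as a match is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical file-level descriptors; the content itself is never read."""
    path: str
    author: str
    file_type: str
    title: str
    date: str


def find_duplicates(records):
    """Group files whose descriptors match; return groups with more than one file."""
    groups = defaultdict(list)
    for record in records:
        # Normalize the title so trivial differences in case or spacing still match.
        key = (record.author, record.file_type, record.title.strip().lower(), record.date)
        groups[key].append(record.path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Because only descriptors are compared, two different documents that happen to share an author, type, title and date would be flagged as duplicates, while identical content saved under a new title would be missed.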

Challenges and limitations of these legacy techniques

As businesses have become increasingly reliant on data to help automate processes and develop meaningful insights and intelligence, the classification of data has become more important than ever. However, while these legacy methods of document classification were often effective enough to serve a limited purpose, none yielded the sophisticated classification that is the basis of next-gen data utilization.

For starters, none of these older techniques takes a sufficiently multifaceted approach to analyzing content, so none yields the granular and highly specific data required to produce groundbreaking insights. More importantly, because documents are classified based on human inputs, the results are inherently subject to bias.

Enter file analytics

Unlike legacy data classification methods, file analytics is a machine-based process that automatically crawls the document contents themselves as well as the corresponding metadata and other structural information, yielding a richer document description. This helps to more effectively weed out redundant files and provides a greater sense of confidence in the information within, which is crucial if you are going to use it for important business decisions.

File analytics also comes with the added benefit of eliminating human bias from the data classification process: because it is machine led, its analysis is not confined to inputted terms and algorithms. Practically speaking, if a legacy algorithm has been trained to classify documents based on an inputted list, the machine only knows to look for that specific information. With file analytics, the content is analyzed from a fresh perspective, allowing the machine to determine what content and themes are present (automatically assigning metadata accordingly) rather than just looking for the signals it was instructed to find.
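
To make the contrast concrete, here is a minimal Python sketch of a description built from both the content and the existing metadata. It is an illustration under stated assumptions, not how any file analytics product actually works: simple term frequency stands in for whatever content analysis a real system performs, and the `describe` function, the `STOP_WORDS` list and the output field names are all hypothetical.

```python
import hashlib
import re
from collections import Counter

# Assumed stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "are"}


def describe(content, metadata, top_n=3):
    """Build a document description from the content itself plus existing metadata."""
    words = [w for w in re.findall(r"[a-z]+", content.lower()) if w not in STOP_WORDS]
    # Themes come from the document's own most frequent terms, not from a supplied list.
    themes = [term for term, _ in Counter(words).most_common(top_n)]
    return {
        **metadata,
        "themes": themes,
        # A content fingerprint lets byte-identical files be spotted regardless of name.
        "fingerprint": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }
```

Unlike a fixed dictionary, nothing here names the themes in advance: they are read off the document, then stored alongside the metadata it already carried.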

Wrap up

As companies become more reliant on data to fuel business operations and growth, it’s crucial that they first understand what data they possess. While data classification isn’t new, file analytics allows businesses to classify their documents more effectively and efficiently than they could using legacy methods, giving them greater accuracy and confidence in the information upon which important decisions are based.
