Structured, unstructured, and everything in between
By Roger Beharry Lall | December 9, 2016
Following up on Rupin’s recent blog post on unstructured content and the nomenclature challenges thereof, I thought I’d dig a little bit deeper and add some more confusion to an already muddy topic.
So certainly, as Rupin pointed out, we know what structured content is all about: it’s your SQL-bound data sets that live in organized systems like ERP. It’s easily extracted, organized, and ripe for analytics.
Unstructured content, on the other hand, refers to things like word processing docs, well logs, contracts, submissions, and the like. These are not database-friendly assets, and they require a degree of Content Elevation, or other such treatments, in order to find, filter, and focus on the relevant information contained within.
But is it really black and white? Is content just structured and unstructured, or are there, with apologies to fiction fans, 50 Shades of Grey? To this end, I offer 3 interesting areas to consider:
- Highly unstructured content: At the extreme end of unstructured, we see organizations grappling with what I might call highly unstructured content. Stuff like social media posts, random paragraphs (like this blog post!), and even perhaps the content of email. Here the very idea of structure is absent, making it that much harder, but not impossible, to analyze. It turns out that within the chaos, you can find some amount of order by using text analytics and natural language processing technologies, and applying noun-verb breakdowns, sentence order, sentiment indicators, word frequency/predictions, and other techniques to gain insights. Suddenly patterns emerge, and structure seems to appear. The challenge, though, is that as you go deeper down this rabbit hole, the results get less and less objective. “Cool” can mean a Canadian winter, an off-putting temperament, or Fonz-like awesomeness. While there are a number of evolving technologies in this space, there remain significant system training requirements, and the results are increasingly spurious.
- Semi-structured data: Somewhere in the middle, we might think of semi-structured data – the archetypical example being forms. These start off looking like fairly structured data: 10 defined fields, database integration, no big deal. The problem, as we found out during a recent POC with an insurance customer, is that those 10 fields are never quite where you think they should be! Forms multiply across language, version, region, policy, format, paper size, etc. And all of a sudden “Name” in the upper right becomes “Nom” in the lower left, and then 2 fields of “nombre de pila / apellido” somewhere in the middle. Being able to understand this kind of unstructured structure is where file analysis and extraction technologies come into play.
- Data vs. content: At the other extreme, in the weeds of structured content, is the notion of structured and unstructured data. As if we weren’t confused enough with structured/unstructured content (see Rupin’s post for a great explanation), the data we extract, whether it comes from a structured database or some unstructured document, can itself have a range of structure. My personal example (and pet peeve) is a date stamp in Microsoft Excel. An Excel sheet contains structured content... but the data within it may be poorly structured. So, for example, is my birthday 07/08 or 08/07*? Where is the structure? Has it been applied properly? Can the next system interpret it accordingly? Certainly, rapidly growing data preparation and data validation technologies help address this challenge, along with good policy and enforcement.
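To make the date-stamp pet peeve concrete, here is a minimal Python sketch of the kind of check a data validation step might run. It is an illustration only, not any particular product's method: the function names `interpretations` and `is_ambiguous` are hypothetical, and the default year of 2016 is an assumption added so that a two-part stamp like 07/08 can be parsed at all.

```python
from datetime import date, datetime


def interpretations(stamp: str, year: int = 2016) -> set:
    """Return every valid calendar date a two-part NN/NN stamp could mean.

    The year is an assumption for illustration; the stamp itself carries none.
    """
    results = set()
    # Try month-first (common in the US) and day-first (common elsewhere).
    for fmt in ("%m/%d/%Y", "%d/%m/%Y"):
        try:
            results.add(datetime.strptime(f"{stamp}/{year}", fmt).date())
        except ValueError:
            # This reading is not a real date (e.g. month 13), so skip it.
            pass
    return results


def is_ambiguous(stamp: str, year: int = 2016) -> bool:
    """A stamp is ambiguous when the two readings give different dates."""
    return len(interpretations(stamp, year)) > 1


if __name__ == "__main__":
    for stamp in ("07/08", "07/13", "08/08"):
        print(stamp, sorted(interpretations(stamp)), is_ambiguous(stamp))
    # An ISO 8601 string removes the guesswork for the next system.
    print(date(2016, 7, 8).isoformat())
```

Under these assumptions, 07/08 comes back with two candidate dates (July 8 and August 7), while 07/13 and 08/08 each resolve to one. Writing the value out in ISO 8601 form (year-month-day) is one common way to hand the next system something it cannot misread.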
So where does that leave you? Regardless of where you sit in the organizational structure, there’s lots of opportunity if you can navigate this convoluted world of information. My own advice would be to look for opportunities that result in solid wins for the business, but require minimally complex investments and installations.
If you’re a certified database architect, then fine, go chase data structures. Similarly, if you hold a doctorate in mathematics and have a penchant for linguistics, then perhaps dig into the semantic side of things. But for the rest of us, somewhere around 90% of organizational information is unstructured content that is constantly being ignored, underused, and not leveraged to its full potential. Technologies like file analytics can help organizations like yours take advantage of that low-hanging, and potentially high-value, fruit.
*Note: My birthday is, in fact, July 8th. However, if you would like to buy me a backup gift on August 7th, I would have no complaints – structured or otherwise!