

Managing the Document Life Cycle Effectively Webinar Series – Part 1: Capture

Date: Thursday June 23, 2016
Time: 2:00 PM to 2:30 PM (EST)

Event Information

Part 1: Capturing Content Effectively

The ingestion and capture of your content is the critical first step in healthy document management and sets the organization up well for the stages that follow.  Content can and should be captured centrally and consistently so that vital information does not end up locked in silos throughout various departments.  In addition, once information is captured, it needs to be “findable”.  Advanced Rendering can help you to capture and control documents from multiple file types by automatically integrating content into workflow and Enterprise Content Management (ECM) systems and adding enhancements to improve enterprise-wide collaboration and efficiency.

Learn how you can capture your critical content more effectively. Watch this On-Demand Webinar.


Video & Audio Transcription

Roger: Today we're going to be looking at the document life cycle and really how to manage content within the document life cycle more effectively. Specifically, we're going to be drilling into the notion of capture, the very first stage of that life cycle. When you look across the entire life cycle, this is a fairly standard diagram; it's one that we created in conjunction with AIIM, the Association for Information and Image Management. There are a number of different stages. Content must go through capture, coming into your organization, archiving and storage.

It needs to be managed or work-flowed throughout the processes. There are stages involving enterprise search and content transformation, and ultimately, it needs to be delivered to the end user or constituent. There are a number of different facets, and while not every document will go through all stages, all of these stages or phases are things that you need to consider when looking at your document processes within the organization.

Today, we're going to be looking specifically at capture, and focusing on how you can eliminate manual rendering, how you can address multiple content types that are coming into the organization, how you can use advanced OCR software, Optical Character Recognition, and ultimately how you can support various business initiatives within the context of capture.

An interesting thing, just to start setting the stage, is the notion of what capture is all about and the evolution that we are seeing. Certainly, a few years back, and this is sort of 2010-2012 data, things like scanners and MFPs were the dominant force, and that was certainly where information was coming into the organization from. Fast forward to today and it's a very different picture. Sure, there's scanning, and it's still a significant part of it, as are MFPs, but now you have things like faxes, e-mail has a significant share, and mobile increasingly contributes to that.

It's a much more complex landscape. In organizations when they look at what they are actually getting into the organization, it's not just scanned images. More and more we see electronic or digitally born content. PDF files, interestingly enough, e-mails, Office files, Microsoft Word, etc. Faxes, forms and claims. A lot of different types of information are coming to your organization through a number of different mediums and really they are contributing to a more complex environment for IT managers to deal with.

Let me look at this here. They believe there is an increase in focus on what we might think of as advanced capture. Again, we are shifting from that scanning to archive, which is that bottom blue chunk, which you can see peaks in 1998 or so; that's several years ago. Certainly by 2014 and moving into 2020 that drops off quite a bit. We are seeing that basic idea, scanning to capture, becoming a much less-- interestingly, a much less critical area. Whereas data extraction and capturing information to get it into a process, that seems to be a growing area.

So you can see in 1998, the start point, there was in fact almost no capture to process. It wasn't a concept. Extracting data certainly wasn't as dominant. The idea of content-initiated processes, triggering things through metadata, automating your capture, automating the extraction processes derived out of capture: none of that was even a glimmer in anyone's eyes back in '98.

In 2014, and again moving toward 2020, those areas, the auto decision processes, the content-initiated automation and the capture to process, become more and more part of the market and, in fact, represent the lion's share of what people are doing. The reason for this shift is that organizations are no longer just looking at capture as sort of a nice-to-have or as a very basic, rudimentary thing. They're really looking at what business drivers flow out of capture.

This chart represents the strongest drivers for scanning and capture within organizations, and it's a 2010 stat from AIIM. What we see is that, as I might suspect, improved searchability still drives well over 50% of the demand. We start seeing a lot of interesting things: reducing storage space, improving speed of access for customer service.

Again, something like customer service or even the next one, compliance, those are not necessarily things that organizations would typically associate with capture, but we can certainly see a big change.

I've condensed the AIIM options a little bit: Searchability, productivity, ease of access, storage, and this might be both physically storing a paper or virtually reducing the hard drive space that is required, reducing paper, an environmental thing, or simply other. We are seeing a lot of these being different drivers that are new to the industry and are definitely shifting the way that people look at capture. Moving it from strictly an IT sort of technical or tactical requirement to much more a strategic business decision.

Clearly, this audience has a good understanding of what capture really is all about and the significant ways it can benefit your organization. Of course, in order to look at this, you need a big-picture view of capture and, as we suggested, it's clearly not just that ingestion phase. So certainly there is information coming from MFPs and mail and faxes, and carrier pigeons in some cases. Now we are seeing movement towards mobile and e-mail and scanning, so more and more of that electronically or digitally born content. That ingestion is clearly only the first part.

You have to think about the rest of the picture, looking at how you integrate with your different ECM systems. Each of these systems typically has its own capture component. You often find things like Documentum or Captiva, which are also part of the story, but beyond the ECM-specific capture tools there is the ECM repository side. There you find your Documentum, your SharePoint, your OpenText.

For the purposes of today’s conversation, we're going to be looking at the role that PDFs or content transformation or really advanced rendering play as a sort of centerpiece. You can see in this diagram information coming in from the left-hand side being ingested, being transformed into a PDF for standardization and other facets, and then being stored into and integrated directly into these different repositories. With that, I’ll pass it over to Jeff to start digging into advanced rendering and how that relates to advanced capture.

Jeff: Thanks, Roger. If you are like me you are probably thinking to yourself, but PDF is an output format. How could PDF be something that I use in my capture process? For the most part, if we were talking 15 or 20 years ago, you'd be absolutely right. When we saw low-cost or free applications, creating the PDF really was the last thing you did before you delivered your content to somebody. But now we're seeing a lot of benefits of PDF, with what we call advanced rendering really streamlining a lot of the workflows in the whole document life cycle.

Data capture is the first stage. In the evolution of rendering, we went from shareware and freeware, basic rendering where you had desktop products, and then got into embedded systems. This is where we first started seeing capture show up in products like Documentum: you could trigger a workflow so that whenever someone uploads a document we send it for OCR, creating a text layer so that the document is findable. But that's just in an embedded system and does not cover your enterprise. We're going to talk about onboarding and capturing content into your entire organization.

With advanced rendering, we're talking about, one, a perfect copy of the original document, so that you can be confident that when you have a PDF version of all of the content that you're bringing into your organization, it's an exact replica. Also, it is automated, high volume and metadata driven, meaning that we can tailor the content rendering to your specifications and to each of your departments' specifications based on the document type or any other metadata associated with or embedded within the content. And it supports practically every document file format.

Adlib PDF Enterprise converter is our product that enables these workflows: taking practically any file format and putting it through the process of conversion to PDF or a number of other file formats, making the document findable by applying industry-leading OCR on top, and then using what we call publication or publishing features to merge multiple documents, automatically create a table of contents, automatically create navigation tools such as bookmarks, and apply headers and footers and digital signatures.

If you want to stamp a document to indicate when you received it, we can do that automatically. If you want to apply a digital signature to indicate where the document was received, we can do that.

We can output for ingestion into the system, using the output as an input, so that you have a normalized file format for communication internally and externally that's common and that everyone can use.

PDF is so ubiquitous, in fact, it's the most popular document file format on the Internet today. You can guarantee collaboration internally and externally by normalizing to this file format.

By applying an OCR or subjecting your content to OCR when it's coming into your systems, we can guarantee that all of your systems that have search capability can find the content.

If you have a scanned image or perhaps some file format that your repository or search engines aren't familiar with, we can normalize all of your content to completely text-searchable PDF.

We can also build an associated index, allowing you to extract the text for full-text indexing engines in various repositories, making sure that all of your content is completely findable at a moment's notice regardless of whether it's the original or a scanned, signed copy of a particular document.

Roger: Just on that note, we have a quick question for the audience regarding the nature of their OCR programs or the scanning programs.

If you look on your screen, we’re just going to open the poll here.

The question is this: how would your organization rate its content in terms of the findability, readability or searchability that Jeff was referencing?

We've got a few different options and hopefully you can find yourself somewhere on that continuum. There are folks that only scan the image. We're finding a number of organizations, especially in insurance and finance and organizations of that ilk, where they do in fact only or primarily scan to image.

Maybe in the next level up is you do some OCR but it varies by document process or by the specific tools, or perhaps you do image only out of format but you do full OCR out of document.

Next level up is maybe a 50% rating, so all key materials are OCR-ed at the point of entry, with accuracy and completeness.

Next level up is a 75% rating, so all materials globally, including legacy documents, are accurately scanned and OCR-ed and stored properly. And at 100% you've simply achieved Nirvana.

Let’s see where folks are at on that sort of random continuum. We've got some good responses here. As might be suspected, it’s a pretty standard normal distribution. We've got just around a quarter of you that only scan to image.

Maybe, actually, numbers are shifting, maybe about a third that only scan to image. We’ve got just over half that have some OCR-ing, but it certainly varies by the process or the document. That makes sense for where organizations would be at.

Just around 15% all key materials are being OCR-ed at the point of entry, but again the degree of accuracy or completeness might remain a bit of an issue.

About 5% of you, and kudos and congratulations, claim that all materials globally including legacy documents are being accurately OCR-ed and stored. To those people my hat's off.

That’s interesting to see where organizations are at on that sort of continuum.

Jeff: Thank you, everyone. So for zonal OCR, this allows us to extract information that is represented on your documents. It's especially useful for documents that are coming in physically or through other means such as fax, where it's image only.

But it's also useful for digitally born content where the metadata isn't represented as metadata, but instead this information is just represented as content in the document.

In this example we can see some barcodes, a checkbox, an address, and on the right, although it's hard to see, a structured XML. This is how we're able to present this content, showing the page and the name of the field where we pulled the data from, including the full address and the barcode information.

It's easy to automate processes based on content that appears on the document even if it's on a paper document or coming in as a fax.
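To make the idea concrete, here is a minimal sketch of how a downstream process might read zonal OCR output and act on it. The XML layout, the field names (`barcode`, `approved`, `ship_to`) and the queue names are hypothetical illustrations, not the actual format shown on the slide or produced by any Adlib product.

```python
import xml.etree.ElementTree as ET

# Hypothetical zonal OCR result: one <field> per extracted zone,
# tagged with the page and the field name it came from.
SAMPLE_XML = """<document>
  <field page="1" name="barcode" type="barcode">PO-000123</field>
  <field page="1" name="approved" type="checkbox">true</field>
  <field page="2" name="ship_to" type="text">123 Example Street, Springfield</field>
</document>"""


def read_fields(xml_text):
    """Return {field name: {"page": int, "value": str}} from the XML."""
    root = ET.fromstring(xml_text)
    return {
        field.get("name"): {
            "page": int(field.get("page")),
            "value": (field.text or "").strip(),
        }
        for field in root.iter("field")
    }


def route(fields):
    """Pick a (hypothetical) workflow queue based on the checkbox zone."""
    if fields.get("approved", {}).get("value") == "true":
        return "approved-orders"
    return "manual-review"


fields = read_fields(SAMPLE_XML)
print(fields["barcode"]["value"], "->", route(fields))
```

The point of the sketch is only that once zone values arrive as structured data, routing a paper or faxed document becomes an ordinary lookup rather than a manual step.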

Search enablement is clearly a driving factor for performing OCR and normalizing your content to a searchable format such as PDF when you're ingesting into your repository system, so that you don't have to rely on the taxonomy or the classification you've applied to your systems.

Instead you can rely on the content and powerful search engines to find your content. Here we're searching for the word requisitioner. In the signed copy of a document, if it hadn't been subject to OCR, the search wouldn't actually find that content because it's been scanned in.

Any scanned images by default are not searchable. If you subject them to high-quality OCR, you'll still be able to find that document.

It's not just a matter of finding the word requisitioner within this document. If I was searching for, say, a sales proposal or a purchase order, I'd be able to find the signed version of the purchase order, not the original digital one, which is not the one that I'm looking for.

Also, being able to pull up metadata: I'd spoken about extracting metadata from the content using zonal OCR. A lot of times metadata is hidden within the document. The subject, author, title: there are lots of different examples of embedded metadata that you may wish to extract, either to put into your repository so that you can associate it with that document, such as the received-on date, or to stamp on the document. In this example we're showing how we can stamp the status of the document, "approved", as a watermark across the document.

All kinds of ways of pulling out and vetting and displaying the metadata that's either associated with or embedded with or a part of the content within your document.

Roger: Now we’re getting into some really advanced features here of advanced rendering as well as advanced capture. I’m just curious from the audience perspective.

In terms of this idea of extracting information, we've got a bit of an open poll here. If you look on your screen, the question is, to what extent does your capture program extract that key data?

When Jeff was talking about being able to extract some information from the document, you sort of apply that metadata logic, taking it out of the document and putting it into the system.

To what extent does your organization’s capture program extract key data? First option is: neat idea, but we're mostly focused on archiving the image.

Next layer up, maybe some data extraction as part of specific processes. Perhaps the next level up is some metadata extraction to supplement the repository set of information.

Next level up is, again, that sort of notion of Nirvana but full data extraction for analysis, auto categorization etc. We’re going to give people a few seconds to rate themselves on where their organizations are at.

And the degree to which you’re able to extract information. Some really interesting responses again. It's almost an even split with the first few options.

Just around a third of you say it's a neat idea but you're mostly focused on the image. Another third are saying some data extraction is part of a process. Maybe just a little bit over a third, about 40%, have some metadata extraction supplementing the repository's information.

Then, 10% of you, again, I'm guessing it's the same few of you that had answered previously to be at the 75% mark, you've reached full data extraction for complete analysis.

Again, congratulations to those of you who have achieved that level. I'm guessing that those are long time Adlib customers, so good for you.

Jeff: Thank you Roger. I've been talking about, you know, bringing in and extracting information. I've spoken already about, you know, the number of document file formats and even practically any document file format you can normalize to PDF so that you can ensure collaboration and shareability with all of your content, regardless of what format it was when it came in.

One of the more complex file formats is email and how do you deal with email. Emails are surprisingly complex, they come with a body, they can have attachments, they can be any file format, they have headers with metadata that you can't see.

There's data that comes from the SMTP and POP servers that you may want to rely on in order to classify or further process this content. E-mail processing can be complex, but what we can do is simplify that process by allowing you to really bring in all of your content, even through email, pull out the attachments, and allow advanced processes like ROT analysis and cleanup. Then there's migrating your archives, when Outlook leaves you holding the bag, so to speak, when they hand you a PST file and say, "Here's your archive." What are you going to do with that? If you're in a large organization, hopefully your IT's going to help you with that.
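As a rough illustration of the email structure described here (headers, a body, and attachments of any file type), the following sketch uses only Python's standard `email` package to split a message into those parts. The sample message, the addresses and the file name are made up for the example, and this is not how any particular capture product handles email internally.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Create a small, made-up message with one PDF-like attachment."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "capture@example.com"
    msg["Subject"] = "Signed purchase order"
    msg.set_content("Please find the signed copy attached.")
    msg.add_attachment(
        b"%PDF-1.4 placeholder",  # stand-in bytes, not a real PDF
        maintype="application",
        subtype="pdf",
        filename="po-signed.pdf",
    )
    return msg.as_bytes()


def split_email(raw: bytes) -> dict:
    """Separate header metadata, the plain-text body and the attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "headers": {
            "from": str(msg["From"]),
            "subject": str(msg["Subject"]),
        },
        "body": body.get_content().strip() if body else "",
        "attachments": [
            (part.get_filename(), part.get_content_type(), part.get_content())
            for part in msg.iter_attachments()
        ],
    }


parts = split_email(build_sample())
print(parts["headers"]["subject"], [name for name, _, _ in parts["attachments"]])
```

Each attachment comes back as a name, a content type and raw bytes, which is the shape a conversion or OCR step would need in order to treat it like any other incoming document.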

We can enable automated processes for pulling the content out of those PST files and putting it into your repository so it's completely searchable and completely accessible for the foreseeable future, even using normalized file formats such as PDF and PDF/A. There is also thumbnail viewing within your repository. Rather than just viewing an icon that shows you what kind of file it is, we can prepare and insert thumbnails into your repositories, such as SharePoint, which we're seeing here, or Documentum or other repositories. That way you can see a preview of the document within a library or collection of files and ensure that you're picking the right file at the right time, even if you didn't name it using the most efficient standards, as many of us do.

Then also image comparison and data classification. Sometimes we can very easily look at documents and see how close they are to each other so that you can classify them based on the type of form, and it's not only classification but comparison for validation. If you're in a validated environment and you need to do the installation qualification, the process qualification, this allows you to see very quickly quality issues that may arise from one version to another or one system to another. When you have a set of known content, we can show you exactly what's happening to that content with each version of your system by comparing those documents.
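One simple way to picture comparison for validation is a pixel-level difference between a known-good rendering and a new one. The sketch below assumes each page has already been rasterized into a grid of grayscale values (0 to 255); the function name, the tolerance parameter and the tiny sample pages are illustrative assumptions, and real comparison tools are considerably more sophisticated.

```python
def diff_ratio(baseline, candidate, tolerance=0):
    """Fraction of pixels whose grayscale values differ by more than tolerance."""
    same_shape = len(baseline) == len(candidate) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(baseline, candidate)
    )
    if not same_shape:
        raise ValueError("pages must have the same dimensions")
    total = sum(len(row) for row in baseline)
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return changed / total if total else 0.0


# Two tiny made-up "pages": the second differs in exactly one pixel.
known_good = [[255, 255, 0], [255, 0, 0]]
new_render = [[255, 255, 0], [255, 0, 255]]

ratio = diff_ratio(known_good, new_render)
print("pages match" if ratio == 0 else f"{ratio:.1%} of pixels changed")
```

Running the same check over a fixed set of known documents after each system upgrade gives a quick signal of whether rendering output has drifted from one version to the next.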

Roger: Thanks, Jeff. Clearly, there's a lot going on here. It's clearly a lot more than just the PDF that you mentioned at the very beginning as a freeware type of scenario. There are a lot of complexities, a lot of nuances. Which brings me to painting. It's an odd analogy, but to connect the dots a little bit, when Jeff and I were preparing this presentation we were thinking about capture and the role that it plays within the document life cycle. We were saying, "It's like painting. A lot of times people when they think about painting projects they get images like these, with happy people painting pretty walls and it's very fun and frivolous."

Jeff: They never paint close to the edges.

Roger: They never do. So part of the challenge is that you've got to do the preparation. In many ways, capture is that preparation. It's that very, very first stage. It's oftentimes laborious. It can be difficult. It can be challenging.

If you don't do it right, if you don't, in this case, do the taping properly and make sure all those corners and edges are perfectly taped off, or in the case of capture, make sure that your information is coming into your organization just right, you end up with a bit of a mess. Some multi colors and different shades. You can imagine that the document equivalent is information that is perhaps image only, that is unsearchable, that is unusable, that is not properly classified, that is not properly metadata tagged, etc.

Of course, applying the lessons that you've learned today in terms of applying advanced rendering to advanced capture processes you end up with a perfectly painted house with lovely walls and perfectly done corners. We hope that this information has helped you on your journey towards painting your own house, or in a more apt way, capturing the appropriate information for your organization's business processes.

Jeff: You're saying Adlib PDF is the masking tape of your capture processes.

Roger: There you go.

Jeff: Excellent. That brings us to the end of our part one today on capture and how advanced rendering helps with that. Next up, at the bottom of this cycle here, we are going to be looking at archiving.
