eTMF Document Auto-Classification: What Used To Take 8 Minutes, Now Takes Seconds!
In this live session, Keith Parent, founder of Court Square Group and RegDocs365, and Anthony Vigliotti, Chief Product Officer at Adlib Software, walked us through the current challenges affecting life sciences organizations, specifically in clinical trial management, and the emerging technology that helps automate many of the manual processes involved in processing and classifying documents.
Current Challenges Putting Pressure On Life Sciences Organizations
Keith Parent: As part of a clinical trial, there are going to be a lot of things that sponsors or CROs run into over time. Many of these problems are around clinical trial management, where you may have decentralization of trials.
- The trials are becoming more complex on a regular basis. You have staff shortages, and people don't get as well-trained on some of the systems that they have to use, specifically around documentation. The volumes of documentation continue to increase.
- There's complexity in the documents and there are a lot of manual processes happening around some of these documents.
- And then when you get the documents in, the regulatory requirements that you have to deal with on the submission side are increasing as well.
You may have multiple agencies that you're going to be submitting to and multiple submissions across different areas. So we understand a lot of the challenges that you're going to be faced with, and our goal is to figure out how we can help shorten the timeframe to get to the market.
Challenges Created Via Current Processes
Keith Parent: There are more trials! Our CRO customers have multiple clinical trial sites providing trial documentation. As our customers deal with more trials across more sites, that’s going to create more documents.
There are multiple delivery methods and formats of documents. Some come in as PDF, Word, or Excel files. Some people may email them, some may use FTP, some may Dropbox them; there are many different ways of getting documents to you. Our goal is to figure out how to best handle the different ways that they come in.
A lot of times at a clinical site, they're really scrambling for people to get things done. They may take a pile of documents, throw 'em on a scanner, and now they've scanned these documents that were searchable before and made them unsearchable. Now that they're coming in as an image file, you may have multiple documents that are merged into one PDF that must be split up.
So, the industry is actually creating some of these problems just by the processes that we incorporate on a manual basis. How can we combat some of that and move on to the next way of using technology to solve that?
A Day In The Life Of A CRA (Clinical Research Associate)
Keith Parent: A CRA pulls a document in either through email or through an FTP site, opens it up, and says, “Oh, I've got multiple documents here. I have to split this document. And then when I split it, I have to use standard naming conventions to be able to save those documents somewhere to process them.”
A lot of times you'll find that there are a lot of working document libraries where people are working with their documents. This is where they look at the documents, update them by putting appropriate metadata on them, and figure out where they're supposed to go within the reference model, which is the eTMF reference model. It can take anywhere from 5 to 10 minutes to deal with a document, with an average of around 8 minutes per document.
The fact that you have to open up hundreds of documents over the course of a trial means it's going to be a very time-consuming process.
Clinical Trials By The Numbers
Keith Parent: One of the big issues that we see from a clinical trial perspective is every CRA may have multiple trials to deal with. You may have multiple sites per trial. A lot of trials may have only a few sites, but you may have 150 or more sites per trial. As soon as you have that, you're going to end up with lots of different documents per site. How do we search for all those documents? How do we triage those?
We may have 1000 to 1500 documents per trial depending on how long the trial is. If it's a multi-year trial we could have lots and lots of documents.
With the 1572 Forms alone, across 150 sites at 10 documents per site, you've got 1500 documents that are just 1572s that you have to search through.
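To put those figures together, here is a rough back-of-the-envelope calculation. It uses only numbers quoted in this session (150 sites, 10 documents per site, and the average of about 8 minutes of manual handling per document); the function name is ours and is purely illustrative.

```python
def manual_hours(sites: int, docs_per_site: int, minutes_per_doc: float) -> float:
    """Return the total manual handling time in hours."""
    total_docs = sites * docs_per_site
    return total_docs * minutes_per_doc / 60


# 150 sites x 10 documents per site = 1500 documents.
# 1500 documents x 8 minutes = 12000 minutes = 200 hours.
hours = manual_hours(sites=150, docs_per_site=10, minutes_per_doc=8)
print(f"{hours:.0f} hours of manual handling")
```

In other words, at the quoted average, the 1572s alone would represent roughly 200 hours of manual work on a trial of that size.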
A lot of people will say that email and correspondence are the hardest to identify. When they get the documents coming in and look at the correspondence, they have to determine whether it is trial-level or site-level.
Complexities Of The eTMF Framework
Keith Parent: When we look at the trial master file requirements, part of what we want to do is look at the essential documents that we need every time: the completed forms, the checklists, the reports. In some of the eTMF systems, there may be dashboards to identify and make sure that we've got the right documents in the right place.
For investigational medical products, we want to have traceability of those documents. We want to have the auditability and be able to make sure that the documents and the system they are in are validated.
Superseded documents - what do we mean by that? Well, part of that is we may have obsolete documents over time. The 1572 Form is a key example of that. A new 1572 Form may make a workflow that you were working on obsolete because there may be a sub-PI or a PI on that 1572 who is no longer part of the trial. So now, this particular document doesn't need to be tracked anymore and needs to be updated.
And lastly, there is correspondence that needs to be put in the right place.
These are some of the requirements for which we are working to come up with solutions that best utilize intelligent automation for manual workflows, such as putting documents into a drop folder that analyzes them and acts on them, while people always have the final say on where those documents should go.
Challenges With Automated Document Classification
- Sources: Documents are coming in through a variety of sources (email, fax, etc.), and these sources vary depending on the end user. Unsophisticated research sites, such as those outside metropolitan areas or in third world countries, may have a poor internet connection and rely on fax.
- Document Types: There are a range of file types, such as Word, Excel, and images, and this is just a small subset of the file types that may be in your environment. You may be working with a Word document, which is a digital document, but if you print it, it becomes a non-digital document that now has to be OCR-ed. So now you have documents that contain both structured and unstructured data elements that need to be part of the eTMF process. That is the core challenge.
- Identification & Classification: A research associate or coordinator has to determine what these documents are. In this case, we are using the example of a 1572 Form, but it could be correspondence, or it could be a superseding document. The end user has to know what that document is in order to place it in the appropriate eTMF zone, which is an extremely manual, labor-intensive process.
Some of the forms are pretty straightforward, but when you're dealing with unstructured documents such as an email, you need to actually read, understand the context of that particular document, and understand what zone it should be properly placed into.
- Metadata Extraction: So now that you have these documents classified into the appropriate zone, your job is not done! There are going to be critical data elements or metadata that need to be extracted from these documents and placed in the appropriate metadata locations in the eTMF folder structure.
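As a rough illustration of the last two steps, identification and metadata extraction, here is a minimal sketch that works on text already pulled out of a document. It is not the trained model discussed in this session: it swaps in simple keyword rules and regular expressions, and the document type labels, field names, and patterns are all hypothetical.

```python
import re

# Hypothetical keyword rules: a simple stand-in for a trained classifier.
KEYWORD_RULES = {
    "Form FDA 1572": ("form fda 1572", "statement of investigator"),
    "Correspondence": ("from:", "subject:"),
}

# Hypothetical field patterns; real documents need patterns tuned to their layouts.
FIELD_PATTERNS = {
    "protocol_number": re.compile(r"Protocol (?:Number|No\.?):\s*([A-Za-z0-9-]+)"),
    "site_number": re.compile(r"Site (?:Number|No\.?):\s*(\d+)"),
}


def classify(text: str) -> str:
    """Return the first document type whose keywords appear in the text."""
    lowered = text.lower()
    for doc_type, keywords in KEYWORD_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    return "Unclassified"


def extract_metadata(text: str) -> dict:
    """Return each field's captured value, or None when the pattern is not found."""
    found = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        found[field] = match.group(1) if match else None
    return found


sample = (
    "Form FDA 1572 Statement of Investigator\n"
    "Protocol Number: ABC-123\n"
    "Site Number: 042"
)
print(classify(sample))
print(extract_metadata(sample))
```

Fields that come back as None, or documents that come back as "Unclassified", are the ones a person would still need to look at.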
This Is Where It Gets Exciting!
Anthony Vigliotti: The first thing we do when we look at a document is a layout analysis; we don't go straight to the critical data elements. Based on the layout of that document, we're able to determine potentially what type of document it is, and then OCR that document. If it was not text-searchable and was only an image file, we now have the ability to search it and extract data from it.
The second thing that we've done is train the system using thousands of documents to be smart about the typical eTMF documents that make up a trial. We call it human-assisted machine learning. We believe that machine learning layout analysis is the unique element that makes the system fast and predictable, and that will really start driving some time savings. If a document doesn't meet a high enough confidence threshold, we get a human in the loop to do some validation.
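The confidence-threshold idea can be sketched in a few lines. This is an illustrative example only, not Adlib's implementation: the `Prediction` class, the `route` function, the queue names, and the 0.85 cutoff are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass

# Assumed cutoff for illustration; the session does not state an actual value.
REVIEW_THRESHOLD = 0.85


@dataclass
class Prediction:
    filename: str
    doc_type: str
    confidence: float  # classifier score between 0.0 and 1.0


def route(prediction: Prediction, threshold: float = REVIEW_THRESHOLD) -> str:
    """Send confident predictions onward; queue the rest for a person to check."""
    if prediction.confidence >= threshold:
        return "auto_file"
    return "human_review"


predictions = [
    Prediction("site01_1572.pdf", "Form FDA 1572", 0.97),
    Prediction("site02_email.pdf", "Correspondence", 0.61),
]
for p in predictions:
    print(p.filename, "->", route(p))
```

Raising the threshold sends more documents to a reviewer, while lowering it automates more at the cost of more potential misfiles, so the right value depends on how much risk a given team is willing to accept.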
We've put in thousands and thousands of pages to help the system get to a base understanding of the eTMF folder structure and the types of documents, but there are potentially going to be some unique differences in your environment. As the documents are being ingested by the platform, the machine learning continues to learn.
This isn't a one-time event where we've put in a few thousand pages and we're done. As we keep adding thousands of pages, the system becomes that much more intelligent. As we get the human in the loop and they start correcting some of the low confidence values, the system is benefiting from that human interaction and learning over time.
- Anthony Vigliotti, CPO, Adlib Software
Anthony Vigliotti: This is RegDocs365 and what we call a hot folder, an ingestion folder. What we have here is documents that haven't been processed in any way. They've simply been placed in this location. Without the AI systems, an individual would have to open up these documents one by one, classify them into a zone, identify the metadata that needs to be extracted, place them in the appropriate folder with the right naming convention, and so on and so forth. Doing this thousands of times is not fun and is prone to error. What we're doing here is using this as a landing spot for these documents before they are processed by Adlib.
As you're collecting your trial documents, all you have to do is simply place them at one location, nothing more.
So what happens next? The next thing is that Adlib simply ingests the documents and will go through the layout analysis and ultimately the classification of that document.
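For readers who want a feel for the hot folder pattern in general, here is a minimal sketch using only the Python standard library. It does not use RegDocs365 or Adlib APIs: the folder names, the `ingest` function, and the `handler` callback are hypothetical, and a real pipeline would run layout analysis and classification where this example simply prints the file name.

```python
import shutil
import tempfile
from pathlib import Path


def ingest(hot_folder: Path, processed_folder: Path, handler) -> list:
    """Run handler on every file in hot_folder, then move it to processed_folder."""
    processed_folder.mkdir(parents=True, exist_ok=True)
    handled = []
    # sorted() takes a snapshot, so moving files during the loop is safe.
    for path in sorted(hot_folder.iterdir()):
        if not path.is_file():
            continue  # skip sub-folders
        handler(path)
        shutil.move(str(path), str(processed_folder / path.name))
        handled.append(path.name)
    return handled


# Demo in a temporary directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    hot = Path(tmp) / "hot"
    hot.mkdir()
    (hot / "site01_1572.pdf").write_bytes(b"")
    (hot / "site02_email.msg").write_bytes(b"")
    names = ingest(hot, Path(tmp) / "processed", handler=lambda p: print("processing", p.name))
    print(names)
```

Moving each file out of the hot folder after it is handled keeps the folder as a pure landing spot, so anything still sitting there is known to be unprocessed.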
Here is a summary that you should expect to receive for each trial. Our overall classification status is 90% for all the documents that we've ingested for this breast cancer trial, and there are some review tasks that need to be done.
This gives you a high-level overview of the manual steps that have now been automated. We've saved an end user from having to physically touch, classify, and extract data from every document.
- Anthony Vigliotti, CPO, Adlib Software
When you click on the yellow exclamation point, it pulls up the document in question and shows you the metadata fields, the index fields that we've pulled out. And we'll highlight the areas where we believe a human needs to intervene.
What used to be eight minutes per file is now seconds to validate any errors or low confidence scores.
Watch the rest of the session and the Q&A here.