For businesses that rely heavily on documents to gather customer data, extracting the information required for business insights can be time-consuming and challenging. This is especially true in industries such as Financial Services, Insurance, Law, and Healthcare, where industry-specific knowledge is required to properly interpret documents.
It’s no surprise then that Intelligent Document Processing (IDP) is a growing area of interest for document-heavy organisations. IDP combines Optical Character Recognition (OCR) and Natural Language Processing (NLP) to quickly and accurately extract valuable insights from unstructured documents such as forms, contracts, emails, spreadsheets, etc.
However, the use of industry-specific terminology can still present a challenge.
Even best-in-class language models, such as those behind Amazon Comprehend and ChatGPT, will not extract industry-specific information without being retrained on documents from the domain. Training these models from scratch is often not possible, even for larger organisations, because costs can run into the millions of pounds and the work requires an enormous volume of training documents, some of which may be proprietary (source).
Fortunately, through a process called ‘transfer learning’, we can fine-tune AWS Comprehend to accurately detect these domain-specific terms using a very small number of labelled documents. In this post, we’ll explore how transfer learning enables you to harness the power of AWS Comprehend to efficiently extract domain-specific data, by looking at a customer use case in the insurance industry.
Named Entity Recognition with AWS Comprehend
AWS Comprehend is an NLP service, built on models trained on vast collections of documents, that performs tasks such as document classification and Named Entity Recognition (NER). NER is a machine learning task that automatically identifies words or phrases (entities) in text, such as organisations, names, dates, and so on. This enables efficient data extraction from large collections of documents at scale and is a critical component of Intelligent Document Processing (IDP). However, off-the-shelf NER models are limited in the types of entities they can detect.
In our insurance use case, we’re looking to classify and extract information from various policy documents. This includes industry-specific information, such as deductibles and perils.
Using base Comprehend to analyse the sample above from an underwriter’s email, we can detect dollar amounts, percentages, and other quantities. However, the entities specific to the insurance industry, such as deductibles, are not identified in this case because the base model is not tuned for this task. This is where transfer learning comes in.
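As an illustration, the base model can be called through the boto3 SDK. The sketch below is our own minimal example, not part of the use case described here: the function names and the confidence threshold are choices we have made for illustration, and running it assumes AWS credentials are configured.

```python
def detect_base_entities(text):
    """Run the pre-trained (base) Comprehend model over a snippet of text."""
    # boto3 is imported here so the pure helper below works without the SDK.
    import boto3
    client = boto3.client("comprehend")
    return client.detect_entities(Text=text, LanguageCode="en")["Entities"]

def confident_entities(entities, threshold=0.9):
    """Keep only high-confidence spans, as (Type, Text) pairs."""
    return [(e["Type"], e["Text"]) for e in entities if e["Score"] >= threshold]
```

Against text like the underwriter's email, the base model returns generic types such as QUANTITY and DATE, each with a confidence score – but nothing for deductibles or layers.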
What is Transfer Learning?
Transfer learning is a process in which the learnt knowledge of a pre-trained machine learning model is reused in a new, related task. It has become a popular area of ML research and has proved particularly effective for NLP tasks. Pre-trained models are trained in an unsupervised way: they are exposed to huge volumes of unlabelled text from around the web and learn the semantic and syntactic relationships between the words and sentences in the training data.
In the example in the previous section, we saw that the Comprehend base model can identify a wide range of entities, but since this model is very general it does not identify many industry-specific terms.
Fine-Tuning AWS Comprehend
Retraining Comprehend from scratch with insurance documents would be extremely costly. These models can take days or weeks to train and require a vast number of documents. Fortunately, we can use transfer learning to leverage the time and resources already put into Comprehend and have it detect our own custom entities!
At a high level, fine-tuning Comprehend involves freezing all but the last layer in the network and re-fitting just that layer on the new, labelled text. See below for an illustration of this process applied to a custom LAYER entity – a "layer" in this context being a proposed insurance policy's limit and excess.
The hard-earned knowledge of the original model trained on one collection of documents to predict one set of labels is “transferred” to a new task of learning to predict deductibles from the insurance documents. This can be done in a fraction of the time and, more importantly, with a fraction of the number of labelled examples.
To take advantage of transfer learning, all you need to do is provide the documents, which are then processed and loaded into AWS SageMaker Ground Truth labelling jobs. In the labelling UI, the subject-matter expert (SME) simply highlights the relevant entities, as shown below.
Comprehend is constantly being improved – the latest release requires a minimum of only 3 documents with 25 labelled entities of each type, although AWS recommends at least 250 documents containing 100 or more instances of each entity for good-quality predictions.
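Once the labelled data is in S3, the fine-tuning job itself can be started with Comprehend's `create_entity_recognizer` API. The sketch below is illustrative only – the bucket paths, entity names and IAM role ARN are placeholders, not values from our project:

```python
def build_input_data_config(entity_types, docs_s3, annotations_s3):
    """Assemble the InputDataConfig for a custom entity recognizer."""
    return {
        "EntityTypes": [{"Type": t} for t in entity_types],
        "Documents": {"S3Uri": docs_s3},
        "Annotations": {"S3Uri": annotations_s3},  # Ground Truth labelling output
    }

def start_recognizer_training(name, role_arn, config):
    """Kick off fine-tuning; returns the new recognizer's ARN."""
    import boto3  # lazy import: the config builder above needs no SDK
    client = boto3.client("comprehend")
    resp = client.create_entity_recognizer(
        RecognizerName=name,
        DataAccessRoleArn=role_arn,
        LanguageCode="en",
        InputDataConfig=config,
    )
    return resp["EntityRecognizerArn"]
```

Training then runs fully managed – you only supply the documents, the annotations and the list of custom entity types.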
Here we see a custom Comprehend model’s predictions on examples of a LAYER entity embedded in other text. Note that it can recognise different formats of the entity.
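For real-time predictions like these, the fine-tuned model is served behind a Comprehend endpoint. A minimal sketch, assuming a hypothetical endpoint ARN (the grouping helper is our own convenience function, not part of the API):

```python
from collections import defaultdict

def detect_custom_entities(text, endpoint_arn):
    """Query a fine-tuned recognizer through its real-time endpoint."""
    import boto3
    client = boto3.client("comprehend")
    # With EndpointArn, LanguageCode is omitted: the custom model's
    # language is fixed when it is trained.
    return client.detect_entities(Text=text, EndpointArn=endpoint_arn)["Entities"]

def group_by_type(entities):
    """Collect detected spans per entity type, e.g. {'LAYER': ['$5m xs $10m']}."""
    grouped = defaultdict(list)
    for e in entities:
        grouped[e["Type"]].append(e["Text"])
    return dict(grouped)
```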
So, how does a fine-tuned LLM work so well from so few labelled entities?
The pre-trained models already hold a rich representation of words and language patterns from the initial unsupervised training. When they recognise these patterns around the entities in the new labelled data, they can quickly add the new entity types to their vocabulary, together with an understanding of the context in which each is typically found. Amazon Comprehend even takes the location of the entities within the source documents into consideration!
Fine-Tuned NER in Practice
At Inawisdom, we have years of experience tackling this very problem. We have completed multiple IDP projects for international insurance firms. In fact, we have witnessed the revolution in the data requirements for fine-tuning LLMs like Amazon Comprehend – from the need for entire teams to complete the labelling jobs down to something that can be done in one underwriter’s afternoon.
On the insurance use case described above, we were able to drastically reduce the costs involved in fine-tuning the model to recognise industry-specific terms. With fewer than 300 labels, we achieved an 80% success rate when extracting complex layer data from a variety of files.
In a similar use case, we developed a solution capable of successfully extracting up to 75 data points from one document. This was fully productionised and is now actively used to save time and money in the insurance sector.
One challenge we’ve run into is that while Comprehend displays performance metrics for a fine-tuned model, its proprietary data pre-processing does not reveal which documents end up in the training and testing (TT) sets, and therefore does not allow deeper evaluation on subsets of the TT splits.
To address this, Inawisdom has developed custom software that enhances the labeller’s experience and allows us to give a complex evaluation of the model’s performance on any breakdown of the data we wish. The first step splits the labelling jobs into batches, which can be whatever size is most suitable for the labeller – meaning if they find a spare 5 minutes, they can quickly complete a labelling job or two.
Each labelling job produces an output manifest that records the input documents, their annotations and other relevant metadata. Comprehend accepts a maximum of five of these manifests as training input, but some labelling tasks require more batches – for instance, when a very large number of labelled files must be distributed across a big labelling workforce, or when the timeframe for completing the task is short.
To remedy this, Inawisdom’s software can concatenate any number of manifest files into a single input. By tracking the post-processing of the labelling output, we can construct the TT sets so that we know exactly which files are used in each.
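The core of that concatenation step can be sketched in a few lines of Python. Ground Truth manifests are JSON Lines files; the field names below are simplified placeholders rather than the full manifest schema, and the batch-based split is one way of keeping TT membership traceable:

```python
import json

def concat_manifests(manifest_texts):
    """Merge several JSON Lines manifests into one list of records,
    tagging each record with its batch of origin so train/test
    membership of every file can be tracked later."""
    records = []
    for batch_id, text in enumerate(manifest_texts):
        for line in text.splitlines():
            if not line.strip():
                continue
            rec = json.loads(line)
            rec["batch"] = batch_id
            records.append(rec)
    return records

def train_test_split(records, test_batches):
    """Deterministic split by batch: records from `test_batches` go to test."""
    train = [r for r in records if r["batch"] not in test_batches]
    test = [r for r in records if r["batch"] in test_batches]
    return train, test
```

Because the split is keyed on whole batches rather than random rows, we always know exactly which source files sit in each of the TT sets.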
AWS also has examples of successful applications of fine-tuned NER models in insurance domains.
In a case study described on the AWS Machine Learning blog, transfer learning is applied by fine-tuning Comprehend on a set of demand letters sent to insurers, targeting entities including the insurance company’s name and address, the policyholder’s name and the pay-out amount. The team labelled 300 letters with more than 200 instances of each entity type. After fine-tuning, the resulting model achieved 96–100% accuracy depending on the entity.
Transfer learning is a powerful technique that allows us to leverage the abilities of large language models for a fraction of the time and resources it takes to develop one from scratch. As we’ve seen, this has huge potential for value in sectors such as insurance and law, where training models on industry-specific terms has traditionally been time- and cost-prohibitive.
By using transfer learning, we can fine-tune Amazon Comprehend using just a handful of documents to produce a powerful data extraction model – making it possible for businesses to reap the rewards of Intelligent Document Processing much faster.