If you work at a services company such as a bank or an insurer, you will often need to extract information from PDF files. This is very common in Know Your Customer (KYC) processes. Some PDF files are scanned images, so you need to run optical character recognition (OCR) first. Even when the PDF files are digital and searchable, the data is only semi-structured.
Banks and insurers need to know their customers, which is essentially a due diligence check. You need to extract the shareholders' names, the list of directors, the paid-up capital, and so on, and check them against a blacklist database. Before you can run that check, you first have to extract the names from the PDF.
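As a rough illustration of the blacklist check, here is a minimal sketch that compares extracted names against a list using Python's standard difflib module. The function names, the sample names, and the 0.85 threshold are all assumptions made for this example; a real screening system would likely use a dedicated matching service.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def screen_names(extracted, blacklist, threshold=0.85):
    """Return (extracted_name, blacklist_name, score) for every pair at or above threshold.

    The threshold is an arbitrary value chosen for this sketch.
    """
    hits = []
    for name in extracted:
        for listed in blacklist:
            score = SequenceMatcher(None, normalize(name), normalize(listed)).ratio()
            if score >= threshold:
                hits.append((name, listed, round(score, 2)))
    return hits


# Hypothetical names, as they might come out of OCR with odd spacing and casing.
extracted_names = ["John  TAN Wei Ming", "Alice Wong"]
blacklisted_names = ["John Tan Wei Ming"]
hits = screen_names(extracted_names, blacklisted_names)
```

Fuzzy matching is used here because names extracted by OCR rarely match a database entry character for character.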
Why do you need intelligent extraction? Business profiles differ from country to country, so template-based PDF extraction often does not work well. Sometimes a list of names in a table also continues onto another page. Both problems make template extraction difficult or impossible.
When we do intelligent document extraction, we are not just looking at the location of the text on the page. The system understands the text content: it learns both the format and the meaning of the text in order to extract the right data.
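The difference between extracting by location and extracting by content can be sketched in a few lines. In this example a simple keyword rule stands in for the trained model, so it only illustrates the idea and is not how a learned system works internally. The two layouts and all field values are made up.

```python
import re

# Two hypothetical layouts of the same business profile, as lines of extracted text.
layout_a = [
    "Company Name: Acme Pte Ltd",
    "Registration No: 201912345A",
    "Paid-up Capital: SGD 100,000",
]
layout_b = [
    "Company Name: Acme Pte Ltd",
    "Registration No: 201912345A",
    "Registered Address: 1 Example Road",
    "Paid-up Capital: SGD 100,000",
]


def by_position(lines):
    """Template-style rule: assume paid-up capital is always on the third line."""
    return lines[2].split(":", 1)[1].strip()


CAPITAL = re.compile(r"paid[- ]up capital\s*:\s*(.+)", re.IGNORECASE)


def by_content(lines):
    """Content-style rule: find the line that actually mentions paid-up capital."""
    for line in lines:
        match = CAPITAL.search(line)
        if match:
            return match.group(1).strip()
    return None


print(by_position(layout_a))  # SGD 100,000
print(by_position(layout_b))  # 1 Example Road (wrong field)
print(by_content(layout_b))   # SGD 100,000
```

The position rule breaks as soon as one extra line appears, while the content rule still finds the right field.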
Understanding the text in this way requires training a machine learning model. The advantage of such a system is that it is more robust and can handle missing data or data mixed with noise. The disadvantage is that it requires training data, which means we need to annotate (tag) the PDF files from which we want to extract data.
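To give a feel for what annotated training data can look like, here is a small sketch that turns labeled spans into one tag per token using the BIO scheme (B = beginning, I = inside, O = outside). BIO is one common convention for this kind of tagging and is an assumption here, as are the label names DIRECTOR and CAPITAL and the sample text.

```python
def to_bio(tokens, spans):
    """Turn labeled token spans into one BIO tag per token.

    spans is a list of (start, end, label) tuples using token indexes,
    with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical line of text from a business profile, split on whitespace.
tokens = "Director : Tan Wei Ming Paid-up capital : 100,000".split()

# Hypothetical annotations: tokens 2-4 are a director name, token 8 is the capital.
spans = [(2, 5, "DIRECTOR"), (8, 9, "CAPITAL")]

tags = to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each annotated document produces token and tag pairs like these, which a sequence-labeling model can then learn from.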
In our experience with intelligent document extraction, annotating the training data gives the system much better accuracy. Once we have built a model, we can export the data from these PDF files to Excel, with clean data in structured columns.
This makes any downstream automation easy. See demo.
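The export step can be sketched as follows. To keep the example free of third-party libraries it writes CSV, which Excel can open, instead of a native Excel workbook. The column names and the sample records are hypothetical.

```python
import csv
import io

# Hypothetical column layout for the structured output.
FIELDS = ["company", "role", "name", "paid_up_capital"]


def to_csv(records):
    """Serialize extracted records into CSV text with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical records, as they might come out of the extraction model.
records = [
    {"company": "Acme Pte Ltd", "role": "Director", "name": "Tan Wei Ming", "paid_up_capital": "100000"},
    {"company": "Acme Pte Ltd", "role": "Shareholder", "name": "Alice Wong", "paid_up_capital": "100000"},
]

csv_text = to_csv(records)
print(csv_text)
```

Saving csv_text to a file with a .csv extension gives a sheet with one row per person and one column per extracted field.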
Written by: Christopher Lim