Records have been written for thousands of years, in many scripts and on many media. Clay tablets, stone tablets, wax tablets, papyrus, parchment, and paper all preceded digital media. In our hurry to move from paper to digital media, the most common shortcut has been to scan paper into PDF documents, which have the virtue of being digital and portable, but the drawback of being essentially unstructured.
What companies need as they streamline their operations is structured data, but getting from unstructured to structured documents has been time-consuming. Many products and services have been offered for OCR (optical character recognition) and text mining, but no single vendor dominates the field. To understand the size of the problem, consider that an estimated 80% to 90% of data is currently unstructured, and the volume of unstructured data is growing from tens of zettabytes to hundreds of zettabytes. (One zettabyte is one billion terabytes.)
The usual approach to parsing a PDF document involves segmenting each page, applying OCR (often accomplished using convolutional neural networks), identifying the layout, extracting the text of interest, and converting digits to numeric values. Some services can take the next steps as well, extracting entities and inferring sentiment from selected text fields, such as articles, comments, and reviews.
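To make the last two steps of that pipeline concrete, here is a minimal Python sketch of extracting the text of interest and converting digits to numeric values. It assumes segmentation, OCR, and layout analysis have already produced plain text lines. The sample lines, the "Label: value" format, the US-style number formatting, and the names `extract_fields` and `to_number` are all illustrative assumptions, not the API of any cloud service discussed here.

```python
import re
from decimal import Decimal

# Hypothetical OCR output for one page, with layout already resolved into lines.
OCR_LINES = [
    "Invoice No: 10482",
    "Total Due: $1,234.50",
    "Customer: Acme Corp",
]

# Assumed field format: "Label: value" on a single line.
FIELD_RE = re.compile(r"^\s*(?P<label>[^:]+?)\s*:\s*(?P<value>.+?)\s*$")
# A value counts as numeric if, after cleanup, it is digits with an optional fraction.
NUMERIC_RE = re.compile(r"^-?\d+(\.\d+)?$")


def to_number(text):
    """Return an int or Decimal if text looks numeric, else the original string.

    Assumes US-style formatting: "$" prefix, "," thousands separator, "." decimal point.
    """
    cleaned = text.replace(",", "").lstrip("$").strip()
    if not NUMERIC_RE.match(cleaned):
        return text
    return Decimal(cleaned) if "." in cleaned else int(cleaned)


def extract_fields(lines, wanted):
    """Keep only the labeled fields of interest and convert numeric values."""
    fields = {}
    for line in lines:
        match = FIELD_RE.match(line)
        if match and match.group("label") in wanted:
            fields[match.group("label")] = to_number(match.group("value"))
    return fields


result = extract_fields(OCR_LINES, {"Invoice No", "Total Due"})
print(result)
```

Real documents rarely arrive this tidy: labels and values are often split across table cells or separate lines, which is why layout identification comes before field extraction in the pipeline.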
In this article we’ll discuss the document parsing and splitting services available from the big three public cloud providers: AWS, Microsoft Azure, and Google Cloud. The use cases these services cover include extracting text and tagged values from lending and procurement documents, contracts, driver’s licenses, and passports.