The need for OCR in Document Management
copyright by blogs.dropbox.com
This post will take you behind the scenes on how DropBox built a state-of-the-art Optical Character Recognition (OCR) pipeline for their mobile document scanner. They used computer vision and deep learning advances such as bi-directional Long Short Term Memory (LSTMs), Connectionist Temporal Classification (CTC), convolutional neural nets (CNNs), and more. The document scanner makes it possible to use the mobile phone to take photos and “scan” items like receipts and invoices. However, the mobile document scanner only outputs an image — any text in the image is just a set of pixels as far as the computer is concerned, and can’t be copy-pasted, searched for, or any of the other things you can do with text. Hence the need to apply Optical Character Recognition, or OCR. This process extracts actual text from our doc-scanned.
Ready-made vs. building own pipeline
When DropBox built the first version of the mobile document scanner, they used a commercial off-the-shelf OCR library, in order to do product validation before diving too deep into creating their own machine learning-based OCR system. This meant integrating the commercial system into the scanning pipeline, offering both features above to business users to see if they found sufficient use from the OCR. Once it was confirmed that there was indeed strong user demand for the mobile document scanner and OCR, they decided to build their own in-house OCR system for several reasons. First, there was a cost consideration: having their own OCR system would save them significant money as the licensed commercial OCR SDK charges based on the number of scans. Second, the commercial system was tuned for the traditional OCR world of images from flat bed scanners, whereas this operating scenario was much tougher, because mobile phone photos are far more unconstrained, with crinkled or curved documents, shadows and uneven lighting, blurriness and reflective highlights, etc.
Building the OCR system using AI
In fact, a sea change has happened in the world of computer vision that gave us a unique opportunity. Traditionally, OCR systems were heavily pipelined, with hand-built and highly-tuned modules taking advantage of all kinds of conditions they could assume to be true for images captured using a flatbed scanner. The process to build these OCR systems was very specialized and labor intensive, and the systems could generally only work with fairly constrained imagery from flat bed scanners. The last few years has seen the successful application of deep learning to numerous problems in computer vision that have given us powerful new tools for tackling OCR without having to replicate the complex processing pipelines of the past, relying instead on large quantities of data to have the system automatically learn how to do many of the previously manually-designed steps. Perhaps the most important reason for building their own system is that it would give DropBox more control over their own destiny, and allow them to work on more innovative features in the future […]