DropBox uses Deep Learning to OCR your Documents

DropBox OCR

The need for in Document Management

copyright by

This post will take you behind the scenes on how DropBox built a state-of-the-art Optical Character Recognition () pipeline for their mobile document scanner. They used computer vision and advances such as bi-directional Long Short Term Memory (LSTMs), Connectionist Temporal Classification (CTC), convolutional neural nets (CNNs), and more. The document scanner makes it possible to use the mobile phone to take photos and “scan” items like receipts and invoices. However, the mobile document scanner only outputs an image — any text in the image is just a set of pixels as far as the computer is concerned, and can’t be copy-pasted, searched for, or any of the other things you can do with text. Hence the need to apply Optical Character Recognition, or . This process extracts actual text from our doc-scanned.

Ready-made vs. building own pipeline

When DropBox built the first version of the mobile document scanner, they used a commercial off-the-shelf library, in order to do product validation before diving too deep into creating their own -based system. This meant integrating the commercial system into the scanning pipeline, offering both features above to business users to see if they found sufficient use from the . Once it was confirmed that there was indeed strong user demand for the mobile document scanner and , they decided to build their own in-house system for several reasons. First, there was a cost consideration: having their own system would save them significant money as the licensed commercial SDK charges based on the number of scans. Second, the commercial system was tuned for the traditional world of images from flat bed scanners, whereas this operating scenario was much tougher, because mobile phone photos are far more unconstrained, with crinkled or curved documents, shadows and uneven lighting, blurriness and reflective highlights, etc.

SwissCognitive LogoBuilding the system using

In fact, a sea change has happened in the world of computer vision that gave us a unique opportunity. Traditionally, systems were heavily pipelined, with hand-built and highly-tuned modules taking advantage of all kinds of conditions they could assume to be true for images captured using a flatbed scanner. The process to build these systems was very specialized and labor intensive, and the systems could generally only work with fairly constrained imagery from flat bed scanners. The last few years has seen the successful application of to numerous problems in computer vision that have given us powerful new tools for tackling without having to replicate the complex processing pipelines of the past, relying instead on large quantities of data to have the system automatically learn how to do many of the previously manually-designed steps. Perhaps the most important reason for building their own system is that it would give DropBox more control over their own destiny, and allow them to work on more innovative features in the future […]

read more – copyright by


Comments are closed.