HowToOcr

Abstract
Before the development of OCR programs, paper documents had to be converted into digital copies by hand. The main advantages of OCR technology are therefore saved time, fewer errors and reduced effort. It also enables actions that are not possible with physical copies, such as compressing into ZIP files, highlighting keywords, incorporating into a website and attaching to an email.
While taking images of documents enables them to be digitally archived, OCR adds the ability to edit and search those documents. The model can be improved by training the deep neural network on a large amount of data so that it recognizes different fonts and patterns. It can then scan through the document, extract the text and store it at a chosen destination, which might be the console or a text file. The main purpose of this paper is to create a graphical user interface that takes an image or document as an argument and returns the extracted text as structured data.

Definition
Optical Character Recognition, usually abbreviated to OCR, is the conversion of scanned photos or images of typewritten or printed text into machine-encoded, computer-readable text.

Introduction To OCR (optical character recognition)
OCR (optical character recognition) is the use of technology to distinguish printed or handwritten text characters inside digital images of physical documents, such as a scanned paper document. The basic process of OCR involves examining the text of a document and translating the characters into code that can be used for data processing. OCR is sometimes also referred to as text recognition.
OCR systems are made up of a combination of hardware and software that is used to convert physical documents into machine-readable text. Hardware, such as an optical scanner or specialized circuit board, is used to copy or read text, while software typically handles the advanced processing. Software can also take advantage of artificial intelligence (AI) to implement more advanced methods of intelligent character recognition (ICR), like identifying languages or styles of handwriting.
The process of OCR is most commonly used to turn hard-copy legal or historic documents into PDFs. Once the document is in this soft-copy form, users can edit, format and search it as if it had been created with a word processor.

How optical character recognition works
The first step of OCR is using a scanner to process the physical form of a document. 
Once all pages are copied, OCR software converts the document into a two-color, or black and white, version. 
The scanned-in image or bitmap is analyzed for light and dark areas, where the dark areas are identified as characters that need to be recognized and light areas are identified as background.
The dark areas are then processed further to find alphabetic letters or numeric digits.
OCR programs can vary in their techniques, but typically involve targeting one character, word or block of text at a time. 
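
The two-color conversion described above can be sketched in a few lines of Python. This is only a minimal illustration: the image is assumed to be a list of rows of grayscale values (0 = black, 255 = white), and the fixed threshold of 128 is an assumption made for the example, not a value used by any particular OCR program.

```python
def binarize(gray, threshold=128):
    """Map grayscale rows (0-255) to 1 = dark (ink) and 0 = light (background)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def dark_pixel_count(binary):
    """Count the pixels that were classified as ink."""
    return sum(sum(row) for row in binary)


# A tiny hand-made 3x4 "scan": low values are dark, high values are light.
page = [
    [250, 245, 30, 240],
    [248, 20, 25, 251],
    [252, 249, 35, 247],
]
binary = binarize(page)
```

The dark pixels in `binary` are the ones a later stage would group into candidate characters.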

Characters are then identified using one of two algorithms:
Pattern recognition- OCR programs are fed examples of text in various fonts and formats which are then used to compare, and recognize, characters in the scanned document.
Feature detection- OCR programs apply rules regarding the features of a specific letter or number to recognize characters in the scanned document. 
Features could include the number of angled lines, crossed lines or curves in a character for comparison. For example, the capital letter “A” may be stored as two diagonal lines that meet with a horizontal line across the middle.
When a character is identified, it is converted into an ASCII code that can be used by computer systems to handle further manipulations. 
Users should correct basic errors, proofread and make sure complex layouts were handled properly before saving the document for future use.
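
As a rough illustration of the pattern recognition approach, the sketch below compares an unknown glyph against stored examples and picks the closest one. The 3x3 bitmaps in TEMPLATES and the pixel-agreement score are invented for this example; real OCR programs use far larger templates and more robust comparisons.

```python
# Hypothetical 3x3 templates: "1" is a dark pixel, "0" is background.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def similarity(glyph, template):
    """Fraction of pixels on which the glyph and the template agree."""
    matches = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    total = sum(len(row) for row in template)
    return matches / total


def recognize(glyph, templates=TEMPLATES):
    """Return the label of the template that best matches the glyph."""
    return max(templates, key=lambda label: similarity(glyph, templates[label]))


# An "L" with one missing pixel is still closest to the "L" template.
noisy_l = ["100", "100", "110"]
label = recognize(noisy_l)
code = ord(label)  # the character code passed on to later processing
```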

OCR can be used for a variety of applications, including:
  1. Scanning printed documents into versions that can be edited with word processors, like Microsoft Word or Google Docs.
  2. Indexing print material for search engines.
  3. Automating data entry, extraction and processing.
  4. Deciphering documents into text that can be read aloud to visually-impaired or blind users.
  5. Archiving historic information, such as newspapers, magazines or phone-books, into searchable formats.
  6. Electronically depositing checks without the need for a bank teller.
  7. Placing important, signed legal documents into an electronic database.
  8. Recognizing text, such as license plates, with a camera or software.
  9. Sorting letters for mail delivery.
  10. Translating words within an image into a specified language.

 Problem definition 
  1. The model must be able to extract data from an image of any size or resolution, including distorted images.
  2. Characters must be properly separated for greater accuracy.
  3. Input given to the system must be a .bmp, .png, .jpg or .jpeg file.
  4. There should be a constant distance between characters and between rows to ensure accuracy.
  5. The system will recognize any set of characters provided that they are written in a font recognized by the model or neural network.
  6. The tool must work on all types of operating systems (allowing for variation in file extensions between systems).
  7. The OCR must support multiple plugins, or work for most major languages by changing only the language script.

  Objective
  1. For static OCR, the software should provide a way to load scanned images for recognition.
  2. If the scanned image does not have a black background and white foreground, the software should provide a facility for image inversion.
  3. The software should process the image and extract characters.
  4. The user should be able to extract the text in the desired format.
  5. For dynamic OCR, the software should recognize characters in any font and color.
  6. If the software does not give proper output, there should be a way to train its database or model so that it gives better results next time.

  Description: Pytesseract-

     Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and “read” the text embedded in images.
     Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to Tesseract, as it can read all image types supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff, and others, whereas Tesseract-OCR by default only supports tiff and bmp. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.
     It is widely used to process scanned documents.

   Functions:
  1. get_tesseract_version: returns the Tesseract version installed on the system.
  2. image_to_string: returns the result of a Tesseract OCR run on the image as a string.
  3. image_to_boxes: returns a result containing recognized characters and their box boundaries.
  4. image_to_data: returns a result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, please check the Tesseract TSV documentation.
  5. image_to_osd: returns a result containing information about orientation and script detection.
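
The output of image_to_data is usually filtered before use. The sketch below assumes a dictionary shaped like the one returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, with parallel lists under keys such as "text", "conf", "left", "top", "width" and "height". The helper name `confident_words`, the threshold of 60 and the sample values are all made up for illustration; the sample is not real OCR output.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, confidence, box) tuples for words at or above min_conf."""
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        # Rows for pages, blocks and lines carry empty text and a conf of -1.
        if text.strip() and float(conf) >= min_conf:
            words.append((text, float(conf), (left, top, width, height)))
    return words


# Hand-written sample in the assumed shape.
sample = {
    "text": ["", "Invoice", "N0", "42"],
    "conf": ["-1", "96", "31", "88"],
    "left": [0, 10, 90, 120],
    "top": [0, 12, 12, 12],
    "width": [200, 70, 20, 25],
    "height": [40, 18, 18, 18],
}
kept = confident_words(sample)
```

Confidences are converted with `float()` because, depending on the version, they may arrive as strings or numbers.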

    Installation:
      $ sudo pip install pytesseract

    Requirements:
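     * Requires Python 2.5 or later versions (as stated for early releases; current pytesseract releases require Python 3).
     * Requires the Python Imaging Library (PIL), now maintained as Pillow.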

Flow of execution-How tesseract works:
The line finding algorithm is one of the few parts of Tesseract that has previously been published. The line finding algorithm is designed so that a skewed page can be recognized without having to de-skew, thus saving loss of image quality. The key parts of the process are blob filtering and line construction. Assuming that page layout analysis has already provided text regions of a roughly uniform text size, a simple percentile height filter removes drop-caps and vertically touching characters. The median height approximates the text size in the region, so it is safe to filter out blobs that are smaller than some fraction of the median height, being most likely punctuation, diacritical marks and noise. The filtered blobs are more likely to fit a model of non-overlapping, parallel, but sloping lines. Sorting and processing the blobs by x-coordinate makes it possible to assign blobs to a unique text line, while tracking the slope across the page, with greatly reduced danger of assigning to an incorrect text line in the presence of skew. Once the filtered blobs have been assigned to lines, a least median of squares fit is used to estimate the baselines, and the filtered-out blobs are fitted back into the appropriate lines. The final step of the line creation process merges blobs that overlap by at least half horizontally, putting diacritical marks together with the correct base and correctly associating parts of some broken characters.
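
The blob filtering step can be illustrated with a small sketch. It works only on blob heights, and the two cut-off fractions (0.5 and 2.0 times the median) are assumptions chosen for the example; the text above does not state the values Tesseract uses.

```python
from statistics import median


def filter_blobs(heights, min_fraction=0.5, max_fraction=2.0):
    """Split blob heights into (kept, filtered_out) relative to the median height."""
    med = median(heights)
    kept, filtered_out = [], []
    for height in heights:
        if min_fraction * med <= height <= max_fraction * med:
            kept.append(height)          # likely ordinary text characters
        else:
            filtered_out.append(height)  # punctuation, noise or drop-caps
    return kept, filtered_out


# Mostly ~20 px characters, plus two specks and one drop-cap.
heights = [20, 22, 21, 4, 19, 60, 23, 3]
kept, filtered_out = filter_blobs(heights)
```

The filtered-out blobs are not discarded: as described above, they are fitted back into the appropriate lines once the baselines have been estimated.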

Baseline fitting:
Once the text lines have been found, the baselines are fitted more precisely using a quadratic spline. This was another first for an OCR system, and enabled Tesseract to handle pages with curved baselines, which are a common artifact in scanning, and not just at book bindings. The baselines are fitted by partitioning the blobs into groups with a reasonably continuous displacement for the original straight baseline. A quadratic spline is fitted to the most populous partition, (assumed to be the baseline) by a least squares fit. The quadratic spline has the advantage that this calculation is reasonably stable, but the disadvantage that discontinuities can arise when multiple spline segments are required. A more traditional cubic spline might work better.
Fixed Pitch Detection and Chopping:
Tesseract tests the text lines to determine whether they are fixed pitch. Where it finds fixed pitch text, Tesseract chops the words into characters using the pitch, and disables the chopper and associator on these words for the word recognition step.
Proportional Word Finding:
Detecting word boundaries in non-fixed-pitch (proportional) text is a highly non-trivial task. For example, the gap between the tens and units of ‘11.9%’ can be a similar size to the general space, and certainly larger than the kerned space between words such as ‘erated’ and ‘junk’, while there may be no horizontal gap at all between the bounding boxes of ‘of’ and ‘financial’. Tesseract solves most of these problems by measuring gaps in a limited vertical range between the baseline and mean line. Spaces that are close to the threshold at this stage are made fuzzy, so that a final decision can be made after word recognition.

Word Recognition:
Part of the recognition process for any character recognition engine is to identify how a word should be segmented into characters. The initial segmentation output from line finding is classified first. The rest of the word recognition step applies only to non-fixed pitch text.

Chopping Joined Character:

While the result from a word is unsatisfactory, Tesseract attempts to improve the result by chopping the blob with the worst confidence from the character classifier. Candidate chop points are found from concave vertices of a polygonal approximation [2] of the outline, and may have either another concave vertex opposite, or a line segment. It may take up to 3 pairs of chop points to successfully separate joined characters from the ASCII set.

Associating Broken Characters: 
When the potential chops have been exhausted, if the word is still not good enough, it is given to the associator. The associator makes an A* (best first) search of the segmentation graph of possible combinations of the maximally chopped blobs into candidate characters. It does this without actually building the segmentation graph, but instead maintains a hash table of visited states. The A* search proceeds by pulling candidate new states from a priority queue and evaluating them by classifying unclassified combinations of fragments. It may be argued that this fully-chop-then-associate approach is at best inefficient, at worst liable to miss important chops, and that may well be the case. The advantage is that the chop-then-associate scheme simplifies the data structures that would be required to maintain the full segmentation graph. When the A* segmentation search was first implemented in about 1989, Tesseract accuracy on broken characters was well ahead of the commercial engines of the day. An essential part of that success was the character classifier that could easily recognize broken characters.

Static Character Classifier:
An early version of Tesseract used topological features developed from the work of Shillman et al. Though nicely independent of font and size, these features are not robust to the problems found in real-life images, as Bokser describes. An intermediate idea involved the use of segments of the polygonal approximation as features, but this approach is also not robust to damaged characters.
The breakthrough solution is the idea that the features in the unknown need not be the same as the features in the training data. During training, the segments of a polygonal approximation are used for features, but in recognition, features of a small, fixed length (in normalized units) are extracted from the outline and matched many-to-one against the clustered prototype features of the training data. In the original paper's illustration of this matching, the short, thick lines are the features extracted from the unknown, and the thin, longer lines are the clustered segments of the polygonal approximation that are used as prototypes. One prototype bridging the two pieces is completely unmatched. Three features on one side and two on the other are unmatched, but, apart from those, every prototype and every feature is well matched. This example shows that this process of small features matching large prototypes is easily able to cope with recognition of damaged images. Its main problem is that the computational cost of computing the distance between an unknown and a prototype is very high. The features extracted from the unknown are thus 3-dimensional (x, y position, angle), with typically 50-100 features in a character, and the prototype features are 4-dimensional (x, y position, angle, length), with typically 10-20 features in a prototype configuration.

Classification:
Classification proceeds as a two-step process. In the first step, a class pruner creates a shortlist of character classes that the unknown might match. Each feature fetches, from a coarsely quantized 3-dimensional lookup table, a bit-vector of classes that it might match, and the bit-vectors are summed over all the features. The classes with the highest counts (after correcting for expected number of features) become the short-list for the next step. Each feature of the unknown looks up a bit vector of prototypes of the given class that it might match, and then the actual similarity between them is computed. Each prototype character class is represented by a logical sum-of-product expression with each term called a configuration, so the distance calculation process keeps a record of the total similarity evidence of each feature in each configuration, as well as of each prototype. The best combined distance, which is calculated from the summed feature and prototype evidences, is the best over all the stored configurations of the class.

Training Data:
Since the classifier is able to recognize damaged characters easily, the classifier was not trained on damaged characters. In fact, the classifier was trained on a mere 20 samples of 94 characters from 8 fonts in a single size, but with 4 attributes (normal, bold, italic, bold italic), making a total of 60160 training samples (20 × 94 × 8 × 4). This is a significant contrast to other published classifiers, such as the Calera classifier with more than a million samples, and Baird’s 100-font classifier with 1175000 training samples.

Linguistic Analysis:
Tesseract contains relatively little linguistic analysis. Whenever the word recognition module is considering a new segmentation, the linguistic module (mis-named the permuter) chooses the best available word string in each of the following categories: Top frequent word, Top dictionary word, Top numeric word, Top UPPER case word, Top lower case word (with optional initial upper), Top classifier choice word. The final decision for a given segmentation is simply the word with the lowest total distance rating, where each of the above categories is multiplied by a different constant. Words from different segmentations may have different numbers of characters in them. It is hard to compare these words directly, even where a classifier claims to be producing probabilities, which Tesseract does not. This problem is solved in Tesseract by generating two numbers for each character classification. The first, called the confidence, is minus the normalized distance from the prototype. This enables it to be a “confidence” in the sense that greater numbers are better, but still a distance, as, the farther from zero, the greater the distance. The second output, called the rating, multiplies the normalized distance from the prototype by the total outline length in the unknown character. Ratings for characters within a word can be summed meaningfully, since the total outline length for all characters within a word is always the same.
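
The two numbers described above can be written down directly. The sketch below follows the definitions in the text (confidence is minus the normalized distance; rating is the normalized distance times the outline length). The distances and outline lengths are invented values, and the per-category constants mentioned above are left out for brevity.

```python
def confidence(normalized_distance):
    """Greater is better, but the value is still a (negated) distance."""
    return -normalized_distance


def rating(normalized_distance, outline_length):
    """Distance weighted by how much outline the character covers."""
    return normalized_distance * outline_length


def word_rating(characters):
    """Sum the ratings of (normalized_distance, outline_length) pairs."""
    return sum(rating(distance, length) for distance, length in characters)


# Two hypothetical segmentations of the same word image.
# The outline lengths add up to 100 in both, so the sums are comparable.
two_chars = [(0.25, 60), (0.5, 40)]
three_chars = [(0.25, 40), (0.25, 30), (0.5, 30)]

best = min([two_chars, three_chars], key=word_rating)
```

Because the total outline length is the same for both segmentations, the word with the lower summed rating can be chosen even though the two candidates contain different numbers of characters.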

Adaptive classifier:
It has been suggested and demonstrated that OCR engines can benefit from the use of an adaptive classifier. Since the static classifier has to be good at generalizing to any kind of font, its ability to discriminate between different characters or between characters and non-characters is weakened. A more font-sensitive adaptive classifier that is trained by the output of the static classifier is therefore commonly used to obtain greater discrimination within each document, where the number of fonts is limited. Tesseract does not employ a template classifier, but uses the same features and classifier as the static classifier. The only significant difference between the static classifier and the adaptive classifier, apart from the training data, is that the adaptive classifier uses isotropic baseline/x-height normalization, whereas the static classifier normalizes characters by the centroid (first moments) for position and second moments for anisotropic size normalization. The baseline/x-height normalization makes it easier to distinguish upper and lower case characters as well as improving immunity to noise specks. The main benefit of character moment normalization is removal of font aspect ratio and some degree of font stroke width. It also makes recognition of sub and superscripts simpler, but requires an additional classifier feature to distinguish some upper and lower case characters. The original paper illustrates this with an example of 3 letters in baseline/x-height normalized form and moment normalized form.

Technical execution:
Performing OCR by running parallel instances of tesseract 4.0: python
  1.   Installing Tesseract OCR Engine.
  2.   Running Tesseract with Command line.
  3.   Running Tesseract with Python
  4.   Running Parallel instances for Speed up
  5.   Building the Pipeline for Real World Application.

1. Installing tesseract OCR engine:
Tesseract is a popular open source project for OCR; its source code is hosted on GitHub at github.com/tesseract-ocr/tesseract. More recently (in 2016), the Tesseract developers implemented LSTM-based deep neural network (DNN) models (Tesseract 4.0) to perform OCR, which are more accurate and faster than the previous conventional models.
Installing Tesseract on Windows is easy with the precompiled binaries. You can download and install the beta version exe from the Mannheim University Library page. Do not forget to edit the “path” environment variable and add the Tesseract path.

2. Running tesseract: command line:
You can see the converted text on the command line by typing the following:
  • tesseract image_path stdout
  • To write the output to a file (Tesseract appends .txt to the output base name, so this creates result.txt): tesseract image_path result
  • To specify the language model name (by default it takes English): tesseract image_path result -l eng
  • There are various page segmentation modes available as a parameter. The mode directs the layout analysis that Tesseract performs on the page. There are 14 modes available, which can be listed by running tesseract --help-psm. By default, Tesseract fully automates the page segmentation, but does not perform orientation and script detection. To specify the parameter, type the following:
  • tesseract image_path result -l eng --psm 6
  • OCR multiple pages in one run of Tesseract. Prepare a text file (savedlist.txt) that has the path to each image:
            path/to/1.jpg
            path/to/2.png
            path/to/3.tiff
          Save it, and then give its name as the input file to Tesseract. “output.txt” will contain the text generated from all the files in the list, demarcated by the page separator character.
          tesseract savedlist.txt output
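
The same command lines can be assembled and run from Python with the standard subprocess module. This is a minimal sketch that assumes the tesseract executable is on the PATH; the helper names `build_tesseract_command` and `run_tesseract` are made up for this example.

```python
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang="eng", psm=None):
    """Assemble the argument list for one tesseract command-line call."""
    command = ["tesseract", image_path, output_base, "-l", lang]
    if psm is not None:
        command += ["--psm", str(psm)]
    return command


def run_tesseract(image_path, lang="eng", psm=None):
    """Run tesseract on one image and return the text it prints to stdout."""
    command = build_tesseract_command(image_path, "stdout", lang, psm)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Equivalent to typing: tesseract page.png result -l eng --psm 6
example = build_tesseract_command("page.png", "result", psm=6)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.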

3. Running tesseract: python
There are a few wrappers built on top of the Tesseract library in Python. Python-tesseract (pytesseract) is a Python wrapper for Google’s Tesseract-OCR. Use the following pip command to install the wrapper:
pip install pytesseract
Once you install the wrapper package, you are ready to write Python code for performing OCR. Just note that pytesseract is only a wrapper to access the methods of Tesseract and still requires Tesseract to be installed on the system.
We will write a simple Python function, def ocr(image_path), to perform OCR. It takes an image path as input, performs OCR, writes the generated text to a .txt file and returns the output file name.
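
One possible shape for the ocr(image_path) function described above is sketched here. It assumes that pytesseract, Pillow and the Tesseract engine are installed; writing the output next to the image with a .txt extension is a choice made for this example, and `output_path_for` is a hypothetical helper name.

```python
from pathlib import Path


def output_path_for(image_path):
    """Return the .txt path that sits next to the given image."""
    return str(Path(image_path).with_suffix(".txt"))


def ocr(image_path, lang="eng"):
    """Run OCR on one image, write the text to a .txt file, return its name."""
    # Imported inside the function so that output_path_for() can be used
    # even on a machine where the OCR stack is not installed.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open(image_path), lang=lang)
    out_path = output_path_for(image_path)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return out_path
```

A call such as `ocr("scans/page1.png")` would then write the recognized text to scans/page1.txt and return that path.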

4. Running parallel instance for speed up:
  • We defined a function which takes an input image path and converts it into readable text. An obvious question of scale arises when we have to process a large number of images, for example 1 million images. Some ideas that one can try are listed below.
  • Multi-Threading: If the system has 4 physical cores, one can run 4 parallel instances of Tesseract and thus perform OCR on 4 images in parallel.
  • Multi-page Feature: The multi-page feature of Tesseract is much faster than converting single images sequentially. To speed up the process, one should make a list of image paths and feed it to Tesseract.
  • Using SSDs or RAM as Disk: If there is a large number of images, this can save a lot of I/O time. SSDs have faster access and loading times.
  • Running in a Distributed System: Use MPI for Python on a distributed system and scale it as much as you want. This differs from multi-threading in that it is not limited to the number of cores of a single system. You may have to bear more cost in terms of hardware.
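
The multi-threading idea can be sketched with the standard concurrent.futures module. The function below accepts any single-image OCR function, so it is shown with a stand-in (`fake_ocr`, invented for this example) instead of a real Tesseract call; a worker count of 4 mirrors the 4-core example above and is not a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_in_parallel(image_paths, ocr_func, workers=4):
    """Apply ocr_func to every path using a pool of worker threads.

    Results are returned in the same order as image_paths.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_func, image_paths))


def fake_ocr(image_path):
    """Stand-in for a real OCR function: just derives an output name."""
    return image_path + ".txt"


outputs = ocr_in_parallel(["a.png", "b.png", "c.png"], fake_ocr, workers=2)
```

Threads are a reasonable fit when each OCR call launches a separate Tesseract process, since the Python threads then spend most of their time waiting for that process to finish.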

5. Building the pipeline for real world applications:
The Architecture of the ICR system consists of the 4 main components. They are shown below in sequence:

1. Image Corrector:
  1. The idea is to prepare the input image in order to do better text recognition in OCR component.
  2. Rectification of Image (Image Correction)
  3. Removal of borders from image.
  4. Text Segmentation & background cleaning.
  5. Use of OpenCV and image processing tools like ImageMagick.

2. Optical Character Recognizer:
  1. Implementation of the state-of-the-art technique used in OCR.
  2. Using an open source model: Tesseract.
  3. It is a DNN based on long short-term memory (LSTM), published in 2016.
  4. Training of Tesseract is required for recognizing new fonts or handwritten text.

3. Text Processor & Corrector:
  1. Implementation of spell-checker to further improve accuracy.
  2. Generated text needs post-processing in order to extract important fields.
  3. Use of Regex and text processing libraries.
  4. If necessary, we may set up the layout of the text.
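
As a small illustration of the regex-based post-processing step, the sketch below pulls dates and amounts out of OCR text. The two patterns (dd/mm/yyyy dates and amounts with two decimals) and the sample text are assumptions made for this example; real documents would need patterns tailored to their own fields.

```python
import re

# Hypothetical field formats for this example.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
AMOUNT_PATTERN = re.compile(r"\b\d+\.\d{2}\b")


def extract_fields(text):
    """Return the dates and amounts found in a block of OCR text."""
    dates = DATE_PATTERN.findall(text)
    amounts = [float(value) for value in AMOUNT_PATTERN.findall(text)]
    return {"dates": dates, "amounts": amounts}


sample_text = "Invoice date: 15/01/2019\nTotal due: 249.50"
fields = extract_fields(sample_text)
```

The resulting dictionary is the kind of structured record that the next component could populate into a database.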

4. Data Population & Insight Generation:
  1. Extracted fields to be populated in Database (Unstructured to structured data).
  2. It will augment the features/variables and improve the data quality.
  3. Insight generation to help business.
  4. Can be utilized for creating a documents exploration system.
A basic architecture of the end-to-end application is given below:


OCRopus: Introduction to ocropus:
OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do some image preprocessing, and possibly also train new models.
In addition to the recognition scripts themselves, there are a number of scripts for ground truth editing and correction, measuring error rates, determining confusion matrices, etc. OCRopus commands will generally print a stack trace along with an error message; this is not generally indicative of a problem (in a future release, we'll suppress the stack trace by default since it seems to confuse too many users).

Requirements:
Numpy(1.9.2)
Scipy(0.15.1)
Matplotlib(1.4.3)
Pillow(2.7.0)
Lxml(3.5.0)

Features include:
  • Pluggable layout analysis.
  • Pluggable character recognition.
  • Pluggable language modeling.
  • Text line recognizer based on recurrent neural networks (and does not require language  modeling).
  • Models for both Latin script and Fraktur.
  • Tools for ground truth labeling.
  • Sample scripts illustrating recognition and training.
  • Layout analysis plugin does image preprocessing and layout analysis.
  • Unicode and ligature support.

Limitations of ocropus
  1. It requires images of a specified size or dimension.
  2. It works only in Python 2.x.
  3. Its performance is slower than Tesseract.
  4. Its default language is German, and it needs to be configured for English manually by downloading the English package.

Usage:
OCRopus can be used from the command line. Once installed, it can be invoked by specifying the input images. It will output the recognized text to standard output directly or write it as HOCR (HTML-based) code into files, from which it then can be transformed to a searchable PDF. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line).
Example for the OCRopus calls to recognize the text in an image:

# perform binarization
ocropus-nlbin tests/ersch.png -o book

# perform page layout analysis
ocropus-gpageseg book/0001.bin.png

# perform text line recognition (with a fraktur model)
ocropus-rpred -m models/fraktur.pyrnn.gz book/0001/*.bin.png

# generate HTML output
ocropus-hocr book/0001.bin.png -o book/0001.html

Description
OCRopus was designed especially for use in high-volume digitization projects of books, such as Google Books, the Internet Archive or libraries. A large number of languages and fonts are to be supported. However, it can also be used for desktop and office applications or for applications for visually impaired people.

The main components of OCRopus are:
analysis of the document layout
optical character recognition
use of statistical language models

Single or multiple scripts are available for these components. The modular approach allows individual workflows to be used and individual steps to be exchanged.
By default, OCRopus comes with a model for English texts and a model for text in Fraktur. These models refer to the script and are largely independent of the actual language. New characters or language variants can be trained either from scratch or in addition to an existing model.
Recent text recognition is based on recurrent neural networks (LSTM) and does not require a language model. This makes it possible to train language-independent models for which good recognition results for English, German and French have been shown at the same time.[4] In addition to the Latin script, there are results for other scripts such as Sanskrit, Urdu, Devanagari and Greek.
Very good detection rates can be achieved through an appropriate training. This extra effort is particularly worthwhile for difficult documents or scripts that are no longer common today, which are not in the focus of other OCR software.


OCR PIPELINE: Problem Description and Pipeline
     • Photo OCR (Optical Character Recognition) Problem
1. Given a picture, detect the location of text in the picture
2. Read the text at that location

     • Photo OCR Pipeline
1. Text detection
2. Character segmentation
     • Splitting “ADD”, for example, into individual characters
3. Character classification
     • First character “A”, second “D”, and so on

When you design a machine learning system, one of the most important steps is defining the pipeline:
     • A pipeline is a sequence of steps or components for the algorithm
     • Each step/module can be worked on by different groups to split the workload


Text detection
Positive examples (y = 1), patches with text
Negative examples (y = 0), patches without text
Let us run a sliding window classifier on the image
o The classifier output shows white areas that indicate text areas
o Bright white: the classifier output a very high probability of text at that location
We then take one more step by applying an expansion operator to the output of the classifier
o It takes the white regions and expands them
o We then use heuristics to discard regions with an abnormal height-to-width ratio
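
The three ingredients above (sliding window positions, the expansion operator and the aspect-ratio heuristic) can be sketched without any image library. The mask is assumed to be a small grid of 0/1 classifier decisions; the 3x3 neighbourhood and the minimum width-to-height ratio of 1.0 are assumptions chosen for this example.

```python
def sliding_windows(image_width, image_height, window, step):
    """Yield the (x, y) top-left corner of every window position."""
    for y in range(0, image_height - window + 1, step):
        for x in range(0, image_width - window + 1, step):
            yield x, y


def expand(mask):
    """Turn on every pixel that has an 'on' pixel in its 3x3 neighbourhood."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for rr in range(max(0, r - 1), min(rows, r + 2)):
                for cc in range(max(0, c - 1), min(cols, c + 2)):
                    if mask[rr][cc]:
                        out[r][c] = 1
    return out


def plausible_text_box(width, height, min_ratio=1.0):
    """Keep boxes that are at least min_ratio times as wide as they are tall."""
    return height > 0 and width / height >= min_ratio


# Two isolated detections merge into one region after expansion.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
expanded = expand(mask)
positions = list(sliding_windows(4, 4, window=2, step=2))
```

After expansion the two detections form a single wide region, which passes the ratio test, while a tall, narrow region would be discarded.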

1. Text detection 
2. Character segmentation 
3. Character classification

Getting Lots of Data and Artificial Data
Artificial data synthesis
o Creating data from scratch
o If we have a small training set, we turn that into a large training set
Example of artificial data synthesis for photo OCR: Method 1 (new data)
o We can take free fonts, copy the alphabets and paste them on random backgrounds
o Images produced in this way are synthesized rather than photographed
Discussion on getting more data

1. Make sure you have a low bias (high variance) classifier before expending the effort to get more data
o Plot the learning curves to find out
o Keep increasing the number of features or number of hidden units in the neural network until you have a low bias classifier
2. How much work would it be to get 10x as much data as you currently have?
o Artificial data synthesis
o Collect/label it yourself
o Crowdsource: hire people on the web to label data (Amazon Mechanical Turk)
Ceiling Analysis: What Part of the Pipeline to Work on Next
Ceiling analysis
o When you have a team working on a pipeline machine learning system, this gives you an indication of which part of the pipeline is worth working on
Ceiling analysis definition
o Estimating the errors due to each component
Photo OCR example
o Choose any metric you would like (accuracy is used here)
o Overall system: 72% accuracy
o Text detection
     • Put a check mark on “text detection”: go to the test set and give the system the correct answers for this component
     • It’s as if you have a perfect text detection system
     • Check the accuracy of the whole system (72% to 89%: 17% improvement)
o Character segmentation
     • Run the algorithm and go to the next component in the pipeline
     • Give it the correct “character segmentation”
     • Check the accuracy of the whole system (89% to 90%: 1% only)
o Character classification
     • Run the algorithm on the last component in the pipeline
     • Check the accuracy of the whole system (90% to 100%: 10%)
o This shows the upside potential from each component
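
The arithmetic of the ceiling analysis can be written as a short calculation. The accuracies below are the ones quoted in the example above; the function name `ceiling_gains` is made up for this sketch.

```python
def ceiling_gains(stages):
    """Return the accuracy gain attributable to each component.

    stages is a list of (name, accuracy) pairs in pipeline order, starting
    with the unmodified system and then each component made "perfect" in turn.
    """
    gains = {}
    for (_, before), (name, after) in zip(stages, stages[1:]):
        gains[name] = after - before
    return gains


stages = [
    ("overall system", 72),
    ("text detection", 89),
    ("character segmentation", 90),
    ("character classification", 100),
]
gains = ceiling_gains(stages)
most_promising = max(gains, key=gains.get)
```

With these numbers, text detection offers the largest potential gain and character segmentation the smallest, so effort is better spent on text detection first.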

References:

https://searchcontentmanagement.techtarget.com/definition/OCR-optical-character-recognition
https://pdfs.semanticscholar.org/6a4b/4f04d5ce3c3592832eb40c23cc8fc5a9131e.pdf
https://micropyramid.com/blog/extract-text-with-ocr-for-image-files-in-python-using-pytesseract
http://machinelearningmedium.com/2019/01/15/breaking-down-tesseract-ocr
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf
https://www.researchgate.net/publication/304189593_Road_sign_recognition_system_on_Raspberry
https://www.cse.iitb.ac.in/~saketh/research/manasaMTP.pdf
https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python
https://mobiles-han.blogspot.com/2018/07/performing-ocr-by-running-parallel.html
https://mlichtenberg.wordpress.com/2015/11/04/tuning-tesseract-ocr
https://nanonets.com/blog/ocr-with-tesseract
https://github.com/tesseract-ocr/tesseract/issues/928
https://en.wikipedia.org/wiki/OCRopus
https://www.ritchieng.com/machine-learning-photo-ocr
https://github.com/tmbdev/ocrop
https://www.c-4-c.com/ocropus-36


