OCR: Make Your Documents Text-Searchable

Optical character recognition (OCR) is now available in TaxWorkFlow. This tool converts scanned paper records into text. The technology has been developed and refined for more than 30 years, and today it works reliably with electronic documents. You can read more about it on Wikipedia.

So what are the benefits of this technology?

  1. All documents become text-searchable. You can search documents for names, addresses, numbers, or anything else. There is no need to read through the whole document, and that’s the biggest advantage of OCR.
  2. Documents can be edited during the OCR process and are reassembled automatically after OCR finishes.
  3. Text can be copied from documents to the clipboard.

TaxWorkFlow uses Tesseract to OCR documents. Tesseract is an open-source OCR engine available under the Apache 2.0 license, and it supports 100+ languages. You can find out more in the Tesseract project documentation. The list of all languages is available inside the application. Turn on the required language and the application will install it automatically.

Once the languages are installed, all you need to do to turn OCR on is select the required language while creating or editing PDF documents in TaxWorkFlow.
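
To give a feel for what Tesseract does on its own, here is a minimal sketch that calls the standalone Tesseract command-line tool from Python. It is only an illustration: the file names are hypothetical, it assumes the `tesseract` program and the chosen language data are installed on your machine, and it is not a description of how TaxWorkFlow invokes the engine internally.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for a Tesseract run that writes a searchable PDF."""
    # "-l" selects the recognition language; the trailing "pdf" asks Tesseract
    # to write <output_base>.pdf with a text layer over the page image.
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def make_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract if it is installed; return the PDF path, or None if it is not."""
    if shutil.which("tesseract") is None:
        return None  # the tesseract executable was not found on PATH
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"


# Hypothetical file names, used only to show the resulting command line.
command = build_tesseract_command("scanned_form.png", "scanned_form")
print(" ".join(command))  # prints: tesseract scanned_form.png scanned_form -l eng pdf
```

Inside TaxWorkFlow none of this is needed, since selecting the language in the document settings is enough.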

The TaxWorkFlow OCR tool can work with more than 100 image formats, including rasterized PDF files. Together with the compression and resolution settings, it can dramatically decrease the size of a document. Let’s look at an example. We scanned a tax form, and the original size of the page is 8.9 MB. You can see the “OCR and Image Quality” settings in the right pane of the image, and the size of the page in the bottom left corner of the picture:

A document of 115 such pages would be about 1 GB, an unacceptable size that could slow down the work process or even stop it. Now let’s change the quick settings to “Color Document”. You can see that the size of the page is now 241.6 KB, which is about 37 times smaller than the original. The quality didn’t change much; the image is still easy to read:

But this is not the limit of TaxWorkFlow. Since our document is black and white, we can change the settings to “B/W Document”. And here is the impressive result of JBIG2 compression: the size of the page is 55.1 KB, which is about 160 times smaller than the original:

The quality of the document is still very high, the background noise has been removed, and all the text in the document can be copied or searched because the OCR language was set to English.
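
If you would like to check the size figures from this example yourself, the arithmetic is short. The sketch below uses only the page sizes quoted above and assumes 1 MB = 1000 KB; with 1024 KB per MB the ratios come out slightly higher.

```python
KB_PER_MB = 1000  # assumption; the article does not say which convention is used

ORIGINAL_PAGE_KB = 8.9 * KB_PER_MB  # scanned page before compression
COLOR_PAGE_KB = 241.6               # "Color Document" quick setting
BW_PAGE_KB = 55.1                   # "B/W Document" quick setting (JBIG2)


def compression_ratio(original_kb, compressed_kb):
    """How many times smaller the compressed page is than the original."""
    return original_kb / compressed_kb


def document_size_mb(page_kb, pages):
    """Total size of a document whose pages all have the same size."""
    return page_kb * pages / KB_PER_MB


color_ratio = compression_ratio(ORIGINAL_PAGE_KB, COLOR_PAGE_KB)
bw_ratio = compression_ratio(ORIGINAL_PAGE_KB, BW_PAGE_KB)
uncompressed_mb = document_size_mb(ORIGINAL_PAGE_KB, 115)

print(f"Color Document: about {color_ratio:.0f} times smaller")
print(f"B/W Document: about {bw_ratio:.0f} times smaller")
print(f"115 uncompressed pages: about {uncompressed_mb:.0f} MB")
```

The results land close to the rounded numbers in the text: roughly 37 times for color, a little over 160 times for black and white, and about 1 GB for 115 uncompressed pages.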

There is no doubt that the built-in OCR tool will dramatically boost the productivity of your company.

The OCR process needs a lot of your PC’s resources, especially if you work with large documents. The application lets you control this process, keep it smooth, and avoid overloading the system by using a resource management mechanism during OCR. Here is how it works:

Conserve memory mode is activated automatically when you work with a large document (more than 25 pages), or you can click the “Conserve memory” button to activate it manually. In this mode, most cached images are saved to disk instead of being loaded into memory. This prevents your PC from running out of memory and keeps memory available for other processes and tasks.
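
The idea behind this mode can be sketched in a few lines of Python. This is a simplified illustration, not TaxWorkFlow’s actual code: the `PageCache` class and `should_conserve_memory` function are hypothetical names, and the sketch writes every page to disk in conserve mode, whereas the application keeps some images in memory.

```python
import os
import tempfile

AUTO_THRESHOLD_PAGES = 25  # from the article: automatic above 25 pages


def should_conserve_memory(page_count, button_pressed=False):
    """Conserve mode is on for large documents or when enabled manually."""
    return button_pressed or page_count > AUTO_THRESHOLD_PAGES


class PageCache:
    """Hypothetical cache: page images stay in RAM, or go to disk in conserve mode."""

    def __init__(self, conserve_memory=False):
        self.conserve_memory = conserve_memory
        self._in_memory = {}  # page number -> image bytes
        self._on_disk = {}    # page number -> path of the temporary file
        self._tmpdir = tempfile.TemporaryDirectory()

    def put(self, page_number, image_bytes):
        if self.conserve_memory:
            # Write the image to a temporary file and remember only its path.
            path = os.path.join(self._tmpdir.name, f"page_{page_number}.bin")
            with open(path, "wb") as fh:
                fh.write(image_bytes)
            self._on_disk[page_number] = path
        else:
            self._in_memory[page_number] = image_bytes

    def get(self, page_number):
        if page_number in self._in_memory:
            return self._in_memory[page_number]
        with open(self._on_disk[page_number], "rb") as fh:
            return fh.read()

    def close(self):
        self._tmpdir.cleanup()  # delete the temporary files


cache = PageCache(conserve_memory=should_conserve_memory(page_count=30))
cache.put(1, b"fake image data")
print(cache.get(1) == b"fake image data")  # prints: True
cache.close()
```

The trade-off is the usual one: reading a page back from disk is slower than reading it from memory, but memory use stays flat no matter how long the document is.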

Conserve CPU mode must be activated manually when you need to take load off the CPU during the OCR process. By default, OCR uses all cores of your CPU, and each core works on one page at a time. For example, if you have a quad-core processor, four pages are processed at a time. That’s fine if you don’t need to work with other applications at the moment, but it can get in the way if you want to work on other tasks during the OCR process. If so, you can click the “Conserve CPU” button, which limits the CPU load and keeps the computer responsive for other tasks you may have.
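
Here is a rough Python sketch of the one-page-per-core idea. Everything in it is an assumption made for illustration: the function names are hypothetical, `recognize_page` is a placeholder instead of real OCR, and halving the worker count in conserve mode is a guess, since the article does not say how far the load is reduced. A thread pool keeps the example short; a real CPU-bound OCR job in Python would more likely use a process pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def choose_worker_count(conserve_cpu, cpu_count=None):
    """One worker per core by default; fewer workers in conserve-CPU mode."""
    cores = cpu_count or os.cpu_count() or 1
    if conserve_cpu:
        return max(1, cores // 2)  # assumed limit, not a documented value
    return cores


def recognize_page(page_number):
    """Placeholder for the OCR of a single page."""
    return f"text of page {page_number}"


def ocr_document(page_numbers, conserve_cpu=False, cpu_count=None):
    """Process pages in parallel, at most one page per worker at a time."""
    workers = choose_worker_count(conserve_cpu, cpu_count)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in page order.
        return list(pool.map(recognize_page, page_numbers))


print(choose_worker_count(conserve_cpu=False, cpu_count=4))  # prints: 4
print(choose_worker_count(conserve_cpu=True, cpu_count=4))   # prints: 2
print(ocr_document([1, 2, 3], conserve_cpu=True, cpu_count=4))
```

With fewer workers the document takes longer to finish, but the remaining cores stay free for whatever else you are doing.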

Both the memory and CPU conserving features are essential for older and heavily loaded computers. We advise you to use both when you work on documents that contain 25+ pages. In addition, you can control the OCR process by disabling it while composing a document.

All these new features, including OCR, the JBIG2 converter, and the PDF Editor, will take your document management to new heights of business productivity!

You can find detailed information about all the features in TaxWorkFlow’s online help or by clicking the Help button inside the application.