Mastering the OCR for PDF Documents

In this article we take a closer look at creating and editing PDF documents. As we mentioned in the “OCR: make your documents text-searchable” post, the OCR process runs in multiple threads. The number of threads equals the number of CPU cores, and each thread processes one page at a time. The time OCR takes for one page depends on several factors, such as the page content, the CPU model, and how heavily other applications are using the CPU. On an Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz, processing one page of a 1040 form takes 22 seconds on average.
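The threading scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not TaxWorkFlow's actual code: the names `ocr_document` and `fake_ocr` are hypothetical, and `fake_ocr` stands in for a real OCR engine. Note that `os.cpu_count()` reports logical CPUs, which may differ from the number of physical cores.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def ocr_document(pages, ocr_page, workers=None):
    """Run ocr_page over every page, using one worker thread per CPU."""
    # os.cpu_count() may return None on some systems; fall back to 1 worker.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker takes one page at a time; map() keeps page order.
        return list(pool.map(ocr_page, pages))


def fake_ocr(page_image):
    # Hypothetical placeholder: a real engine would return recognized text.
    return f"text of {page_image}"


print(ocr_document(["page1.png", "page2.png", "page3.png"], fake_ocr))
```

In Python, threads only speed up OCR if the engine releases the interpreter lock while it works, as native libraries typically do; a pure-Python engine would need worker processes instead.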

The CPU in this example supports AVX (Advanced Vector Extensions). We optimized TaxWorkFlow for CPUs that support AVX, which cut image-processing time by about 50%: a CPU with the same specifications but without AVX processes the same 1040 form page in about 44 seconds.
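To get a feel for what these per-page figures mean for a whole document, here is a rough back-of-the-envelope estimate. It assumes every page takes the average time and that pages are processed in batches of one page per thread; the page count and thread count below are hypothetical, and `estimate_ocr_seconds` is an illustrative helper, not part of TaxWorkFlow.

```python
import math


def estimate_ocr_seconds(pages, workers, seconds_per_page):
    """Rough wall-clock estimate: pages are processed in batches of `workers`."""
    batches = math.ceil(pages / workers)
    return batches * seconds_per_page


# Per-page averages are the ones quoted in this article;
# the 10-page document and 2 worker threads are hypothetical.
print(estimate_ocr_seconds(pages=10, workers=2, seconds_per_page=22))  # with AVX
print(estimate_ocr_seconds(pages=10, workers=2, seconds_per_page=44))  # without AVX
```

Real timings will vary with page content and with whatever else the CPU is doing, so treat the result as an order-of-magnitude guide only.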

Because OCR runs in the background, the final version of the document is not ready until OCR finishes, so you can save the document only after its last page has been OCR-ed. While OCR is in progress, the document is updated each time a page finishes, so that the page can be included in the on-screen preview. This can be annoying if you are looking through the document at that moment, because after each update the document reopens at its first page. To avoid this, click “Disable Refresh Upon OCR”.
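The behavior described above can be modeled as a small piece of state. The sketch below is purely illustrative: the class `OcrPreview` and its fields are hypothetical names, and it assumes only what this article states (a refresh returns the preview to page 1, and saving is possible once every page has been OCR-ed).

```python
from dataclasses import dataclass, field


@dataclass
class OcrPreview:
    total_pages: int
    refresh_on_ocr: bool = True   # False models "Disable Refresh Upon OCR"
    current_page: int = 1         # page the user is currently viewing
    done: set = field(default_factory=set)

    def page_finished(self, page):
        """Called when background OCR completes one page."""
        self.done.add(page)
        if self.refresh_on_ocr:
            # Refreshing reloads the document, which reopens at page 1.
            self.current_page = 1

    @property
    def can_save(self):
        # The final document exists only after every page is OCR-ed.
        return len(self.done) == self.total_pages


preview = OcrPreview(total_pages=3, refresh_on_ocr=False)
preview.current_page = 2
preview.page_finished(1)
print(preview.current_page, preview.can_save)
```

With refresh disabled, the user stays on the page being read while OCR continues, and the document still becomes savable once the last page is done.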

Most likely, not all of your clients’ scanners can OCR images, so you will sometimes receive PDF documents that have not been OCR-ed. With TaxWorkFlow PDF tools, you can OCR such documents. The PDF creator either uses the original images in the PDF or converts each PDF page into one large image, depending on how the original PDF was created. The final document format (compression quality, image formats, etc.) can be optimized to your needs, just as with scanned images. Please note that OCR always works on the original document, which provides the best possible quality, while the final document’s quality depends on the compression and resolution you select. We recommend using the TaxWorkFlow “Quick settings” presets (more information on them is here) to keep high image quality together with the highest compression level.

You can find detailed information about all the features in TaxWorkFlow’s online help or by clicking the Help button inside the application.