Scanning and OCR-converting old reports to searchable PDFs

mpainesyd · March 5, 2023, 10:12pm

I have several technical reports from the 1990s (and earlier!) that I would like to scan and OCR-convert into searchable PDFs.
I have often used an HP Officejet Pro 8020 printer and macOS Image Capture to scan multiple pages into a single PDF file, but the resulting file is very large and is not searchable.
I have been looking at reviews of Mac software and none stands out as ideal for this purpose. Wondering if Tidbits subscribers have any tips?

dsh1705 · March 6, 2023, 12:43am

DEVONthink might be a great solution. The OCR engine is powerful, and the ability to search through PDFs (and so much more) is (IMHO) second to none.

dennishenley · March 6, 2023, 1:13am

ABBYY FineReader will take a PDF and run OCR on it with very good results.

PDF Expert has an OCR engine built in which works fine.

And, as dsh1705 mentioned, DEVONthink uses the FineReader engine for its OCR. I don’t know if you have to get the Pro version, though, to access it.

gingerbeardman · March 6, 2023, 5:11pm

I’ve tried them all and Acrobat gives the best results.

Acrobat is single-threaded, so it doesn’t use your whole CPU and could be considered slow, but it works very well.

When I do multi-hundred-page books I simply put Acrobat in its own Space on macOS and leave it to it.

A free option is to upload your scanned images as a cbz (zip with renamed extension) or pdf to the Internet Archive, where they will be OCR’d. You can keep them hidden whilst it happens, and then download and delete them when you’re done.

Acrobat can also optimise the size of a PDF, rotate/straighten all pages and more. I dislike Adobe’s policies, and Acrobat really needs updating for modern times, but it does a great job.
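
The cbz mentioned above is just a zip archive with a different extension, so it can be built with a few lines of Python. This is a minimal sketch, not a tested Internet Archive workflow: the function name `make_cbz`, the list of image extensions, and the assumption that the page files sort correctly by name (e.g. page_001.jpg, page_002.jpg) are all illustrative.

```python
import zipfile
from pathlib import Path


def make_cbz(image_dir, cbz_path):
    """Pack the scanned page images in image_dir into a .cbz (a plain zip archive).

    Returns the list of page file names in the order they were added.
    """
    image_dir = Path(image_dir)
    # Assumed set of scan formats; adjust to whatever your scanner produces.
    image_exts = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
    # Sort by file name so the pages keep their scan order.
    pages = sorted(
        p for p in image_dir.iterdir()
        if p.is_file() and p.suffix.lower() in image_exts
    )
    # ZIP_STORED: the images are usually compressed already, so just store them.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return [p.name for p in pages]


# Example (hypothetical paths):
# make_cbz("scans/report_1994", "report_1994.cbz")
```

Non-image files in the folder are skipped, and the archive holds the pages at the top level with no subfolders.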

Fahirsch · March 6, 2023, 5:48pm

I suggest VueScan for scanning. It can scan to pdf and ocr it, and continue adding pages to the pdf.
Also, if you have to scan a lot of documents, consider buying a standalone scanner. Much faster than a combination printer-scanner.

fischej · March 6, 2023, 6:40pm

My tool of choice for OCR of PDF files is OCRKit from ExactCode. Been part of my weekly bill-paying workflow for years. $50.

mpainesyd · March 6, 2023, 9:20pm

Thank you everyone. I might start with the trial version of Finereader PDF:
https://pdf.abbyy.com/pricing/

Denny_H · March 6, 2023, 9:26pm

Using the Files app, you can scan directly with an iPhone or iPad, and the resulting PDF OCR is excellent.

james.cutler · March 7, 2023, 2:21pm

For scanning documents (wills, deeds, letters, receipts, etc.) I use a Fujitsu iX1600 and ScanSnap Manager which can automatically OCR pdf output. For the mountain of paperwork involved in Estate Management, this has saved many hours and compressed the aforesaid mountain of paper to the size of a 2.5" SATA drive. It also does a modestly competent job of scanning photos for record purposes.

dianed143 · March 7, 2023, 2:45pm

I’ve been using my Oki multifunction laser and DevonThink Pro. I am pretty sure it’s the Pro component that does the OCR.

My scanner does double sided and it’s allowed me to get bags of paper out of my office. If it dies I will be getting something similar to replace it.

Devonthink lets me name and tag the documents, but I only name them. The OCR is good enough that they come up in searches no matter what. I’m really happy with it. Integration with my scanner was automatic.

I use Devonthink for other cataloging as well, the scanning was just a bonus.

Diane

harriska2 · March 8, 2023, 3:47pm

I use a PC and an old version of ABBYY. I scan hundreds of thousands of pages. I’ve found ABBYY and VueScan are limited and buggy on Mac, especially with my Canon beast of a scanner that is rated at something like 100k pages a month.

levanah · March 8, 2023, 5:50pm

Me too; have done exactly this for years w/ CS6 (and earlier). Keeping my old 2013 iMac on High Sierra still running for exactly this purpose.

osric · March 8, 2023, 10:49pm

I’ve been using Evernote since it first came out, and had been a huge fan of their branded ScanSnap until they discontinued it and the software integration went wonky. For on-demand conversion, I was using Smile’s PDF Pen Pro, but it was acquired recently by Nitro and the OCR functionality just doesn’t work reliably any more on my Mac Studio (it just hangs the app). Interested in seeing the recommendations in this thread since my scanning workflow is pretty much busted at this point.

jimthing · March 8, 2023, 11:37pm

I concur with those saying to get a ScanSnap doc scanner (the newer iX1600 or older iX1500 are recommended for speed) if using regularly. They’re worth their weight in gold if time efficiency is important to you.

And even if you only need them to do one lot of scanning (e.g. you’re digitising all your paperwork in a batch over some weeks/months then won’t need to do as much afterwards), you can typically sell them on afterwards for a good part of what you paid on eBay or wherever, as they’re in demand as tools, with many users wanting additional machines for second office/home use.

OCR is included, or you can set the output to automatically open the doc in the PDF app you prefer, to OCR it there. Personally, I have it open the PDF it creates in Nitro Pro (formerly PDFpen Pro) which I have from my Setapp subscription – unlike @osric’s report, it works great for me.

gastropod · March 10, 2023, 12:12am

Four tentacles up for the Scansnap (USB 3), and for Devonthink Pro OCR (I have version 2).

I let both the scansnap and devon do ocr. scansnap is much faster, but devon is better at some things such as telling which way is up. I don’t have serious disk space limitations, so I just let both sets stay around. Sometimes it’s useful to use quicklook or gallery view in the scansnap folder because I rarely bother to rename the files–life is short and search is usually pretty good.

A good microcut shredder is the third component. Once shredded, all that paper makes nice mulch. If it’s going to be breezy though, soak it thoroughly…

mpainesyd · March 13, 2023, 3:54am

I have ended up using ABBYY FineReader PDF. It worked seamlessly and I now have a searchable 76 page report, including several coloured photos.

My only sticking point was that it won’t run under Mojave. I had to do the processing on a MacBook Air M2 (which was likely much faster anyway!).

Thanks for the tips everyone

jsrnephdoc · March 19, 2023, 5:00pm

Me, too, but lately I’m finding that some documents I converted from paper to pdf using ScanSnap inside Evernote have NOT been “OCRed.”

Just yesterday, I had occasion to review my and my spouse’s joint living trust. VERY readable on screen, but NOT searchable. I know of no way to go back and correct that other than wasting paper to print a complete copy and trying to get the OCR process to work on the re-import into Evernote. Any idea what I might have done wrong?

Thanks so much,
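
Earlier replies note that tools such as ABBYY FineReader and PDF Expert will run OCR on an existing PDF, so reprinting should not be necessary. The same idea can be sketched from the command line, assuming the open-source OCRmyPDF tool (not mentioned in this thread) is installed and the PDF has first been exported from Evernote to a file. The function names and file names below are hypothetical; `--skip-text` is OCRmyPDF's option for leaving pages that already contain text untouched.

```python
import shutil
import subprocess


def build_ocr_command(src_pdf, dst_pdf):
    """Build the OCRmyPDF command line for adding a text layer to a scanned PDF."""
    # --skip-text: pages that already have text are passed through without OCR.
    return ["ocrmypdf", "--skip-text", str(src_pdf), str(dst_pdf)]


def add_text_layer(src_pdf, dst_pdf):
    """Run OCRmyPDF on src_pdf and write a searchable copy to dst_pdf."""
    # Assumption: the ocrmypdf executable is installed and on PATH.
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(src_pdf, dst_pdf), check=True)


# Example (hypothetical file names):
# add_text_layer("living_trust.pdf", "living_trust_searchable.pdf")
```

The original file is left as it is; the searchable version is written as a separate copy that can then be re-imported.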

mpainesyd · March 19, 2023, 8:23pm

Some extra processing with the scanned report:
Before FineReader: 39 MB (scanned image PDF file)
After FineReader: 19 MB (text PDF file with images)
After PDFShrink: 18 MB (“print” resolution)
After PDFShrink: 5 MB (“web” resolution)

Did I miss a setting in Finereader that shrinks the PDF after the OCR process?
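
For a sense of scale, the figures above can be expressed as percentage reductions relative to the original scan. This is a minimal Python sketch using only the sizes quoted in this post; reading the sizes as megabytes is an assumption, and the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(before_mb, after_mb):
    """Return how much smaller after_mb is than before_mb, as a percentage."""
    return round(100 * (before_mb - after_mb) / before_mb, 1)


# Sizes as quoted in the post (assumed to be megabytes).
scan_mb = 39
results_mb = {
    "FineReader (text PDF with images)": 19,
    "PDFShrink, print resolution": 18,
    "PDFShrink, web resolution": 5,
}

for label, size_mb in results_mb.items():
    saved = percent_reduction(scan_mb, size_mb)
    print(f"{label}: {size_mb} MB, {saved}% smaller than the original scan")
```

On these numbers the OCR step alone roughly halves the file, the "print" setting adds little on top of that, and most of the remaining saving comes from the "web" resolution setting.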

rbononno · March 19, 2023, 9:32pm

I used ABBYY for years (there are/were two versions) but had to discontinue use when I upgraded to Catalina because it’s a 32-bit application. The company doesn’t do much to promote it, either.

rbononno · March 19, 2023, 9:37pm

Strange. I ran ABBYY on Mojave for years. But it won’t run on Catalina or higher because it’s a 32-bit application. I’m confused by your comment.