Has anyone used OCR software to batch extract invoice data during due diligence?


June 27, 2025

by a searcher from INSEAD in San Francisco, CA, USA

We’re midway through diligence on a deal and discovered that the seller’s homegrown portal stores ~20,000 invoices as PDFs with no built-in reporting. We may need to extract key fields such as invoice amount, vendor, and date to support the QoE analysis. Has anyone gone through something similar and used OCR tools (e.g., Amazon Textract) to automate this? My initial thought is that we can just pull a representative sample, but I’m trying to assess feasibility in case full extraction ends up being needed. Thanks in advance!
Replies
Reply by a professional
in Lucknow, Uttar Pradesh, India
You can OCR the suckers. But it’s not push-a-button magic. Textract or Tesseract works fine if the invoices are clean and follow a pattern. If they’re scanned cattywampus with coffee stains and 17 layouts? Gonna need post-processing help. We ran a sample through Textract + some regex wrangling and got ~85% accuracy. Good enough for directional QoE, not audit-level. Honestly? If it’s just for QoE and you don’t need every invoice, pull 500 at random, batch process, sanity check 50 of ‘em manually. If it lines up, you’re probably good. If it’s spaghetti… either hire a data wrangler or go old school and pull insights with a French press and some interns. Just depends how far you wanna dig for gold vs prove it’s not landmines.
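A minimal sketch of the workflow described in this reply: draw a random sample, run regex post-processing over the OCR text, then score the extraction against a manual check. It assumes the OCR step (Textract, Tesseract, or similar) has already produced plain text per invoice. The regex patterns, function names, and the sample invoice layout are all hypothetical and would need tuning to the seller's actual layouts.

```python
import random
import re
from datetime import datetime

# Hypothetical patterns for one invoice layout; each real layout needs its own tuning.
AMOUNT_RE = re.compile(r"(?:total|amount due)\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"(?:invoice\s+)?date\s*:?\s*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE)
VENDOR_RE = re.compile(r"vendor\s*:?\s*(.+)", re.IGNORECASE)


def parse_date(raw):
    """Return a date for MM/DD/YYYY text, or None if OCR produced an invalid date."""
    try:
        return datetime.strptime(raw, "%m/%d/%Y").date()
    except ValueError:
        return None


def extract_fields(ocr_text):
    """Pull amount, vendor, and date from OCR text; unmatched fields are None."""
    amount = AMOUNT_RE.search(ocr_text)
    date = DATE_RE.search(ocr_text)
    vendor = VENDOR_RE.search(ocr_text)
    return {
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
        "date": parse_date(date.group(1)) if date else None,
        "vendor": vendor.group(1).strip() if vendor else None,
    }


def draw_sample(invoice_ids, n, seed=0):
    """Reproducible simple random sample of invoice IDs (without replacement)."""
    ids = list(invoice_ids)
    return random.Random(seed).sample(ids, min(n, len(ids)))


def field_accuracy(extracted, manual):
    """Share of manually verified fields where the extracted value matches exactly."""
    total = 0
    correct = 0
    for invoice_id, truth in manual.items():
        for field, value in truth.items():
            total += 1
            if extracted.get(invoice_id, {}).get(field) == value:
                correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    sample_ids = draw_sample(range(20000), 500)   # the 500 to batch process
    manual_ids = draw_sample(sample_ids, 50)      # the 50 to check by hand
    print(len(sample_ids), len(manual_ids))
    print(extract_fields("Vendor: Acme Supply Co\nInvoice Date: 06/27/2025\nTotal: $1,234.50"))
```

Fixing the seed makes the sample reproducible, which helps if the sampling method has to be explained later. Exact-match scoring is deliberately strict; vendor names in particular may warrant normalization before comparison.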
Reply by a searcher
from University of Virginia in Los Angeles, CA, USA
I worked in a law firm where we tried to create automations for extracting data from PDF files. It's a common challenge for law firms, and we hoped that with the advent of AI there would be new solutions. A couple of document management systems such as Clio and NetDocuments were working in this direction and might have achieved some progress by now. But if it's a one-time effort, I seriously suggest that you outsource it to a low-cost clerical team who will manually go through the information. We tried everything under the sun, including AWS. The accuracy is just not good enough. We were able to get outsourced help for a few cents per field. The cost of customizing a smart OCR solution to your needs will definitely be more than a one-time manual effort.