How We Built AI Invoice Parsing for PDFs, Scans, and Photos

When people talk about invoice OCR, they usually focus on text recognition. In reality, that is only part of the job.

For FakturAi, the harder problem was taking messy invoice input (PDFs, scans, phone photos) and turning it into structured accounting data that a business can actually use. Reading the text was one step. Figuring out what each piece of text meant was the real challenge.

Why OCR Alone Was Not Enough

OCR engines are good at extracting text from documents. What they do not solve well is document structure.

Invoices are inconsistent by nature. The invoice number might be in the top-right corner on one file, near the footer on another, and hidden behind a local label on a third. The same applies to dates, VAT values, totals, and supplier information.

We started with a rule-based approach: regex patterns, keyword matching, and a set of heuristics. It worked on cleaner invoices, but the moment layouts became less predictable, accuracy dropped fast. The issue was not a lack of text. The issue was mapping raw text to the correct business fields.
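
To make the failure mode concrete, here is a minimal sketch of a label-based rule. The pattern and the sample strings are hypothetical illustrations, not the rules used in FakturAi. The same invoice number is found under an English label and missed under a Slovak one.

```python
import re
from typing import Optional

# Hypothetical rule: the invoice number follows an English label.
INVOICE_NO = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9][A-Z0-9/-]*)",
    re.IGNORECASE,
)

def find_invoice_number(text: str) -> Optional[str]:
    """Return the first value that follows a known label, or None."""
    match = INVOICE_NO.search(text)
    return match.group(1) if match else None

clean = "Invoice No: 2024-0017\nDate: 12.03.2024"
local = "Faktúra č. 2024-0017\nDátum: 12.03.2024"  # Slovak label, same data

print(find_invoice_number(clean))  # label matches
print(find_invoice_number(local))  # label is not in the rule, so nothing is found
```

Every new label, language, or layout needs another pattern, and the patterns start to conflict with each other long before they cover real-world input.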

The Pipeline We Settled On

The version that worked best was a layered pipeline:

1. Upload a PDF, photo, or scanned invoice
2. Preprocess the file for better OCR quality
3. Extract raw text through OCR
4. Send that text to an LLM with a strict extraction prompt
5. Validate the output against business rules
6. Let the user review before saving

That flow was much more reliable than trying to stretch OCR and regex rules to cover every possible invoice format.
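
The layering can be sketched as a single function that wires the stages together. This is an illustration under assumptions, not the production code: the names `parse_invoice` and `ParseResult` are made up, and each stage is passed in as a plain callable so the OCR provider or the model can be swapped without touching the flow.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ParseResult:
    fields: dict
    issues: List[str] = field(default_factory=list)
    needs_review: bool = True  # the user always confirms before saving

def parse_invoice(
    file_bytes: bytes,
    preprocess: Callable[[bytes], bytes],
    run_ocr: Callable[[bytes], str],
    extract_fields: Callable[[str], dict],
    validate: Callable[[dict], List[str]],
) -> ParseResult:
    """Run the stages in order; each one only sees the previous stage's output."""
    image = preprocess(file_bytes)    # cleanup for better OCR quality
    text = run_ocr(image)             # raw text, no structure yet
    fields = extract_fields(text)     # LLM step: text to structured fields
    issues = validate(fields)         # business-rule checks
    return ParseResult(fields=fields, issues=issues)

# Usage with stub stages standing in for the real services.
result = parse_invoice(
    b"raw upload",
    preprocess=lambda data: data,
    run_ocr=lambda image: "Invoice No: 2024-0017 Total: 120.00",
    extract_fields=lambda text: {"invoice_number": "2024-0017", "total": "120.00"},
    validate=lambda fields: [],
)
print(result.fields, result.issues, result.needs_review)
```

Keeping the stages as separate callables also makes each one testable on its own, which matters when a bad result has to be traced back to the image, the OCR text, or the extraction.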

Preprocessing Helped More Than Expected

This part is not very exciting, but it matters.

For PDFs, we often converted pages into images first so the OCR step saw a more consistent input. For photos and scans, we applied deskewing, contrast adjustments, and basic image cleanup. Those small steps reduced quite a few downstream errors.

A bad image can make the rest of the system look worse than it really is, so it was worth fixing that early.
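
As one example of what a contrast adjustment does, here is a small percentile-based contrast stretch written in plain Python over a grid of grayscale values. It is a sketch for illustration only: a real pipeline would normally use an image library for this, and the function name and default percentiles are assumptions.

```python
def stretch_contrast(pixels, low_pct=0.02, high_pct=0.98):
    """Linearly rescale grayscale values (0-255) so that the chosen
    percentiles map to 0 and 255. Values outside that range are clipped."""
    flat = sorted(v for row in pixels for v in row)
    if not flat:
        return []
    lo = flat[int(low_pct * (len(flat) - 1))]
    hi = flat[int(high_pct * (len(flat) - 1))]
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in pixels]
    scale = 255.0 / (hi - lo)
    return [
        [min(255, max(0, round((v - lo) * scale))) for v in row]
        for row in pixels
    ]

# A washed-out scan: all values sit in a narrow gray band.
faded = [[100, 110], [120, 130]]
print(stretch_contrast(faded, low_pct=0.0, high_pct=1.0))
```

After stretching, faint text and a gray background are pushed further apart, which is the kind of input OCR engines handle better.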

OCR + LLM Was Better Than OCR + More Rules

We tested different OCR providers and found Google Vision to be one of the more reliable options for the types of invoices we were processing.

Once we had the raw text, we stopped adding more parsing rules and moved the extraction step to an LLM. That changed the problem in a useful way. Instead of asking "how do we write enough rules for every layout?" we asked "how do we describe the target output clearly enough that the model returns predictable JSON?"

That was a much better direction: an open-ended rule-writing problem became a bounded specification problem.

What Made the Prompt Stable

The prompt had to be strict.

The more generic the prompt, the more generic the output. The version that worked best included:

the exact fields to extract
expected formats for dates and numbers
clear instructions for ambiguous cases
fixed JSON output
rules for missing or uncertain values

Once we did that properly, the extraction became much more stable.
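
A sketch of what such a prompt and its matching parser could look like follows. The field list, formats, and wording are hypothetical (the real schema is not shown in this article), and the model call itself is left out so that no particular provider API is implied.

```python
import json

# Hypothetical target schema: field name -> expected format.
FIELDS = {
    "invoice_number": "string",
    "issue_date": "YYYY-MM-DD",
    "due_date": "YYYY-MM-DD",
    "subtotal": "decimal number with a dot separator, e.g. 100.00",
    "vat_amount": "decimal number with a dot separator",
    "total": "decimal number with a dot separator",
    "currency": "ISO 4217 code, e.g. EUR",
}

def build_prompt(ocr_text: str) -> str:
    """Describe the exact fields, formats, and fallback rules."""
    field_lines = "\n".join(f'- "{name}": {fmt}' for name, fmt in FIELDS.items())
    return (
        "Extract the following fields from the invoice text.\n"
        f"{field_lines}\n"
        "Rules:\n"
        "- Return a single JSON object with exactly these keys and nothing else.\n"
        "- If a value is missing or uncertain, use null. Do not guess.\n"
        "- If several candidates exist for a field, prefer the one next to its label.\n"
        "Invoice text:\n"
        f"{ocr_text}"
    )

def parse_response(raw: str) -> dict:
    """Accept only a JSON object with known keys; fill absent keys with None."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unexpected = set(data) - set(FIELDS)
    if unexpected:
        raise ValueError(f"unexpected keys: {sorted(unexpected)}")
    return {name: data.get(name) for name in FIELDS}

print(build_prompt("Invoice No: 2024-0017 Total: 120.00"))
print(parse_response('{"invoice_number": "2024-0017", "total": "120.00"}'))
```

Rejecting unknown keys and filling absent ones with None means the rest of the system always sees the same shape, whatever the model returned.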

Multi-Language Invoices

Because FakturAi targets businesses in Central Europe, invoices came in Slovak, Czech, German, English, Hungarian, and Ukrainian.

The language itself was not always the hardest part. In many cases, formatting differences caused more trouble than vocabulary. Date formats, decimal separators, VAT labels, and currency placement varied a lot.

So instead of treating it mainly as a translation problem, we normalized values into one internal format and added language hints only where they actually helped.
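
Normalization of this kind can be sketched with the standard library alone. The list of date formats and the separator heuristic below are assumptions chosen for illustration: dates are read day-first, as is common in the region, and the last separator in an amount is treated as the decimal mark.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Assumed input formats, tried in order. Day-first for the ambiguous ones.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y", "%d-%m-%Y")

def normalize_date(raw: str) -> date:
    """Parse a date written in any of the known formats."""
    cleaned = raw.strip().replace(" ", "")  # "12. 03. 2024" -> "12.03.2024"
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def normalize_amount(raw: str) -> Decimal:
    """Parse an amount regardless of currency symbol or separator style.

    Limitation: a lone thousands separator ("1,234") is read as a decimal mark.
    """
    # Keep digits, separators, and a minus sign; drop symbols and spaces.
    cleaned = re.sub(r"[^\d,.\-]", "", raw)
    if cleaned.rfind(",") > cleaned.rfind("."):
        # Comma is the decimal mark: "1.234,56" or "1 234,56".
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Dot is the decimal mark, or there is none: "1,234.56".
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)

print(normalize_date("12. 03. 2024"), normalize_date("2024-03-12"))
print(normalize_amount("1 234,56 €"), normalize_amount("EUR 1,234.56"))
```

Once every date is a `date` and every amount is a `Decimal`, the later validation rules no longer need to know which language or locale the invoice came from.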

Validation Was Essential

Even when the extracted result looked correct, we still needed validation.

We added checks such as:

subtotal plus VAT should match the total
due date should not be earlier than issue date
quantity times unit price should roughly match the line total
VAT IDs should be checked where relevant

This step mattered because users do not just need automation. They need enough confidence to trust the output without manually rechecking everything from scratch.
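
The checks listed above can be sketched as one function that returns a list of human-readable issues for the review screen. The field names, the rounding tolerance, and the VAT ID pattern are assumptions made for this example; in particular, the VAT ID check only tests the general shape (country prefix plus 2 to 12 characters) and is not a registry lookup.

```python
import re
from datetime import date
from decimal import Decimal

TOLERANCE = Decimal("0.02")  # assumed allowance for rounding differences
VAT_ID_SHAPE = re.compile(r"[A-Z]{2}[A-Z0-9]{2,12}")  # loose shape check only

def validate_invoice(inv: dict) -> list:
    """Return a list of issues; an empty list means all checks passed."""
    issues = []
    if abs(inv["subtotal"] + inv["vat_amount"] - inv["total"]) > TOLERANCE:
        issues.append("subtotal + VAT does not match total")
    if inv["due_date"] < inv["issue_date"]:
        issues.append("due date is earlier than issue date")
    for i, line in enumerate(inv.get("lines", []), start=1):
        expected = line["quantity"] * line["unit_price"]
        if abs(expected - line["line_total"]) > TOLERANCE:
            issues.append(f"line {i}: quantity x unit price does not match line total")
    vat_id = inv.get("supplier_vat_id")
    if vat_id is not None and not VAT_ID_SHAPE.fullmatch(vat_id):
        issues.append("supplier VAT ID has an unexpected format")
    return issues

invoice = {
    "subtotal": Decimal("100.00"),
    "vat_amount": Decimal("20.00"),
    "total": Decimal("125.00"),          # does not add up
    "issue_date": date(2024, 3, 12),
    "due_date": date(2024, 3, 1),        # earlier than the issue date
    "supplier_vat_id": "SK2020123456",
    "lines": [
        {"quantity": Decimal("2"), "unit_price": Decimal("50.00"),
         "line_total": Decimal("100.00")},
    ],
}
for issue in validate_invoice(invoice):
    print(issue)
```

Returning issues instead of raising on the first failure lets the review step show everything that looks wrong at once, so the user can fix the invoice in a single pass.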

Pay by Square Integration

One Slovakia-specific feature that turned out to be genuinely useful was Pay by Square.

For invoice recipients, scanning a QR code in a banking app is much easier than manually entering payment details. So we added Pay by Square generation directly into the invoice flow.

It was not the most technically complex part of the system, but it delivered obvious value very quickly. Those features are often worth more than the clever ones.

What We Learned

A few lessons were pretty clear after building this:

First, extracting fewer fields well is better than extracting everything badly.
Users mostly care about the fields they actually use.

Second, OCR is only one layer of the problem.
The bigger challenge is turning unstructured documents into reliable business data.

Third, AI extraction needs guardrails.
Without validation and review, it is too easy to let bad data pass through.

And finally, real invoices are always messier than sample files. Designing for that early saves a lot of pain later.