The Dark Side of AI-Generated Text: When OCR Models Fall Short
Promotional claims about new OCR models don’t always match real-world performance, according to recent tests. "I’m typically a pretty big fan of the Mistral models, but the new OCR-specific one they released last week really performed poorly," Willis noted.
The Flaws of Mistral OCR
"A colleague sent this PDF and asked if I could help him parse the table it contained," says Willis. "It’s an old document with a table that has some complex layout elements. The new [Mistral] OCR-specific model really performed poorly, repeating the names of cities and botching a lot of the numbers."
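Failures like repeated city names and mangled numbers can often be caught mechanically before anyone relies on the output. The sketch below is a hypothetical sanity check, not part of Mistral OCR or any other product. It assumes the extracted table arrives as a list of rows of cell strings, with a place name in the first column and numbers everywhere else. It flags duplicated labels and cells that don’t parse as numbers. The helper name `check_table`, the number format, and the sample rows are all illustrative assumptions.

```python
import re
from collections import Counter

# Accepts plain numbers ("1204", "3.5") and comma-grouped ones ("1,204").
# This format is an assumption; adjust it to the table being checked.
NUMBER_RE = re.compile(r"^-?\d{1,3}(,\d{3})+(\.\d+)?$|^-?\d+(\.\d+)?$")


def check_table(rows, label_col=0):
    """Return warnings for common OCR table failures (hypothetical helper)."""
    warnings = []
    # Repeated row labels can indicate the model duplicated names.
    labels = [row[label_col].strip() for row in rows if row]
    for label, count in Counter(labels).items():
        if count > 1:
            warnings.append(f"label {label!r} appears {count} times")
    # Every non-label cell is expected to be numeric.
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            if j != label_col and not NUMBER_RE.match(cell.strip()):
                warnings.append(f"row {i}, column {j}: {cell!r} is not a number")
    return warnings


# Made-up sample output from an OCR step.
rows = [
    ["Springfield", "1,204"],
    ["Riverton", "38O"],       # letter O read in place of a zero
    ["Springfield", "1,204"],  # duplicated row
]
for warning in check_table(rows):
    print(warning)
```

Some tables legitimately repeat names, so these warnings are prompts for a person to look at the source document, not automatic corrections.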
AI app developer Alexander Doria also recently pointed out on X a flaw in how Mistral OCR handles handwriting, a weakness shared by many vision-language models (VLMs): "Unfortunately Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely."
The Leader in Document Processing: Google’s Gemini 2.0 Flash Pro Experimental
According to Willis, Google currently leads the field in AI models that can read documents: "Right now, for me the clear leader is Google’s Gemini 2.0 Flash Pro Experimental. It handled the PDF that Mistral did not with a tiny number of mistakes, and I’ve run multiple messy PDFs through it with success, including those with handwritten content."
The Key Advantages of Gemini 2.0 Flash Pro Experimental
Part of Gemini’s strength comes from its large context window, a kind of short-term memory that sets how much text the model can take in at once. Willis specifically cites this as a key advantage: "The size of its context window also helps, since I can upload large documents and work through them in parts." Combined with more robust handling of handwritten content, this apparently gives Google’s model a practical edge over competitors in real-world document-processing tasks, at least for now.
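Working through a large document in parts can be as simple as grouping consecutive pages into batches that fit a size budget and sending each batch as its own request. This is a minimal sketch under stated assumptions: each page comes with an estimated token count, and the budget is a made-up number. Real context limits are measured in tokens and vary by model. `chunk_pages` and `budget` are hypothetical names, not part of any Gemini API.

```python
def chunk_pages(page_tokens, budget):
    """Group consecutive pages into batches whose estimated size fits a budget.

    page_tokens: estimated token count for each page, in document order.
    budget: maximum estimated tokens per batch (hypothetical value).
    A single page larger than the budget is placed in a batch of its own.
    Returns a list of batches, each a list of 0-based page indices.
    """
    batches, current, used = [], [], 0
    for page, tokens in enumerate(page_tokens):
        # Close the current batch if this page would push it over budget.
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(page)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Five pages with made-up token estimates and a made-up budget.
print(chunk_pages([400, 400, 300, 900, 100], budget=1000))
# [[0, 1], [2], [3, 4]]
```

Greedy, in-order batching keeps neighboring pages together, but it can still split a table that spans a batch boundary. Overlapping batches by one page is one possible refinement.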
The Drawbacks of LLM-based OCR
Despite their promise, LLMs bring new problems to document processing. They can produce confabulations, also called hallucinations (plausible-sounding but incorrect information). They can accidentally follow instructions that appear in the document text, mistaking them for part of the user’s prompt. Or they can simply misinterpret the data.
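One practical defense against confabulated or misread text is to transcribe the same page twice and have a person review only the places where the two versions disagree. For example, you could use an LLM for one transcription and a conventional OCR engine for the other. The article doesn’t prescribe this approach. What follows is a hedged sketch using Python’s standard `difflib` module. The function name `disagreements` and the sample strings are hypothetical. Agreement doesn’t prove correctness, and a difference doesn’t reveal which transcription is wrong. It only narrows down where to look.

```python
import difflib


def disagreements(llm_text, baseline_text):
    """Return word-level spans where two transcriptions of one page differ.

    Each item is (kind, llm_words, baseline_words), where kind is one of
    difflib's opcode tags: "replace", "delete" (words only in the LLM
    output), or "insert" (words only in the baseline).
    """
    a, b = llm_text.split(), baseline_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


# Made-up example: the LLM output repeats a row that the baseline lacks.
llm = "Springfield 1,204 Springfield 1,204 Riverton 512"
ocr = "Springfield 1,204 Riverton 512"
print(disagreements(llm, ocr))
# [('delete', 'Springfield 1,204', '')]
```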
Conclusion
In conclusion, LLM-based OCR has the potential to transform the way we process documents, but it’s essential to be aware of its limitations and drawbacks. As the technology continues to evolve, accuracy and reliability should come first. In practice, that means checking extracted text against the source document rather than trusting it blindly.
FAQs
Q: What is LLM-based OCR?
A: LLM-based OCR uses large language models (LLMs), typically multimodal models that can read images, to perform Optical Character Recognition (OCR): extracting text from scanned documents, PDFs, and other images.
Q: What are the limitations of LLM-based OCR?
A: LLM-based OCR models can introduce confabulations or hallucinations, accidentally follow instructions embedded in the document text, or generally misinterpret the data.
Q: Which OCR model is currently the best?
A: In Willis’s recent tests, Google’s Gemini 2.0 Flash Pro Experimental led among AI models that can read documents. It handled messy PDFs, including some with handwritten content, with only a small number of mistakes.

