C#: OCR (Optical Character Recognition)
Over the past few weeks we've been looking for a suitable OCR solution to integrate into our document management system.
One option we came across involves MODI (Microsoft Office Document Imaging), a tool that ships with Microsoft Office 2003 and 2007 (it is not available in Microsoft Office 2010).
Simply add a reference to the MODI type library (COM interop) to your project and convert images to text like this:
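using MODI;
using System;

class Program
{
    static void Main(string[] args)
    {
        // Load the scanned image into a MODI document.
        DocumentClass doc = new DocumentClass();
        doc.Create(@"some.tiff");

        // Recognize English text; the two flags ask MODI to auto-orient
        // and straighten each page before recognition.
        doc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

        // Every page is exposed as an Image carrying its recognized text.
        foreach (Image image in doc.Images)
        {
            Console.WriteLine(image.Layout.Text);
        }

        // Release the file without saving the OCR results back into it.
        doc.Close(false);
    }
}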
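The snippet above assumes that everything goes well. As a rough sketch only (it assumes the same MODI COM reference, and the class name `SafeOcr` and the command-line path are made up for illustration), the same calls could be wrapped so that the file is always released and a failed recognition doesn't crash the program. COM interop typically reports such failures as a `COMException`:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around the calls shown above; not part of MODI itself.
class SafeOcr
{
    static void Main(string[] args)
    {
        // Assumed input: a path on the command line, else the sample file name.
        string path = args.Length > 0 ? args[0] : "some.tiff";

        MODI.Document doc = new MODI.Document();
        bool opened = false;
        try
        {
            doc.Create(path);
            opened = true;

            doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

            foreach (MODI.Image image in doc.Images)
            {
                Console.WriteLine(image.Layout.Text);
            }
        }
        catch (COMException ex)
        {
            // Failures inside MODI (for example an unreadable file) arrive as COM errors.
            Console.Error.WriteLine("OCR failed: " + ex.Message);
        }
        finally
        {
            // Only close a document that was actually opened, then drop the COM reference.
            if (opened)
            {
                doc.Close(false);
            }
            Marshal.ReleaseComObject(doc);
        }
    }
}
```

Whether you need the explicit `ReleaseComObject` call depends on how long your process lives; in a long-running document management service it seems like the safer choice, since the file otherwise stays locked until the garbage collector gets around to it.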
It's quite a powerful OCR engine, but the engine behind MODI isn't Microsoft's own: it is licensed from ScanSoft Inc., now Nuance.
There is one part I do find a bit dodgy, though: we found quite a few rather expensive OCR tools out there (from $600) that integrate with MODI, which obviously requires Microsoft Office.
I almost feel that those applications belong in the freeware realm, since you have already bought a license to the core OCR functionality (via MS Office), and most of the non-OCR functionality (the part you would be paying for) seems rather mediocre.
My personal opinion though... ;)
Posted by - Christoff Truter
Date - 2010-05-03 18:24:05