C#: OCR (Optical Character Recognition)



For the past few weeks we've been looking for a suitable OCR solution to integrate into our document management system.

One option we came across involves MODI (Microsoft Office Document Imaging) - a tool that ships with Microsoft Office 2003 and 2007 (it is not available in Microsoft Office 2010).

Simply add a reference to the MODI type library (COM interop) and convert images to text like this:
 
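using System;
using MODI;

class Program
{
    static void Main(string[] args)
    {
        // Load the scanned image into a MODI document.
        DocumentClass doc = new DocumentClass();
        doc.Create(@"some.tiff");

        // Run OCR in English; the two flags ask MODI to auto-orient
        // and straighten (deskew) the page before recognition.
        doc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

        // Each page of the document is exposed as an Image.
        foreach (Image image in doc.Images)
        {
            Console.WriteLine(image.Layout.Text);
        }

        // Close without saving so the file is released.
        doc.Close(false);
    }
}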

It's quite a powerful OCR engine, but the engine behind MODI isn't Microsoft's own: it is licensed from ScanSoft Inc. (now Nuance).

There is one part I do find a bit dodgy, though: we found quite a few rather expensive OCR tools out there (from $600) that integrate with MODI, which obviously requires Microsoft Office.

I almost feel that those applications belong in the freeware realm, since you have already bought a license to the core OCR functionality (via MS Office) and most of the non-OCR functionality (the part you would be paying for) seems rather mediocre.

My personal opinion though... ;)







Comments



Isn't this exactly what the example code does?


Can anyone tell me how I can do this? How should I start the code in C# for image-to-text conversion?




I have a similar sample here: http://zamirsblog.blogspot.com/2010/12/ocr-using-ms-office.html


OCR for French Script MT converted successfully

Hi friends, I am converting scanned TIFF images set in French Script MT successfully, with 99% accuracy. For more detail contact: dirtymind635@yahoo.com or: 08290729527


Professional Services

Companies like EMC (Captiva or InputAccel) can convert paper into digital images. I know of one company, alicka inc, that has a professional services team to configure a high-volume scanning/OCR/data capture system at rates much cheaper than EMC's PSG. Their direct link: http://www.alicka.com/professionalservices.html


OCR API

Well, coding for all fonts and languages is not easy. I think using the OCR Cloud 2.0 platform is a good idea. It can convert virtually any image (TIF, JPG, PNG, BMP) or PDF to any standard text-based document type (TXT, DOC, RTF, XLS, PPT, XML, HTML) or searchable PDF. It also has auto-language detection and support for over 200 languages, including Latin-based languages, Cyrillic-based languages, Chinese, Japanese, Korean, Thai, and Hebrew. For a free developer account, sign up here: http://www.ocr-it.com/ocr-cloud-2-0-api


OCR for French Script MT

This is not useful if the text in the scanned image is set in the French Script MT font. If anyone has a solution, please reply as soon as possible.


http://www.codeproject.com/KB/office/modi.aspx

