Subject: Re: OCR
From: "Ben Tilly" <btilly@gmail.com>
Date: Fri, 9 Jun 2006 18:32:31 -0700

On 6/9/06, Anderson, Kelly <KAnderson@dentrix.com> wrote:
> I don't want to get too involved with the Patent Troll thread... but
> there were some comments made about OCR that I couldn't leave alone...
> :-)
>
> Character recognition by computers was the topic of my master's thesis,
> so I have just a teeny bit of experience and LOTS of interest in the
> area, even if it's slightly outdated. I haven't looked closely at jocr,
> but all character recognition is done in the following basic steps:

Ah, so I've got to be careful what I say. :-)

> 1) Image acquisition. Garbage in, garbage out. A good image is paramount
> and probably the single most important aspect of OCR, although largely
> underappreciated.
> 2) Segmentation -- finding the characters to be recognized, including
> word segmentation, and possibly segmenting out "images", "columns" and
> other artifacts of that nature.
> 3) Feature computation on individual characters
> 4) Classification of characters -- Comparison of the feature sets of
> individual characters with known "training" sets of data.
> 5) Reassembling words - either raw, based on segmentation, or better
> based upon contextual information such as the language being scanned,
> etc.
> 6) (optional) Reassembling the document - that is, keeping the column,
> image, font information, etc.
>
> Wavelets can be used in Segmentation and Feature Extraction. They may be
> very good for segmentation, I haven't really studied that particular
> application in great depth. However, for Feature Extraction, I've found
> that it doesn't really matter what the feature set is, you can only
> recognize hand printed characters to about 85% accuracy and mechanically
> generated characters to around 98%-99.5% depending on how clean the
> image is and a few other things. It doesn't matter what the recognition
> features are. Without context, you cannot do better than this.

I saw figures like that 10 years ago.  I'm surprised that they're
still the same.
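
For a sense of what those per-character rates mean at the word level,
here is a back-of-the-envelope sketch. It assumes errors are independent
from one character to the next, which is my simplification and not
something claimed above; the function name and the 5-letter word length
are likewise just for illustration.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character in a word is right.

    Assumes character errors are independent (a simplification).
    """
    return char_accuracy ** word_length


# Per-character rates quoted above: hand printed, then mechanical text.
for rate in (0.85, 0.98, 0.995):
    print(f"{rate:.3f} per character -> "
          f"{word_accuracy(rate, 5):.3f} per 5-letter word")
```

Under that assumption even a 98% character rate leaves roughly one
5-letter word in ten with an error somewhere, which is why the
reassembly step gets so much mileage out of context.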

In another email I commented that wavelets matter more for voice
recognition than OCR.  I said that because I suspect that they are
better there at segmentation and feature extraction (particularly
feature extraction), and it is harder for me to imagine other
algorithms that would be good at turning rapidly varying air pressure
into phonemes.

> With context, machines actually have the potential to recognize writing
> at a HIGHER rate than humans. They have greater knowledge, for example,
> of what names are known and used at a particular time and place in
> history. People are just looking at the image perhaps without that
> context.

That is interesting.
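
A minimal sketch of how that kind of context could be applied, assuming
the classifier has already produced a raw string and that we have a list
of names known for the time and place. The lexicon, the function name
and the 0.75 cutoff are all hypothetical; the fuzzy matching comes from
Python's standard difflib module, not from any OCR package.

```python
import difflib

# Hypothetical lexicon of names known for the period being scanned.
KNOWN_NAMES = ["Anderson", "Tilly", "Kelly"]


def correct_with_context(raw: str, lexicon: list[str],
                         cutoff: float = 0.75) -> str:
    """Return the closest lexicon entry, or the raw string if none is close.

    The cutoff is an arbitrary illustrative threshold on difflib's
    similarity ratio (0.0 to 1.0).
    """
    matches = difflib.get_close_matches(raw, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


# "e" misread as "c" is a typical single-character confusion.
print(correct_with_context("Andcrson", KNOWN_NAMES))
```

A string with no close match in the lexicon is returned unchanged, so
the raw reading is never silently replaced by an unrelated name.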

It tangentially reminds me of an irony from research on using wavelets
to detect breast cancer.

It turns out that wavelets are about as good as humans in detecting
breast cancer.  However people and wavelets detect different things -
wavelets don't miss breast cancer in "easy" areas of the breast while
humans are better at correctly detecting it in "difficult" areas of
the breast (near the edges where there is muscle).  When you combine
the two, you get far better results than either.  BUT humans discount
their own mistakes that the software catches, and see the ones it makes
as being bad.  So after a little while humans stop cooperating with
the software. :-(

> My conclusion is that open source OCR has probably not suffered too
> greatly from the fact that wavelets are patent protected. Most of the
> valuable discriminatory abilities derived from the use of wavelets can
> be derived just as easily from other transforms. I only waffle a bit
> because I don't know exactly how valuable wavelet technology may be to
> Segmentation.

What are you doing ruining a good argument with facts? :-)

Cheers,
Ben