Friday, 3 April 2009

Improved layout

If you've tried VelOCRaptor, you'll have found that it really didn't do a great job of lining its text up over the right bit of the image. This is because I wrote each line starting where it should be, but I really don't know what the font size is, so if I have it wrong, the characters get progressively more out of sync along the line.

I've improved this quite a bit by printing word by word rather than line by line. It makes the selection look a little wonky at times, but should improve your ability to select text, especially in multi-column layouts.
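To see why this helps, here is a minimal sketch in Python. It is not VelOCRaptor's actual code: the names `ADVANCE_RATIO`, `true_layout`, `line_positions` and `word_positions` are made up for illustration, and it assumes a crude model in which every character advances by a fixed fraction of the font size. It simulates a scanned line set in 12pt, then compares where the words land if we guess 11pt for the whole line against placing each word at its own recognised position.

```python
# Assumed average glyph advance as a fraction of the font size (hypothetical).
ADVANCE_RATIO = 0.5


def true_layout(texts, line_x, font_size):
    """Simulate where each word really starts on the scanned page."""
    words = []
    x = line_x
    for text in texts:
        words.append((text, x))
        # +1 accounts for the space that follows each word.
        x += (len(text) + 1) * font_size * ADVANCE_RATIO
    return words


def line_positions(words, line_x, guessed_font_size):
    """Line by line: only the line start is known, the rest follows the guess."""
    positions = []
    x = line_x
    for text, _true_x in words:
        positions.append(x)
        x += (len(text) + 1) * guessed_font_size * ADVANCE_RATIO
    return positions


def word_positions(words):
    """Word by word: each word is written at its own recognised position."""
    return [true_x for _text, true_x in words]


def drift(placed, words):
    """Horizontal distance between where a word was written and where it is."""
    return [abs(p - true_x) for p, (_text, true_x) in zip(placed, words)]


# The page is really set in 12pt, but we guess 11pt.
words = true_layout(["Improved", "layout", "for", "scanned", "pages"], 72.0, 12.0)
line_drift = drift(line_positions(words, 72.0, 11.0), words)
word_drift = drift(word_positions(words), words)

print(line_drift)  # grows along the line
print(word_drift)  # stays at zero for every word start
```

In this model a wrong font size still makes each word slightly too wide or too narrow, which is probably where the wonky selection comes from, but the error is reset at the start of every word instead of piling up across the line.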

2 comments:

  1. The simplicity of just dropping in the graphic file is great. However, the accuracy definitely could use some improvement. In testing, I continue to get better accuracy using Acrobat's built-in OCR engine than with VelOCRaptor. This is especially true when performing OCR on serif fonts. I will continue to look in from time to time on this project's progress. Good luck.

  2. Thanks for the feedback. We're trusting that the accuracy of OCRopus, the underlying OCR engine, will improve with time and training.

    Duncan
