Tuesday, 24 March 2009

PPC and Improved Accuracy

It's been almost a week since my last confession, but we've been hard at work. Simon has added a nice little preview to the mini-mode interface, and compiled Tesseract and OCRopus for PPC, so we now have a Universal Binary!

Meanwhile I've been post-processing the text by replacing mis-spelt words. This uses the Mac spellchecker, so it should pick up your custom words. I'm not entirely convinced that it improves overall performance that much, but it does make the output text a lot more plausible.
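For the curious, the replacement pass can be sketched roughly like this. It is a minimal illustration rather than the app's actual code: `replace_misspellings` is a made-up name, and the toy corrections table stands in for the Mac spellchecker, which the real thing asks for suggestions.

```ruby
# Hypothetical helper: walks each word in the text and asks the supplied
# block for a replacement. The block returns nil to keep the word as-is.
def replace_misspellings(text)
  text.gsub(/[A-Za-z']+/) do |word|
    yield(word) || word
  end
end

# Toy corrections table (an assumption for this sketch); the real app
# consults the Mac spellchecker instead.
corrections = { 'teh' => 'the', 'recieve' => 'receive' }
puts replace_misspellings('Please recieve teh form.') { |w| corrections[w] }
```

Passing the spellchecker in as a block keeps the text-walking separate from whatever supplies the suggestions.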

Wednesday, 18 March 2009

PDF writing with Quartz

Up to now I've been writing PDFs using XSL-FO and Apache FOP. This was the path of least resistance, but it did mean shipping 11MB of FOP, and shelling out to Java to do the work.

It's been painful, but I've now replaced that code with native Mac Quartz code to write the PDF. So we should write PDFs a lot quicker (still dwarfed by the OCR time, mind), and our download is now only 3.3MB zipped.

Saturday, 14 March 2009

Reading from PDF files

You asked for it, you got it. We now read images from PDF files as well as JPEG, PNG, and TIFF. We are currently limited to rendering the first page and reading that, but I think that should cover the vital 90%.

As a bonus it has led to the removal of ImageMagick in favour of SIPS, which can read PDF all by itself, and is built into Leopard. So we've just lost 30MB!
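As a rough sketch of what the conversion step might look like from Ruby: the helper names and file paths below are invented for illustration, and the only thing borrowed from the real tool is the `sips -s format png INPUT --out OUTPUT` invocation.

```ruby
# Hypothetical helper: builds the argument list for converting a PDF
# to a PNG with the sips command-line tool.
def sips_command(pdf_path, png_path)
  ['sips', '-s', 'format', 'png', pdf_path, '--out', png_path]
end

# Hypothetical wrapper: runs the command, returning true on success.
# Passing the arguments as a list avoids shell quoting problems with
# file names that contain spaces.
def convert_pdf_to_png(pdf_path, png_path)
  system(*sips_command(pdf_path, png_path)) == true
end

# Usage on a Mac (the paths are made up):
#   convert_pdf_to_png('scan.pdf', '/tmp/scan.png')
```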

Friday, 13 March 2009

205 Downloads

According to my server logs we've had 205 downloads of VelOCRaptor (that weren't me checking it's OK).

Come on people - you've had a play, where's the feedback?

PDF reading

Due to the magic of SIPS I should be able to remove ImageMagick and support PDF reading in one fell swoop. I think I'll wait until I'm less tired before I commit though.

Thursday, 12 March 2009

PDF reading support

I've had other things to do today, but from the response so far it's clear that we need to add reading from PDF pretty quickly.

In the meantime Simon has at least set the app to reject dropped PDFs, so we won't be popping up nasty error dialogs.

Wednesday, 11 March 2009

MacInTouch

The good folks at MacInTouch gave us a mention - leading to 137 visits so far today.

Welcome MacInTouchers, be sure to let us know what you think.

Google activity

A full week after letting it know we exist, the Google machine has swung into action and found us. So I'm trying to cope with a whole few people trying VelOCRaptor.

Thanks to those who have downloaded the app and tried it out. Don't forget to vote for your itch to be scratched - at the moment I can see that we really do need to support reading PDFs, so I'm going to work on that next.

EDIT - Ah, looking at the logs, it's clear that MacInTouch, not Google, is driving the traffic.

New release

I'm just rsyncing a new release. This should run about twice as fast as the last, by dint of not OCRing twice ;-)

Accuracy Results - SA-tax.jpg - revised

Embarrassingly I built (and released) a version that invoked ocroscript twice, throwing away the first results.

So while the accuracy results are unchanged - the times should be quicker.


$ src/script/velocraptor.rb testdata/SA-tax.jpg out.txt NORMALIZE_PROCESSOR; src/test/spell.rb out.txt
I, [2009-03-11T17:55:29.694300 #15844] INFO -- : Converting testdata/SA-tax.jpg to out.txt
I, [2009-03-11T17:55:49.983578 #15844] INFO -- : Times: CPU 19.93 Elapsed 20.2892169952393
57 unknown from 504 words = 11.3095238095238%


$ src/script/velocraptor.rb testdata/SA-tax.jpg out.txt CONVERT_PROCESSOR; src/test/spell.rb out.txt
I, [2009-03-11T17:58:00.720594 #15854] INFO -- : Converting testdata/SA-tax.jpg to out.txt
I, [2009-03-11T17:58:19.683518 #15854] INFO -- : Times: CPU 18.79 Elapsed 18.9628579616547
54 unknown from 506 words = 10.6719367588933%

Accuracy Results - SA-tax.jpg

Plain image
I, [2009-03-11T12:49:55.422169 #12001] INFO -- : Converting testdata/SA-tax.jpg to plain.txt
I, [2009-03-11T12:50:30.199773 #12001] INFO -- : Times: CPU 32.14 Elapsed 34.7775390148163
45 unknown from 500 words = 9.0%


Normalized
I, [2009-03-11T13:11:25.775800 #12092] INFO -- : Converting testdata/SA-tax.jpg to normalized.txt
I, [2009-03-11T13:12:02.039675 #12092] INFO -- : Times: CPU 35.58 Elapsed 36.2637679576874
57 unknown from 504 words = 11.3095238095238%

Recognition Accuracy

I've been working up a way of judging the accuracy of OCR. My simplistic approach is to assume that if a word is spelled correctly, it was recognised correctly. So I make a set of the unique words in the output, remove those which are real words, and report the ratio of mis-spelt words to total words.
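Here is a minimal sketch of that measure. It is not the project's spell.rb: `unknown_word_ratio` is a made-up name, the dictionary is passed in as a plain word list rather than coming from a spellchecker, and I'm assuming "total words" means the count of unique words.

```ruby
require 'set'

# Hypothetical helper: reports how many unique words in the text are
# not found in the supplied dictionary of known-good lowercase words.
def unknown_word_ratio(text, dictionary)
  known = dictionary.to_set
  # Split into alphabetic words and de-duplicate, ignoring case.
  words = text.downcase.scan(/[a-z']+/).uniq
  return { unknown: 0, total: 0, percent: 0.0 } if words.empty?

  unknown = words.reject { |w| known.include?(w) }
  {
    unknown: unknown.size,
    total: words.size,
    percent: 100.0 * unknown.size / words.size
  }
end

result = unknown_word_ratio('The quick brwn fox jumps ovr the lazy dog',
                            %w[the quick brown fox jumps over lazy dog])
puts format('%d unknown from %d words = %.1f%%',
            result[:unknown], result[:total], result[:percent])
```

The obvious weakness is that a mis-read which happens to be a real word counts as correct, and a correctly read name or abbreviation counts as an error, so the figure is only a rough proxy.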

I'll report our results here soon.

Better Recognition

I've spent this afternoon working out how to distribute ImageMagick with VelOCRaptor so that we can pre-process the images to improve accuracy.

The latest version now uses histogram normalization to improve the image contrast prior to scanning. I'm now looking into the best way of measuring accuracy.
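To give a feel for what normalization does, here is a simplified linear contrast stretch over 8-bit grey values. It is only an illustration of the idea, not ImageMagick's implementation (which works on real image data and may clip outlying pixels first), and `stretch_contrast` is a made-up name.

```ruby
# Hypothetical helper: linearly rescales 8-bit grayscale values so the
# darkest pixel becomes 0 and the brightest becomes 255.
def stretch_contrast(pixels)
  lo, hi = pixels.minmax
  # An empty or completely flat image has no contrast to stretch.
  return pixels.dup if lo.nil? || lo == hi

  pixels.map { |p| ((p - lo) * 255.0 / (hi - lo)).round }
end

p stretch_contrast([50, 100, 150, 200])
```

A washed-out scan whose values sit between 50 and 200 ends up using the full 0 to 255 range, which should make the text stand out more clearly from the background.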

Tuesday, 10 March 2009

New Screencast

I've just uploaded a screencast showing the new mini-mode GUI. I'm trying YouTube this time, as it will convert my mov capture on the fly. The movie quality is worse though.

Monday, 9 March 2009

Cocoa GUI

I'm just in the process of uploading our Cocoa GUI for the first time. Up to now the download has been of my AppleScript droplet, but Simon has done some fantastic work this weekend so that we now have a genuine Mac front end.

If you drop a file onto the VelOCRaptor icon it behaves as it used to - writing a PDF in the current directory and then exiting, although now with a progress spinner and cancel button. We're calling this mini-mode.

If you open the app normally it offers a large drop target for your images (or File/Open). Drop one there and it is converted - once it's done you can select and drag the text out of it, or Save the PDF.

We've lots of polishing to do, but we now have the 2 basic workflows:
  • drop, convert, exit
  • open, convert, save
Also planned are AppleScript and Automator support.

Sunday, 8 March 2009

Samples online

I've posted my killer jpeg - a multi-column colour government monster form, with our output, on our samples page.

Saturday, 7 March 2009

Adwords results

As a little experiment I signed up with AdWords and placed a listing, just in the UK. Google rejected my first ad for trademark infringement, so the revised ad ran for about 9 hours, from around midday. The search terms I used were 'free', 'mac', 'ocr' and 'pdf'.

In that time the ad was shown 3,445 times, and got 3 clicks, all related to the search term 'free'. Looking at Analytics it seems that these clicks were actually searching for 'free games' - obviously VelOCRaptor is such an attractive title that the respondents ignored the words in the ad.

What surprised me is that Google doesn't seem to prefer to run the ad when more than one of the terms is matched. In fact, in all the times I've searched for 'free mac ocr pdf' it's never been shown to me. Given this, 'free' and 'mac' are rubbish keywords, as they trigger (infrequently, as they must be popular) when someone is looking for 'free holiday' or 'mac games'.

So I've changed my strategy and am looking at phrases. I now use 'image to text', 'mac ocr' and 'ocr to pdf'. I figure that these will be shown less frequently, but with far better targeting.

Friday, 6 March 2009

Mac Mac Mac Mac Mac (TM)

Our Google AdWords advert was rejected because it contains the word "Mac".

This leaves a dilemma - I don't want users costing click-through cash only to find we don't support their platform, but I can't say Mac, and I only have 2 x 35 char lines.

Finally plumped for spending precious characters on "computers rhyming with Nac". Thanks for your help, Apple.

iWeb woes

Spent the day replacing the huge mess that was iWeb's published site with hand-crafted XHTML/CSS.
I was already having to process iWeb's output with Ruby to add UserVoice and Google Analytics scripts, and I somehow broke iWeb's page navigation bar when I added a link to Blogger.

So after 3 hours on the bike at lunchtime it's been a happy few hours working out how to centre pages and highlight the current page in CSS.

Please let me know if it doesn't work in your browser.