
Making CAPTCHAs less evil?

  • By Paul Crichton
  • 29 May 07, 03:54 PM

The Internet Archive is a not-for-profit organisation that is working to create a digital library by scanning 12,000 books a month. Carnegie Mellon University, which works with the Archive on this project, has developed a way to improve both the speed and accuracy of this process by using CAPTCHAs and the brainpower of the millions of users who come across them every day.

Converting a scanned book into electronic format relies on a process called Optical Character Recognition (OCR), which turns a picture of a page into text. However, this process is not yet 100% accurate. Some words or characters might not be recognised properly, and a human must determine what the word actually is. When this happens, the Internet Archive sends the problem words to Carnegie Mellon University to be deciphered.

The University realised that the task involved in converting books into a digital format, looking at an image of a word and working out what it says, is exactly what a CAPTCHA asks a user to do.

A CAPTCHA is typically an image of a word used to determine whether a user is a human or an automated spam programme. You find them all over the internet, from registration forms to comment forms on blogs, usually with an instruction such as "please type the word in the above graphic into this box". It is estimated that humans solve 60 million CAPTCHAs like this every day.

Carnegie Mellon University has created reCAPTCHA to provide the same protection from spam as a traditional CAPTCHA, but instead of just using a random word for users to type, it uses a word that OCR failed to recognise when books were scanned for the Internet Archive. If webmasters and bloggers adopt reCAPTCHA in significant numbers, then millions of words every day could be accurately deciphered.
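
As a rough illustration of how many people's answers for one unrecognised word might be combined, here is a minimal Python sketch. The article does not say how reCAPTCHA actually reconciles answers, so the majority-vote approach, the function name `resolve_word` and the thresholds `min_votes` and `min_share` are all assumptions made purely for illustration.

```python
from collections import Counter


def resolve_word(answers, min_votes=3, min_share=0.6):
    """Return the agreed transcription of a word, or None if there is no consensus.

    Hypothetical helper: reCAPTCHA's real method is not described in the
    article; this simply takes a majority vote over the human answers.
    """
    # Normalise answers so "Darwin " and "darwin" count as the same vote.
    normalised = [a.strip().lower() for a in answers if a.strip()]

    # Wait until enough people have looked at the word (assumed threshold).
    if len(normalised) < min_votes:
        return None

    # Accept the most common answer only if a clear majority agrees on it.
    word, count = Counter(normalised).most_common(1)[0]
    if count / len(normalised) >= min_share:
        return word
    return None


if __name__ == "__main__":
    print(resolve_word(["Darwin", "darwin ", "Darmh", "darwin"]))
```

A word with too few answers, or with answers that disagree too much, stays unresolved in this sketch and would simply be shown to more users.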

That said, there is still a fundamental issue with reCAPTCHA and accessibility. Too many of the people who stand to benefit from better quality digital books in the long run will still be blocked from their short-term goals of registering with a website or commenting on a blog.

The pictures of words remain invisible and therefore inaccessible to screen reader users. Whilst an audio CAPTCHA is provided, it isn't easy to use because of background noise. Neither version is accessible to deafblind people either. So there is some way still to go before CAPTCHAs are fully redeemed.

But at least if all CAPTCHAs were like this, visually impaired people might not be quite so hacked off the next time they come across one and can't log on to a blog because of it.

Those of you with knowledge of the US bookshare.org project might suddenly have lightbulbs flashing above your heads. Could libraries of scanned texts for visually impaired people, created by volunteers, have their work augmented by such a project? Visually impaired people regularly have to read less than perfect scanned books, so perhaps a similar system could be adopted by bookshare.org and by the long-awaited UK equivalent championed by the Right to Read collaboration?

Blind students everywhere could finally be referring to scholars such as 'Charles Darwin' in their essays rather than the less well known 'char1e5 Darmh', for instance.

Comments

We never use the captcha method, due to accessibility. But we have definitely been tempted. If only there wasn't any SPAM to worry about!

I can sympathise - with net-guide.co.uk, I get hundreds of spam emails a day. Either no spam or a fully accessible captcha would be a godsend!

Just do not use Captcha and use a different anti spam idea like type word, or add sum, tick box etc
