Obscured text systems are a widely used anti-spam tool.
Crumbling texts and books are being digitised thanks to anti-spam tools.

To thwart spammers, many websites force visitors to transcribe obscured words or characters before they get access.

Now, instead of random words, many sites are taking text from old books and documents that have been scanned by character reading software.

The words supplied are those the software cannot read but humans can, helping to complete the conversion of old texts to digital form.

Site seeing

The obscured text systems are called Captchas (Completely Automated Public Turing test to tell Computers and Humans Apart) and are widely used by websites to stop scammers and spammers exploiting them to send out junk mail or harvest addresses.

It is estimated that Captcha schemes are used about 100 million times every day.

Created by Luis von Ahn at Carnegie Mellon University in Pittsburgh, the Recaptcha project scoops up words that optical character reading software has marked as unreadable by computers.

In some documents, where ink has faded and paper has yellowed, the character reading software can flag up to 20% of words as indecipherable.

The hard-to-read words are then farmed out to the many thousands of sites that have signed up to be Recaptcha partners.

Each unreadable word is supplied to a site along with a control word, whose correct reading is already known, to check that the person answering is human.

The responses to the obscured text are added to a database, and particularly mangled text is put before several people to ensure it is read accurately.

Reporting in the journal Science, the Recaptcha team says the scheme is about 99.1% accurate: as good as professional transcribers and beyond the limit demanded by archivists.

About 40,000 sites have signed up to use words supplied by Recaptcha, and it now collects about four million responses every day.

In the last year it has helped resolve more than 440 million words and has just helped to complete the conversion of the entire archive of the New York Times from 1908 into digital form.
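
For readers curious how pairing a control word with an unreadable word might work in practice, here is a minimal sketch in Python. It is not the Recaptcha team's code: the class name `WordPool`, its methods, and the thresholds `min_votes` and `min_agreement` are all hypothetical, chosen only to illustrate the idea of checking the control word first and then accepting a reading once several people agree.

```python
from collections import Counter, defaultdict


class WordPool:
    """Hypothetical illustration of the control-word scheme (not Recaptcha's code)."""

    def __init__(self, min_votes=3, min_agreement=0.75):
        # Both thresholds are assumptions made for this sketch.
        self.min_votes = min_votes
        self.min_agreement = min_agreement
        # Maps an unreadable word's id to a tally of the readings people gave.
        self.votes = defaultdict(Counter)

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Return True if the control word was typed correctly.

        Only then is the reading of the unreadable word recorded.
        """
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False
        self.votes[unknown_id][unknown_answer.strip().lower()] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed reading, or None if more answers are needed."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        total = sum(counts.values())
        if total < self.min_votes:
            return None
        reading, n = counts.most_common(1)[0]
        return reading if n / total >= self.min_agreement else None
```

In this sketch a word stays unresolved until enough people who passed the control check have answered and a large enough share of them agree, which mirrors the article's point that mangled text is shown to several people.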