Handwriting & Speech Recognition – Unlocking the potential of digital archives
For more than two decades, national libraries, universities, and other organizations the world over have been actively digitizing their collections. The content includes rare books, journals, newspapers and a good deal of unorganized handwritten material such as letters, personal journals, and travelogues.
These digitization initiatives have enabled users and researchers to access content that was otherwise inaccessible. In most cases, however, the scope of the digitization is restricted to simple scanning and dirty (uncorrected) OCR (Optical Character Recognition). The lack of accurate metadata and proper classification reduces the discoverability of the content, and thus diminishes its value.
In the recent past, many national libraries and digital content providers have expanded the scope of digitization to include accurate metadata (selected), article classification and content tagging. This expansion has indeed improved the usability of the digitized content.
Let us take the case of the British Library (BL) (https://www.bl.uk/catalogues-and-collections/digital-collections), which is probably the gold standard for digitized content.
Here, one can search the British Newspaper Archive (though it is behind a paywall when accessed from outside the BL) and the Business & Management sections quite easily. The metadata has been captured to a high level of accuracy. In the case of newspapers, the articles have been segmented and categorized to very high accuracy.
When one chooses to search the Manuscripts section, however, one finds that searchability is restricted to keywords. In the case of sounds, it is limited to a set of names, dates and probably some important events.
The increasing difficulty of searching the collection for specific content is quite evident. It is caused by the lack of production-quality tools that can “read” handwritten and spoken material and create the content in digital form, as OCR does for printed material.
Luckily, the emerging technologies for handwriting recognition and speech recognition have been pushing the boundaries. Today, production-ready applications for handwritten text recognition (for English and other languages using the Roman script) extract text to a reasonable level of accuracy (65%-70%). While handwritten text recognition technology has been around for a while, its usage was mostly restricted to the recognition of numerals, personal names or geographical names. With the advanced applications now available, the text of old journals and letters (even those dating back to the 1800s) can be extracted, thus making the documents more discoverable.
Similar improvements have been seen in the speech recognition area as well. The progress in speech recognition will make millions of hours of audio and video recordings held by broadcasting agencies, news agencies and other archives instantly usable. Combined with timestamping, researchers can locate the precise instant when their “keyword” is spoken. This will spare researchers hours of patient listening, much of it spent only to rule out content.
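As a rough illustration of the timestamping idea, the sketch below assumes a speech recognizer has already produced a transcript in which each word carries its start time in seconds. The TimedWord structure, the function names and the sample transcript are all hypothetical and do not reflect the output format of any particular product.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    """One recognized word and the time (in seconds) at which it starts.

    Hypothetical structure; real recognizers use their own output formats.
    """
    text: str
    start: float


def find_keyword(words, keyword):
    """Return the start times of every occurrence of keyword (case-insensitive)."""
    target = keyword.casefold()
    return [
        w.start
        for w in words
        # Strip surrounding punctuation so "archive." still matches "archive".
        if w.text.strip(".,;:!?\"'").casefold() == target
    ]


def format_timestamp(seconds):
    """Format a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Made-up sample data standing in for a recognizer's output.
transcript = [
    TimedWord("The", 3725.2),
    TimedWord("archive", 3725.6),
    TimedWord("opened", 3726.1),
    TimedWord("today.", 3726.7),
    TimedWord("Archive", 3790.0),
    TimedWord("staff", 3790.5),
]

for start in find_keyword(transcript, "archive"):
    print(format_timestamp(start))
```

A researcher could then jump straight to each reported offset in the recording instead of listening from the beginning.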
The advent of digital archives was hugely beneficial to researchers: it increased the ease of access, blurred geographical boundaries, and so on. Advanced OCR capabilities, combined with accurate metadata tagging and classification, enhanced this experience. Researchers could now not only access the books or content they were looking for but also identify other content that matches their criteria (and perhaps take a deeper read, if it is relevant).
The rapid expansion in Speech Recognition & Handwritten Text Recognition software is sure to take this experience to the next level. This also opens the door for National Libraries to accelerate their digitization programs.
People without the knowledge of their history, origin, and culture are like trees without roots, goes an old saying. The latest Handwritten Text Recognition and Speech Recognition technology will help us discover our roots better.
Authored by Natraj Kumar, VP & Head – ITES & BPO, HTC Global Services