Sanskrit Optical Character Recognition Research at University at Buffalo, NY

January 21, 2004

"Bhattathiri" <mpmahesh

Tuesday, January 20, 2004 7:15 PM

> UB researchers, versed in Sanskrit and computer science, develop tools to

> bridge the digital divide

>

> Software will boost Web access to Indian-language documents

>

> So, you think searching for things in English on the Internet is

> frustrating?

>

> Well, try searching for documents written in ancient Sanskrit, modern Hindi

> and any of dozens of Indian and South Asian languages that are based on the

> beautiful, intricate symbols of the Devanagari script.

>

> Putting this valuable content online from printed sources in Devanagari

> requires optical character recognition (OCR), the tool necessary to turn

> any printed text document into a digital one.

>

> "The lack of a good OCR for Devanagari has made it very difficult to make

> available on the Web the vast majority of Devanagari documents," said Venu

> Govindaraju, Ph.D., associate director of the University at Buffalo's Center

> of Excellence in Document Analysis and Recognition (CEDAR) and UB professor

> of computer science and engineering.

>

> Now, with funding from the National Science Foundation, Govindaraju and his

> UB colleagues are taking a major step toward boosting online access to these

> documents.

>

> The UB researchers happen to share not only expertise in machine-print and

> handwriting recognition, but also a rare passion for -- and fluency in --

> Sanskrit and other Indian languages.

>

> Their project, funded under a $487,000 grant from the National Science

> Foundation's International Digital Libraries initiative, endeavors to make

> Devanagari documents, ranging from ancient Sanskrit masterpieces, such as

> the Bhagavadgita and the Vedas, to contemporary documents in Hindi, Marathi

> and other Indian languages, easily accessible on the Web.

>

> The researchers, based at CEDAR, have created a software tool that is the

> first step in developing OCR for Devanagari, ultimately allowing documents

> in these scripts to be widely searchable on the Web.

>

> The tool will be presented by Govindaraju, who is the principal investigator, on

> March 11 at the 13th International Workshop on Research Issues on Data

> Engineering in Hyderabad, India.

>

> The UB researchers expect to make it available for free on the Web by the

> end of March.

>

> "We are developing machine technologies to read Devanagari documents,

> whether they are contemporary documents written in Hindi or ancient

> documents that were handwritten on palm leaves," said Govindaraju.

>

> The project, which involves collaboration with the Indian Statistical

> Institute in Kolkata, one of India's premier research institutions, takes an

> important step toward bridging the digital divide between the developed

> world and some developing nations, according to the UB researchers.

>

> "The half-billion people around the world whose main language is Hindi, or

> based on Devanagari, are totally missing out on the 'information

> revolution,'" said Govindaraju. "In IT, the native languages all have taken

> a back seat."

>

> While Sanskrit has been considered a "dead" language, he noted that in his

> native India a movement to revive it in both written and spoken forms has

> been gaining ground and that, in certain regions, schools are including

> Sanskrit in their curricula.

>

> He and his UB colleagues on the project are among those in the U.S. who have

> rediscovered the language; they teach Sanskrit to their own children and

> hold classes in it at the Hindu Cultural Society of Western New York.

>

> "The Indian civilization is 5,000 years old," said Govindaraju. "So there

> are many, many documents written in Devanagari script, but if we want to

> include them in a digital library in order to preserve access to them, we

> need to develop software that recognizes the script."

>

> OCR, the UB researchers explain, essentially "trains" the computer to

> correctly interpret the images of a particular alphabet based on "truthed"

> data, that is, numerous scanned images of characters or words and their

> interpretation recorded by humans who have visually examined the original

> images.

>

> About 15 years ago, UB's CEDAR, the largest research center in the world

> devoted to developing new technologies that can recognize and read

> handwriting, developed the first comprehensive OCR for handwritten documents

> in English.

>

> That turned out to be a milestone, spurring numerous new research projects

> into handwriting recognition that led to some of the applications now taken

> for granted, such as personal digital assistants.

>

> "Similarly, we are expecting that the development of benchmarked OCR for

> Devanagari will trigger a groundswell of research in machine-reading

> technologies for these Indian languages," said Govindaraju.

>

> To develop benchmarked OCRs, the UB researchers have constructed a dataset

> of 400 pages of Hindi and Sanskrit documents from books and periodicals,

> both ancient and contemporary, that is representative of the huge variety of

> documents available in these languages.

>

> The researchers have used the tool they developed to record information

> about these documents that indicates how OCR for Devanagari should interpret

> each word. The researchers also plan to develop character databases and

> on-line dictionaries, text corpora and other tools for linguistic analysis

> that will be invaluable to the OCR community.
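
For readers curious what scoring an OCR system against "truthed" data can look like, here is a minimal sketch in Python. It is not the CEDAR tool, whose internals the release does not describe. The function names, the sample words, and the assumption that the OCR output is already split into the same word positions as the truth are all illustrative.

```python
import unicodedata


def normalize(word):
    """Strip whitespace and apply Unicode NFC so that canonically
    equivalent Devanagari code point sequences compare equal."""
    return unicodedata.normalize("NFC", word.strip())


def word_accuracy(truth_words, ocr_words):
    """Fraction of truthed words that the OCR output reproduces exactly.

    Assumption: both lists are aligned by position. Words missing from
    the OCR output count as errors; extra OCR words are ignored.
    """
    if not truth_words:
        raise ValueError("truth_words must not be empty")
    matches = sum(
        normalize(t) == normalize(o) for t, o in zip(truth_words, ocr_words)
    )
    return matches / len(truth_words)


# Hypothetical truthed line (human-verified) and a hypothetical OCR result
# in which the second word has lost its final vowel sign.
truth_words = ["धर्मक्षेत्रे", "कुरुक्षेत्रे", "समवेता", "युयुत्सवः"]
ocr_words = ["धर्मक्षेत्रे", "कुरुक्षेत्र", "समवेता", "युयुत्सवः"]

# 3 of the 4 words match exactly.
print(word_accuracy(truth_words, ocr_words))
```

A real evaluation would also need to handle segmentation differences, for example by aligning words with an edit-distance method before counting matches.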

>

> "The availability of our truthing and evaluation tool together with the

> availability of new truthed Devanagari data, will spur greater research in

> the development of Devanagari OCR," said Srirangaraj Setlur, Ph.D., senior

> research scientist at UB's CEDAR and co-investigator.

>

> Vemulapati Ramanaprasad, Ph.D., senior research scientist at UB's CEDAR,

> is also a co-investigator.

>

> In the future, the UB researchers plan to extend the scope of this tool to

> include OCR evaluation for other Indian languages such as Kannada,

> Malayalam, Tamil and Telugu, that do not use the Devanagari script, as well

> as for Arabic and Urdu.

>

> Contact: Ellen Goldbaum, goldbaum

> Phone: 716-645-5000 ext 1415

> Fax: 716-645-3765
