Guest guest
Posted January 21, 2004

- "Bhattathiri" <mpmahesh Tuesday, January 20, 2004 7:15 PM

UB researchers, versed in Sanskrit and computer science, develop tools to bridge the digital divide

Software will boost Web access to Indian-language documents

So, you think searching for things in English on the Internet is frustrating?

Well, try searching for documents written in ancient Sanskrit, modern Hindi and any of dozens of Indian and South Asian languages that are based on the beautiful, intricate symbols of the Devanagari script.

Putting this valuable content online from printed sources in Devanagari requires optical character recognition (OCR), the tool needed to turn any printed text document into a digital one.

"The lack of a good OCR for Devanagari has made it very difficult to make available on the Web the vast majority of Devanagari documents," said Venu Govindaraju, Ph.D., associate director of the University at Buffalo's Center of Excellence in Document Analysis and Recognition (CEDAR) and UB professor of computer science and engineering.

Now, with funding from the National Science Foundation, Govindaraju and his UB colleagues are taking a major step toward boosting online access to these documents.

The UB researchers happen to share not only expertise in machine-print and handwriting recognition, but also a rare passion for -- and fluency in -- Sanskrit and other Indian languages.

Their project, funded under a $487,000 grant from the National Science Foundation's International Digital Libraries initiative, endeavors to make Devanagari documents, ranging from ancient Sanskrit masterpieces, such as the Bhagavadgita and the Vedas, to contemporary documents in Hindi, Marathi and other Indian languages, easily accessible on the Web.

The researchers, based at CEDAR, have created a software tool that is the first step in developing OCR for Devanagari, ultimately allowing documents in this script to be widely searchable on the Web.

Govindaraju, the principal investigator, will present the tool on March 11 at the 13th International Workshop on Research Issues on Data Engineering in Hyderabad, India.

The UB researchers expect to make it available for free on the Web by the end of March.

"We are developing machine technologies to read Devanagari documents, whether they are contemporary documents written in Hindi or ancient documents that were handwritten on palm leaves," said Govindaraju.

The project, which involves collaboration with the Indian Statistical Institute in Kolkata, one of India's premier research institutions, takes an important step toward bridging the digital divide between the developed world and some developing nations, according to the UB researchers.

"The half-billion people around the world whose main language is Hindi, or based on Devanagari, are totally missing out on the 'information revolution,'" said Govindaraju. "In IT, the native languages all have taken a back seat."

While Sanskrit has been considered a "dead" language, he noted that in his native India a movement to revive it in both written and spoken forms has been gaining ground, and in certain regions schools are including Sanskrit in their curricula.

He and his UB colleagues on the project are among those in the U.S. who have rediscovered the language; they teach Sanskrit to their own children and hold classes in it at the Hindu Cultural Society of Western New York.

"The Indian civilization is 5,000 years old," said Govindaraju. "So there are many, many documents written in Devanagari script, but if we want to include them in a digital library in order to preserve access to them, we need to develop software that recognizes the script."

OCR, the UB researchers explain, essentially "trains" the computer to correctly interpret the images of a particular alphabet based on "truthed" data, that is, numerous scanned images of characters or words together with their interpretations, recorded by humans who have visually examined the original images.

About 15 years ago, UB's CEDAR, the largest research center in the world devoted to developing new technologies that can recognize and read handwriting, developed the first comprehensive OCR for handwritten documents in English.

That turned out to be a milestone, spurring numerous new research projects into handwriting recognition that led to some of the applications now taken for granted, such as personal digital assistants.

"Similarly, we are expecting that the development of benchmarked OCR for Devanagari will trigger a groundswell of research in machine-reading technologies for these Indian languages," said Govindaraju.

To develop benchmarked OCRs, the UB researchers have constructed a dataset of 400 pages of Hindi and Sanskrit documents from books and periodicals, both ancient and contemporary, that is representative of the huge variety of documents available in these languages.

The researchers have used the tool they developed to record information about these documents that indicates how OCR for Devanagari should interpret each word. The researchers also plan to develop character databases and on-line dictionaries, text corpora and other tools for linguistic analysis that will be invaluable to the OCR community.

"The availability of our truthing and evaluation tool, together with the availability of new truthed Devanagari data, will spur greater research in the development of Devanagari OCR," said Srirangaraj Setlur, Ph.D., senior research scientist at UB's CEDAR and co-investigator.

Vemulapati Ramanaprasad, Ph.D., senior research scientist at UB's CEDAR, is also a co-investigator.

In the future, the UB researchers plan to extend the scope of this tool to include OCR evaluation for other Indian languages such as Kannada, Malayalam, Tamil and Telugu, which do not use the Devanagari script, as well as for Arabic and Urdu.

Contact: Ellen Goldbaum, goldbaum
Phone: 716-645-5000 ext 1415
Fax: 716-645-3765