IndiaDivine.org

Scanning Sanskrit: Why Everything Old is New Again


Guest guest

Sanskrit, in which classical Indian literature was composed, is among the world's oldest recorded languages. But putting works created over the past 3,000 years onto the web has not been easy.

 

Documents written in Devanagari, the script used for Sanskrit and other South Asian languages, can be scanned as images. But optical character recognition (OCR) software for turning Devanagari texts into digital information that can be searched and reformatted has not been commercially available.

 

That has not been for lack of effort. Because Devanagari is also used for widely spoken contemporary languages such as Hindi, several research teams based in India are working on OCR technology to capture it.

 

But Venugopal Govindaraju, of the Centre of Excellence in Document Analysis and Recognition (CEDAR) at the State University of New York, said a lack of collaboration had limited their efforts.

 

"They report their research in journals and at conferences but they don't make the data sets they develop available to other institutions," he said.

 

In an effort to accelerate the development of OCR software for Devanagari (meaning "city of immortals"), CEDAR and the Indian Statistical Institute are distributing a script-recognition tool that they hope will become the international standard for Devanagari.

 

Their script-recognition software, which can be downloaded free at www.cedar.buffalo.edu/ILT, can separate lines and individual characters written in the flowing script.
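The article does not say how the tool separates lines and characters. As a rough illustration only, one textbook approach is projection-profile segmentation: count the ink in each pixel row to find text lines, then, within a line, drop the horizontal headline that joins Devanagari letters and count the ink in each column to find gaps between characters. The sketch below assumes a binarized page given as lists of 0/1 values; the function names and the tiny sample page are hypothetical and are not taken from the CEDAR software.

```python
def ink_runs(profile):
    """Return (start, end) index pairs, end exclusive, for each run of non-zero counts."""
    runs, start = [], None
    for i, ink in enumerate(profile):
        if ink and start is None:
            start = i                      # a run of ink begins
        elif not ink and start is not None:
            runs.append((start, i))        # the run ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a page into text lines using the horizontal projection (ink per row)."""
    return ink_runs([sum(row) for row in image])


def segment_characters(line_rows):
    """Split one text line into character spans using the vertical projection.

    Assumption: the row with the most ink is the headline that joins the
    letters, so it is left out before counting ink per column.
    """
    headline = max(range(len(line_rows)), key=lambda y: sum(line_rows[y]))
    body = [row for y, row in enumerate(line_rows) if y != headline]
    width = len(line_rows[0])
    columns = [sum(row[x] for row in body) for x in range(width)]
    return ink_runs(columns)


# Hypothetical 6x7 binarized page: 1 = ink, 0 = background.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],   # headline of the first text line
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0],   # a second, very short text line
]

line_spans = segment_lines(page)
first_line = page[line_spans[0][0]:line_spans[0][1]]
char_spans = segment_characters(first_line)
```

Real scans are skewed, noisy and full of conjunct letters, so a working recognizer needs far more than this; the sketch only shows why the connecting headline makes Devanagari harder to split than printed Roman text.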

 

It then offers an on-screen transliteration in Roman characters for proofreading.
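The article does not specify which romanization scheme the tool displays. As a hedged sketch of what such a transliteration step involves, the code below maps a small, deliberately incomplete subset of Unicode Devanagari to IAST-style Roman text. The tables and the function name are illustrative assumptions, not the tool's actual mapping. The main point is that each consonant carries an inherent "a" unless a vowel sign or a virama follows it.

```python
# Small, incomplete tables for illustration; a real mapping covers the whole script.
CONSONANTS = {
    "\u0915": "k", "\u0917": "g", "\u0924": "t", "\u0926": "d", "\u0928": "n",
    "\u092e": "m", "\u0930": "r", "\u0935": "v", "\u0938": "s",
}
VOWEL_SIGNS = {                      # dependent vowel signs (matras)
    "\u093e": "\u0101",              # aa sign -> a with macron
    "\u093f": "i",
    "\u0940": "\u012b",              # ii sign -> i with macron
    "\u0941": "u",
    "\u0942": "\u016b",              # uu sign -> u with macron
    "\u0943": "\u1e5b",              # vocalic r sign -> r with dot below
    "\u0947": "e",
    "\u094b": "o",
}
OTHER = {
    "\u0905": "a", "\u0906": "\u0101", "\u0907": "i", "\u0908": "\u012b",
    "\u0902": "\u1e43",              # anusvara -> m with dot below
    "\u0903": "\u1e25",              # visarga -> h with dot below
}
VIRAMA = "\u094d"                    # suppresses the inherent vowel


def transliterate(text):
    """Romanize Devanagari text; characters outside the tables pass through unchanged."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:       # explicit vowel replaces the inherent "a"
                out.append(VOWEL_SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:          # bare consonant, no vowel at all
                i += 1
            else:
                out.append("a")          # inherent vowel
        else:
            out.append(OTHER.get(ch, ch))
        i += 1
    return "".join(out)


# The word for "Sanskrit", written with explicit code points.
sanskrit_roman = transliterate("\u0938\u0902\u0938\u094d\u0915\u0943\u0924")
```

A proofreader comparing this kind of output with the scanned page can spot misrecognized letters without needing a Devanagari font on screen.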

 

Govindaraju said that in the early 1990s CEDAR gave away similar tools it had created for the United States Postal Service to analyse handwriting. "That spurred work in the Roman alphabet on handwriting recognition," he said. "There has been tremendous progress since then."

 

Those earlier tools allowed CEDAR to develop the first successful Roman alphabet handwriting recognition system capable of operating on a mass scale. Govindaraju said the tools became the standard for comparing results among all handwriting recognition groups.

 

Work by those groups also gave rise to handwriting recognition programs such as Graffiti 2, for users of some newer hand-held computers, and Microsoft Windows Journal, for users of the Windows XP tablet PCs.

 

Like the Roman alphabet and some Japanese and Chinese characters, Devanagari may eventually be embraced by makers of commercial recognition software.

 

Working from a common base for Devanagari would make it easier for researchers to exchange data from their efforts. Most of the work being done focuses on refining the recognition software to cope with variations, a task that is likely to be more complex with Devanagari script than the Roman alphabet.

 

With OCR technology, Sanskrit documents could be transformed into digital text that could be viewed on computer monitors using existing Devanagari screen fonts.

 

"Sanskrit is a dying language," Govindaraju said, "but I love Sanskrit."

 

Source: The Age, Melbourne

URL: http://www.theage.com.au/articles/2003/05/28/1053801437724.html?oneclick=true
