R.M.K. Sinha
Department of Computer Science and Engineering
Indian Institute of Technology, Kanpur
RESEARCH & PROJECTS
R.M.K. Sinha works primarily in the area of Applied Artificial Intelligence. He applies AI techniques to document processing, text recognition, computer vision, speech processing, natural language processing and the design of knowledge-based systems. Intercommunicating layers of knowledge and their integration are key to his design approach. He also applies artificial neural networks and fuzzy computing techniques to pattern recognition. In natural language processing, one of the primary aims is to design machine aids for translation from English to Indian languages and vice versa, and among Indian languages. His approach is based on a new concept of Pseudo-Interlingua, together with a word expert model utilizing Karak theory, a pattern-directed rule base and a hybrid example base. His investigations also include the design and development of special parallel architectures for computer vision and natural language processing.
R.M.K. Sinha has been working on R&D for Indian language technology for the last three decades, and his research has touched and provided direction to almost all facets of providing technological solutions to the problem of overcoming the language barrier in the country. The multilingual GIST technology and several other packages for Indian language processing have been developed under his supervision.
Some of the major projects that have been initiated and executed / currently being executed under his supervision are the following:
In order to process data in Indian languages, the first task was to develop a mechanism for inputting the Indian scripts and then representing them internally for editing, storage, retrieval and further processing.
Romanization of Indian scripts was not a desirable approach, as it may not be error-free. Indian scripts are phonetic in nature (i.e. you write the same way as you speak), and each is a logical composition of constituent symbols in two dimensions.
Our scheme for keyboarding and internal representation exploited the phonetic nature of our scripts. We proposed that symbols be input in their phonetic order rather than in the visual order followed on mechanical typewriters.
In fact, the phonetic order is the way in which children learn to write the script. Since this phonetic ordering applies to all Indian scripts, our keyboarding scheme became applicable to all of them.
Indian scripts typically have 12-15 vowels, 35-40 consonants and a few diacritical marks. In addition, for each vowel there is a corresponding modifier symbol, and for each consonant there is a corresponding pure consonant form (called a half-letter).
This makes the total set of symbols larger than a normal keyboard can accommodate.
Our keyboard layout and keyboarding scheme used the phonetic groupings and the derivational property to limit the number of physical keys and achieve a logical layout based on frequency analysis of the symbols of the script.
This became possible because the task of inputting was kept distinct from the task of rendering for display.
This gave birth to the INSCRIPT keyboard and keyboarding scheme.
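The separation between phonetic input order and visual rendering can be illustrated with a minimal sketch. It uses Unicode Devanagari code points as a stand-in (an assumption; the ISSCII-8 code values are not reproduced here), and the helper names are hypothetical.

```python
import unicodedata

KA = "\u0915"            # DEVANAGARI LETTER KA
VOWEL_SIGN_I = "\u093f"  # modifier symbol (matra) for the vowel I
VIRAMA = "\u094d"        # sign that removes the inherent vowel


def typed_sequence(symbols):
    """Store symbols in the order they are spoken, i.e. as they are typed."""
    return "".join(symbols)


def half_letter(consonant):
    """Derive the pure-consonant (half-letter) form from a full consonant."""
    return consonant + VIRAMA


def describe(text):
    """List the Unicode names of the stored symbols, in storage order."""
    return [unicodedata.name(ch) for ch in text]


ki = typed_sequence([KA, VOWEL_SIGN_I])
print(describe(ki))
print(describe(half_letter(KA)))
```

The vowel sign I is drawn to the left of the consonant, yet it is stored after it; its visual placement is left to the rendering step. Deriving the half-letter from a consonant plus one extra sign shows how a small set of keys can cover a much larger set of symbols.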
Similarly, for the internal representation, an 8-bit Indian Standard Script Code for Information Interchange (ISSCII-8) was designed utilizing the phonetic properties of the script.
ISSCII-8 is an extension of ASCII that was designed during the early 1980s, and it caters to the entire set of Indian scripts in a uniform way.
Editing operations became very much like those for English.
Also, transliteration among Indian scripts became simply a matter of switching the script rendering device, as the ISSCII code for text in all Indian scripts remained the same.
ISSCII-8 underwent further modifications, and a modified version was accepted by the Bureau of Indian Standards as the ISCII (8-bit Indian Script Code for Information Interchange) code in 1991. ISCII forms the basis for the Unicode code assignments for Indian scripts.
The paper on "Computer processing of Indian languages and scripts - Potentialities and Problems", Jour. of Inst. Electron. & Telecom. Engrs., vol.30,no.6, 1984, carries a detailed discussion on various aspects of coding.
Some Relevant Publications
- R. Mahesh K. Sinha, A Journey from Indian Scripts Processing to Indian Language Processing, IEEE Annals of the History of Computing, Jan-March 2009. pp. 2-25.
- R.M.K. Sinha, ‘Standardizing Linguistic Information - An Overview’, Proceedings of Second Regional Workshop on Computer Processing of Asian Languages, Tata McGraw-Hill, New Delhi, 1992, pp. 272-290.
- R.M.K. Sinha, ‘Non-Latin Information Systems: Some Basic Issues’, in Information Processing, 1986, H. Kugler (Ed.), Elsevier Science Publishers, 1986. Conference Proceedings.
- R.M.K. Sinha, ‘Computer processing of Indian languages and scripts - potentialities and problems’, Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 133-49.
- R.M.K. Sinha and A. Raman, ‘A modular Indian language data terminal’, Computer Graphics, vol. 14, 1980, pp. 39-72.
- M.P. Sastri, A. Raman and R.M.K. Sinha, ‘An universal I/O device for Indian scripts’, Annual Convention of Computer Society of India, 1978, pp. 151-165.
- R.M.K. Sinha, ‘Machine oriented Devanagari script (MODS) from information theoretic viewpoint’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- P.V.H.M.L. Narasimham and R.M.K. Sinha, ‘Phonetically coded keyboarding in Indian languages’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- R.M.K. Sinha and H.V. Sahasrabuddhe, ‘Hyphenation in Indian scripts for computer aided printing’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- A.K. Pathak, A. Raman and R.M.K. Sinha, ‘A modular Indian language I/O terminal’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- R.M.K. Sinha, H.V. Sahasrabuddhe and V.K. Vaishnavi, ‘Mechanization of Indian scripts’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- R.M.K. Sinha, ‘Teaching script on a digital computer’, Jour. of Inst. Telecom. Engrs., Nov. 1976, pp. 720-22.
- R.M.K. Sinha and H.N. Mahabala, ‘MODS - machine oriented Devanagari script’, Jour. of Inst. Telecom. Engrs., vol. 19, no. 3, 1973, pp. 623-28.
- R.M.K. Sinha (Chief Investigator), ‘Integrated Devanagari Computer’, Project Report, Dept. of Elect. Engg., I.I.T. Kanpur, 1984.
- R.M.K. Sinha, ‘Character Code Standardization’, a report prepared for UNESCO, Paris, 1992.
In 1983, the Department of Electronics, Govt. of India, sponsored a project on the design and development of an 'Integrated Devanagari Computer' (IDC) terminal.
In this project we implemented our basic strategies: a phonetic keyboarding scheme for Devanagari input, our ISCII code for internal representation of the script, and a script composition module for rendering the script on the display and other output devices.
The IDC terminal was designed in a record time of about 8 months and was demonstrated at the Third World Hindi Convention in Delhi.
It was developed using the Intel 8086 processor with multitasking firmware.
The IDC project was later extended to implement the same technology on the 32-bit 68000 microprocessor, and the outcome was named GIST (Graphics and Indian Script Terminal) technology.
Since ISCII was designed to cater to all Indian scripts by exploiting their commonality and phonetic nature, the GIST technology could cater to all Indian scripts merely by incorporating a script-specific composition and rendering module.
A number of companies bought this technology for manufacturing multilingual computer terminals. The GIST technology was adapted by the Centre for Development of Advanced Computing (C-DAC) when the research engineer working on the project at IIT Kanpur (Mohan Tambe) joined C-DAC and took the technology with him without a formal transfer of technology.
The GIST technology represented a major breakthrough in solving the complex problem of the man-machine linguistic interface for Indian languages. Some of the key features that make it user friendly are: a natural, phonetically oriented keyboarding scheme that converts directly to internal representation codes (ISSCII-8); a human-engineered keyboard layout; a display that changes dynamically as the input progresses; built-in intelligence to disallow illegal compositions, such as attaching two vowel modifiers to the same character; and automatic transliteration from one Indian script to another.
More details of the IDC/GIST technology can be found in the Project Report on Integrated Devanagari Computer, Dept. of Elect. Engg., I.I.T. Kanpur, 1984.
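The check that disallows illegal compositions can be sketched as a keystroke filter. This is only an illustration under stated assumptions: it uses Unicode Devanagari ranges rather than the original internal codes, it treats U+093E to U+094C as the vowel modifiers, and the function names are hypothetical.

```python
NUKTA = "\u093c"  # dot placed below a consonant


def is_consonant(ch):
    """True for the basic Devanagari consonants (KA to HA)."""
    return 0x0915 <= ord(ch) <= 0x0939


def is_vowel_modifier(ch):
    """True for the dependent vowel signs assumed in this sketch."""
    return 0x093E <= ord(ch) <= 0x094C


def accept_keystroke(buffer, ch):
    """Return (new_buffer, accepted) for one typed symbol.

    A vowel modifier is accepted only directly after a consonant
    (optionally carrying a nukta), so a second modifier on the same
    character is refused.
    """
    if is_vowel_modifier(ch):
        prev = buffer[-1] if buffer else ""
        if not (prev and (is_consonant(prev) or prev == NUKTA)):
            return buffer, False
    return buffer + ch, True


buf = ""
for key in ["\u0915", "\u093f", "\u0941"]:  # KA, sign I, sign U
    buf, ok = accept_keystroke(buf, key)
    print(hex(ord(key)), "accepted" if ok else "rejected")
```

A real composition module would need many more rules (conjuncts, other diacritics); the point here is only that validation happens on the phonetic code sequence, before rendering.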
Some Relevant Publications
- R. Mahesh K. Sinha, A Journey from Indian Scripts Processing to Indian Language Processing, IEEE Annals of the History of Computing, Jan-March 2009. pp. 2-25.
- R.M.K. Sinha, ‘Standardizing Linguistic Information - An Overview’, Proceedings of Second Regional Workshop on Computer Processing of Asian Languages, Tata McGraw-Hill, New Delhi, 1992, pp. 272-290.
- R.M.K. Sinha, ‘Non-Latin Information Systems: Some Basic Issues’, in Information Processing, 1986, H. Kugler (Ed.), Elsevier Science Publishers, 1986. Conference Proceedings.
- R.M.K. Sinha, ‘Computer processing of Indian languages and scripts - potentialities and problems’, Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 133-49.
- R.M.K. Sinha and A. Raman, ‘A modular Indian language data terminal’, Computer Graphics, vol. 14, 1980, pp. 39-72.
- M.P. Sastri, A. Raman and R.M.K. Sinha, ‘An universal I/O device for Indian scripts’, Annual Convention of Computer Society of India, 1978, pp. 151-165.
- R.M.K. Sinha, ‘Machine oriented Devanagari script (MODS) from information theoretic viewpoint’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- P.V.H.M.L. Narasimham and R.M.K. Sinha, ‘Phonetically coded keyboarding in Indian languages’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- A. Raman, P.V.H.M.L. Narasimham and R.M.K. Sinha, ‘System modules for business machines, computer terminals and printing in Indian languages’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- A.K. Pathak, A. Raman and R.M.K. Sinha, ‘A modular Indian language I/O terminal’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- K.P. Laturkar and R.M.K. Sinha, ‘Devanagari script composition from phonetically coded symbol strings’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- M.P. Sastri, A. Raman and R.M.K. Sinha, ‘An Indian language script generator for CRT terminals and matrix printers’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- R.M.K. Sinha, H.V. Sahasrabuddhe and V.K. Vaishnavi, ‘Mechanization of Indian scripts’, Symposium on Linguistic Implications of Computer Based Information Systems, Delhi, Nov. 10-12, 1978.
- R.M.K. Sinha, ‘Teaching script on a digital computer’, Jour. of Inst. Telecom. Engrs., Nov. 1976, pp. 720-22.
- R.M.K. Sinha and H.N. Mahabala, ‘MODS - machine oriented Devanagari script’, Jour. of Inst. Telecom. Engrs., vol. 19, no. 3, 1973, pp. 623-28.
- R.M.K. Sinha (Chief Investigator), ‘Integrated Devanagari Computer’, Project Report, Dept. of Elect. Engg., I.I.T. Kanpur, 1984.
- R.M.K. Sinha, ‘Character Code Standardization’, a report prepared for UNESCO, Paris, 1992.
Transliteration among Indian scripts is easily achieved using ISCII (Indian Script Code for Information Interchange).
ISCII has been designed using the phonetic property of Indian scripts and caters to the superset of all Indian scripts.
By attaching an appropriate script rendering mechanism to ISCII, transliteration from one Indian script to another is achieved in a natural way.
However, transliteration from Roman to an Indian script requires heuristics to convert the non-phonetic Roman text to its probable intended spoken form before it can be transliterated. Similarly, transliteration from an Indian script to Roman requires a standardized mapping table so that the result is easily readable.
In our work on transliteration, we have suggested heuristics and tables.
Several other workers have come up with their own suggestions.
Recently, TDIL has come up with a standardization of this table, called INSROT, which uses only lower-case letters to facilitate standard search.
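Transliteration among Indian scripts by switching only the rendering can be sketched with Unicode, whose Indic blocks follow a parallel layout inherited from ISCII. This is an illustrative assumption rather than the ISCII mechanism itself: the function name is hypothetical, only a few blocks are listed, and not every position is assigned in every script, so a practical converter needs per-script exceptions.

```python
# Start of each 128-position Unicode block (a small, illustrative selection).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "telugu": 0x0C00,
}

# Danda and double danda live in the Devanagari block but are shared
# by the other scripts, so they are left untouched.
SHARED_OFFSETS = {0x64, 0x65}


def switch_script(text, source, target):
    """Re-express text in another script by shifting block offsets."""
    src = BLOCK_START[source]
    shift = BLOCK_START[target] - src
    out = []
    for ch in text:
        offset = ord(ch) - src
        if 0 <= offset < 0x80 and offset not in SHARED_OFFSETS:
            out.append(chr(ord(ch) + shift))
        else:
            out.append(ch)  # spaces, digits, Latin text, shared punctuation
    return "".join(out)


ki = "\u0915\u093f"  # Devanagari KA + vowel sign I
print([hex(ord(c)) for c in switch_script(ki, "devanagari", "bengali")])
```

Because the phonetic position of each symbol is the same in every block, no lookup table is needed for the common symbols; only the final rendering differs.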
Some Relevant Publications
- R. Mahesh K. Sinha, A Journey from Indian Scripts Processing to Indian Language Processing, IEEE Annals of the History of Computing, Jan-March 2009, pp. 2-25.
- R.M.K. Sinha, ‘Computer processing of Indian languages and scripts - potentialities and problems’, Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 133-49.
- R.M.K. Sinha and B. Srinivasan, ‘Machine transliteration from Roman to Devanagari and Devanagari to Roman’, Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 243-45.
For Indian scripts, the concept of spelling is very loose. Writing in Indian scripts is a direct mapping of the inherent phonetics: you write as you speak.
There are geographical variations in the spoken form, and so the spellings vary.
Our approach to designing a spell checker is to develop a user error model for each class of user, where the source of error may be incorrect phonetics, inaccurate inputting or other influences.
The spell checker uses this error model in making suggestions for correcting the error.
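One way to picture an error model driving suggestions is a weighted edit distance in which substitutions that a given class of user tends to make are cheap. The sketch below is an assumption-laden illustration, not the published method: the cost values, the tiny lexicon and the function names are all hypothetical.

```python
def weighted_distance(typed, word, sub_cost):
    """Edit distance where likely confusions cost less than 1."""
    prev = [float(j) for j in range(len(word) + 1)]
    for i, a in enumerate(typed, 1):
        cur = [float(i)]
        for j, b in enumerate(word, 1):
            s = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            cur.append(min(prev[j] + 1.0,      # delete a typed symbol
                           cur[j - 1] + 1.0,   # insert a missing symbol
                           prev[j - 1] + s))   # match or substitute
        prev = cur
    return prev[-1]


def suggest(typed, lexicon, sub_cost, limit=3):
    """Rank lexicon words by weighted distance to the typed word."""
    return sorted(lexicon, key=lambda w: weighted_distance(typed, w, sub_cost))[:limit]


# Hypothetical error model for users who confuse long and short vowel signs.
PHONETIC_ERRORS = {("\u0940", "\u093f"): 0.2, ("\u093f", "\u0940"): 0.2}

LEXICON = [
    "\u0915\u093f\u0924\u093e\u092c",  # kitaab
    "\u091c\u0935\u093e\u092c",        # javaab
    "\u0939\u093f\u0938\u093e\u092c",  # hisaab
]

typed = "\u0915\u0940\u0924\u093e\u092c"  # kitaab typed with a long vowel sign
print(suggest(typed, LEXICON, PHONETIC_ERRORS, limit=1))
```

A separate cost table per user class (for example, one for phonetic confusions and one for adjacent-key slips) would let the same ranking routine serve different kinds of users.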
Some Relevant Publications
- R. Mahesh K. Sinha, A Journey from Indian Scripts Processing to Indian Language Processing, IEEE Annals of the History of Computing, Jan-March 2009, pp. 2-25.
- R.M.K. Sinha, ‘Computer processing of Indian languages and scripts - potentialities and problems’, Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 133-49.
- R.M.K. Sinha and K.S. Singh, ‘A program for correction of single errors in Hindi words’, Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 249-51.
The work on Devanagari OCR started in the early seventies.
Devanagari script is a logical composition of symbols in two dimensions, as opposed to the mere juxtaposition of symbols in Roman.
A methodology for segmenting words into composite characters and decomposing these into constituent symbols was developed.
A pattern description language called PLANG was developed and used in syntactic recognition of Devanagari symbols.
A script composition grammar and a confusion matrix obtained through training were used to recompose the script from the recognized symbols.
This was part of Ph.D. thesis work in 1973.
Subsequently, higher-level knowledge layers interacting with each other, in the form of a word-level dictionary, a language model and confusion matrices obtained through training, formed the primary basis for disambiguation, word hypothesis generation and verification, and for tackling the problem of character fusion and fragmentation.
This technique was used for both English and Devanagari OCR.
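How a dictionary and a confusion matrix can jointly verify a word hypothesis may be sketched as follows. This is a simplified illustration on Roman text with made-up probabilities, not the system described above; the names are hypothetical, and it ignores fused or fragmented characters by comparing only words of equal length.

```python
import math


def char_prob(true_ch, seen_ch, confusion, p_correct=0.9, floor=1e-6):
    """P(classifier outputs seen_ch | true character is true_ch)."""
    if (true_ch, seen_ch) in confusion:
        return confusion[(true_ch, seen_ch)]
    return p_correct if true_ch == seen_ch else floor


def best_word(observed, dictionary, confusion):
    """Pick the dictionary word most likely to have produced the OCR output."""
    best, best_score = None, float("-inf")
    for word in dictionary:
        if len(word) != len(observed):
            continue  # fusion and fragmentation are not modelled here
        score = sum(math.log(char_prob(t, s, confusion))
                    for t, s in zip(word, observed))
        if score > best_score:
            best, best_score = word, score
    return best


# Hypothetical confusion entries, as might be estimated from training data.
CONFUSION = {("l", "1"): 0.08, ("o", "0"): 0.05}
DICTIONARY = ["clear", "clean", "cheer", "color"]

print(best_word("c1ear", DICTIONARY, CONFUSION))
```

A language model would add a prior over words or word sequences to the same score, so that the layers of knowledge contribute to one decision.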
Further work on Devanagari OCR was carried out in a project named DEVDRISHTI, sponsored by TDIL, Govt. of India, on recognition of handprinted Devanagari script.
The investigations focused on developing new features and on integrating decision making that takes into account large variations in shape.
Further, an automated training strategy for constructing prototypes and confusion matrices from true ISCII files was developed. This had to be quite distinct from its Roman counterpart because of the script composition involved in Devanagari.
This work was further expanded, incorporating a blackboard model for knowledge integration, in the Ph.D. thesis of Veena Bansal titled "Integrating Knowledge Sources in Devanagari Text Recognition".
Some work has also been carried out on on-line character recognition for Roman using handwriting modeling.
Investigations on on-line recognition of isolated Devanagari characters have also been carried out, and further investigations on the subject are in progress.
Some Relevant Publications
- R. Mahesh K. Sinha, A Journey from Indian Scripts Processing to Indian Language Processing, IEEE Annals of the History of Computing, Jan-March 2009. pp. 2-25.
- Gaurav Gupta, Shobhit Niranjan, Ankit Shrivastava and R.M.K. Sinha, Document layout analysis and classification and its application in OCR, IEEE International Workshop on the Electronic Document Management in an Enterprise Computing Environment (IEEE EDM 2006), October 2006, Hong Kong.
- Veena Bansal and R.M.K. Sinha, Partitioning and Searching Dictionary for Correction of Optically Read Devanagari Character Strings, Int. Jour. On Document Analysis and Recognition, Vol. 4, 2002 pp 269-280 (Presented at 5th International Conference on Document Analysis and Recognition 1999, Bangalore, India.)
- Veena Bansal and R.M.K. Sinha, Segmentation of touching and fused Devanagari characters, Pattern Recognition, Vol. 35, 2002, pp 875-893.
- Veena Bansal and R. M. K. Sinha, A Complete OCR for Printed Hindi Text in Devanagari Script, Sixth International Conference on Document Analysis and Recognition, IEEE publication, Seattle, USA, 2001.
- Veena Bansal and RMK Sinha, A Devanagari OCR and A Brief Overview of OCR Research for Indian Scripts, Proc. Symposium on Translation Support Systems (STRANS2001), February 15-17, 2001, Kanpur, India.
- Veena Bansal and R.M.K. Sinha, Integrating Knowledge Sources in Devanagari Text Recognition System, IEEE Transactions on Systems, Man and Cybernetics, Vol. 30, No. 4, 2000.
- Scott D. Connell, R.M.K. Sinha and Anil K. Jain, Recognition of Unconstrained On-Line Devanagari Characters, International Conference on Pattern Recognition, (ICPR2000), Sept 3-8, 2000, Barcelona, Spain.
- Veena Bansal and R.M.K. Sinha, On how to Describe Shapes of Devanagari Characters and Use Them for Recognition, 5th International Conference on Document Analysis and Recognition(ICDAR ‘99), 1999, Bangalore, India.
- Veena Bansal and R.M.K. Sinha, Segmentation of Touching characters in Devanagari, Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP’98), pp. 371 - 376, December 21-23, New Delhi, 1998.
- Veena Bansal and R.M.K. Sinha, ‘On Integrating Diverse Knowledge Sources in Optical Reading of Devanagari Script’ International Conference on Information Systems Analysis and Synthesis (ISAS’96), Orlando, 1996.
- Veena Bansal and R.M.K. Sinha, ‘Designing a Front End OCR System for Machine Translation - A Case Study for Devanagari’, Symposium for Machine Aids for Translation and Communication (SMATAC96), New Delhi 1996.
- R.M.K. Sinha and V. Bansal, ‘On Devanagari Document Processing’, 1995 IEEE International Conference on Systems, Man and Cybernetics, Vancouver, Canada, 1995, pp 1621-1626.
- R.M.K. Sinha, B. Prasada, G. Houle and M. Sabourin, ‘Hybrid Contextual Text Recognition with String Matching’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 9, September 1993, pp. 915-925.
- Bruno Simard, Birendra Prasada and R.M.K. Sinha, ‘On-line character recognition using hand-writing modeling’, Pattern Recognition, Vol. 26, No. 7, 1993, pp. 993-1007.
- R.M.K. Sinha, ‘On using syntactic constraints in text recognition’ Proc. Second International Conference on Document Analysis and Recognition, Tsukuba Science City, Japan, 1993, pp 858-861.
- R.M.K. Sinha, ‘On partitioning dictionary for visual text recognition’, Pattern Recognition, Vol. 23, No.5, 1990, pp 497-500.
- Subodh Harmalkar and R.M.K. Sinha, ‘Integrating word level knowledge in text recognition’, 10th International Conference on Pattern Recognition, Atlantic City, NJ, June 17-21, 1990.
- R.M.K. Sinha, and H.C. Karnick, ‘PLANG based specification of patterns with variations for pictorial data bases’, Computer Vision, Graphics, and Image Processing, Vol 43, 1988, pp. 98-11.
- R.M.K. Sinha and Birendra Prasada, ‘Visual text recognition through contextual processing’, Pattern Recognition, Vol. 21, No.5, 1988, pp 463-479.
- R.M.K. Sinha, ‘Some characteristic curves for dictionary organization with digital search’, IEEE Trans. on Systems, Man and Cybernetics, Vol. SMC-17, No. 3, 1987, pp. 520-527.
- R.M.K. Sinha, ‘Rule based contextual post-processing for Devanagari text recognition’, Pattern Recognition, Vol. 20, No. 5, 1987, pp. 475-485.
- R.M.K. Sinha, ‘Role of Context in Devanagari Script Recognition’, Jour. of Inst. Electron. & Telecom. Engrs., Vol. 33, No. 3, 1987, pp. 86-91.
- R.M.K. Sinha, ‘A width-independent algorithm for character skeleton estimation’, Computer Vision, Graphics, and Image Processing, Vol. 40, 1987, pp. 388-397.
- R.M.K. Sinha, Comments on ‘Fast thinning algorithm for binary images’, Image and Vision Computing, vol.4, 1986, pp 57-58.
- R.M.K. Sinha, ‘PLANG - a picture language schema for a class of pictures’, Pattern Recognition, vol. 16, 1983, pp 373-383.
- R.M.K. Sinha, ‘A knowledge based script reader’, Seventh International Conference on Pattern Recognition Montreal. 1984, pp 763-765.
- R.M.K. Sinha, ‘Primitive recognition and skeletonization via labeling’, International Conference on Systems, Man and Cybernetics, Halifax, Canada, 1984, pp 272-279.
- R.M.K. Sinha, ‘A parallel architecture for recognition of pictorial patterns’ IEEE International Conf. on Computers, Systems and Signal processing, Bangalore, 1984, pp 1523-1526.
- H. Karnick and R.M.K. Sinha, ‘A representational framework for recognizing patterns with variations’, IEEE International Conf. on Computers, Systems and Signal Processing, Bangalore, 1984, pp. 19-22.
- S.S. Marwah, S.K. Mullick and R.M.K. Sinha, ‘Recognition of Devanagari characters using hierarchical binary decision tree classifier’, International Conference on Systems, Man and Cybernetics, Halifax, Canada, 1984, pp. 414-420.
- R.M.K. Sinha, ‘Methodology for computer recognition of Devanagari script’, IEEE-SMC International conference, Delhi-Bombay, Dec.30,1983 - Jan.7,1984, pp 1220-1224.
- R.M.K. Sinha and H.N. Mahabala, ‘Machine recognition of Devanagari script’, IEEE Trans. on Systems, Man and Cybernetics, 1979, pp 435-441.
- R.M.K. Sinha and H.N. Mahabala, ‘Towards design of a natural picture description language’, IEEE Conf. on Pattern Recognition and Image Processing, Chicago, Ill., USA, 1978, pp 416-420.
- R.M.K.Sinha, ‘Primitive recognition via labeling schemata’, Annual convention of Computer Society of India,1975.
- R.M.K.Sinha and H.N.Mahabala,‘On design of a syntactic pattern analysis system’, Annual convention of Computer Society of India,1975.
- R.M.K. Sinha et al., ‘DEVADRISHTI: A Devanagari text reader - version I’, Technical Report TRCS-93-181, Department of Computer Science and Engineering, IIT Kanpur, 1993.
Ph.D. theses supervised:
- Harish C. Karnick, ‘On Learning Recognizing Patterns with Natural Variations’.
- Veena Bansal, ‘Role of Knowledge in Document Recognition - A case study for Devanagari Script’.
Vision Course Projects
Titles of some of the projects supervised:
- Object Recognition using shape descriptors
- Object Recognition using neural network
- Robot Motion planning in static unknown environment
- Human Face Recognition
- Age-Invariant Face Identification
- Cartoon Character Identification
- Hand Gesture Recognition
- Skew correction
- Handwritten Form-fill
Biometrics Course Projects
Titles of some of the projects supervised:
- Signature Verification
- Fingerprint Verification
- Iris Recognition
- Age-Invariant Face Identification
- Cursor/keystroke dynamics Biometrics
Our work on machine translation started in the early eighties, when we proposed using Sanskrit as an interlingua for translation to and from Indian languages (see the paper on "Computer processing of Indian languages and scripts - Potentialities and Problems", Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984).
This was further elaborated in the CPAL-1 paper presented at Bangkok in 1989.
Later in 1991, the concept of a Pseudo-Interlingua was developed which exploited structural commonality of a group of languages. This concept has been used in development of machine-aided translation methodology named ANGLABHARTI for translation from English to Indian languages.
Anglabharti is a pattern-directed, rule-based system with a context-free-grammar-like structure for English (the source language).
It generates a 'pseudo-target' (Pseudo-Interlingua) applicable to a group of Indian languages (the target languages), such as the Indo-Aryan family (Hindi, Bangla, Asamiya, Punjabi, Marathi, Oriya, Gujarati etc.), the Dravidian family (Tamil, Telugu, Kannada and Malayalam) and others.
A set of rules obtained through corpus analysis is used to identify plausible constituents, with respect to which movement rules for the 'pseudo-target' are constructed.
Within each group the languages exhibit a high degree of structural homogeneity.
We exploit the similarity to a great extent in our system.
A language specific text-generator converts the 'pseudo-target' code into target language text.
The Paninian framework, based on Sanskrit grammar and using Karak (similar to case) relationships, provides a uniform way of designing the Indian language text generators.
We also use an example-base to identify noun and verb phrasals and resolve their semantics.
An attempt is made to resolve most of the ambiguities using ontology, syntactic & semantic tags and some pragmatic rules. The unresolved ambiguities are left for human post-editing.
The major design considerations for Anglabharti have been: to provide a practical aid for translation, in which the aim is to get 90% of the task done by the machine with 10% left to human post-editing; a system that can grow incrementally to handle more complex situations; a uniform mechanism for translation from English to the majority of Indian languages through attachment of appropriate text generator modules; and a human-engineered man-machine interface to facilitate both usage and augmentation.
The translation system has also been interfaced with a text-to-speech module and OCR input.
This project also received funding from the TDIL programme of the Govt. of India during 1995-97 and from 2000 onwards.
The English to Hindi version of the Anglabharti machine-aided translation system, named AnglaHindi, has been web-enabled and is available at http://anglahindi.iitk.ac.in (withdrawn temporarily).
The technical know-how of this technology has been transferred on a non-exclusive basis to ER&DCI/CDAC Noida for commercialization.
The AnglaBharti technology has also been transferred to eight different organizations under the AnglaBharti Mission for development of MAT systems from English to different Indian languages, catering to 12 regional languages of the country. Under this mission, IIT Mumbai will be working on Marathi and Konkani and will be developing AnglaMarathi and AnglaKonkani; IIT Guwahati will be working on Asamiya and Manipuri and will be developing AnglaAsamiya and AnglaManipuri; CDAC Kolkata will be working on Bangla and will be developing AnglaBangala; CDAC (GIST group) Pune will be working on Urdu, Sindhi and Kashmiri and will be developing AnglaUrdu, AnglaSindhi and AnglaKashmiri; CDAC Thiruvananthapuram will be working on Malayalam and will be developing AnglaMalayalam; TIET Patiala will be working on Punjabi and will be developing AnglaPunjabi; JNU New Delhi will be working on Sanskrit and will be developing AnglaSanskrit; and Utkal University Bhuvaneshwar will be working on Oriya and will be developing AnglaOriya.
In 1995, we developed another approach for MT which was example-based. Here the pre-stored examples form the basis for translation. The translation is obtained by matching the input sentence with the minimum 'distance' example sentence.
In our approach, we do not store the examples in the raw form. The examples are abstracted to contain the category/class information to a great extent.
This makes the example-base smaller in size and further partitioning reduces the search space.
The creation and growth of the example-base is also done in an interactive way.
This methodology, named ANUBHARTI, has been used for Hindi to English translation.
The Anubharti approach works more efficiently for similar languages, such as translation among Indian languages. In such cases the word order remains the same and one need not have pointers to establish correspondences.
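The retrieval step of example-based translation over abstracted examples can be pictured with a small sketch. Everything in it is hypothetical and heavily simplified: the category names, the romanized toy lexicon, the distance costs and the target-side templates are illustrative assumptions, not the ANUBHARTI example-base.

```python
def category_distance(a, b):
    """Edit distance over category sequences; related categories are closer."""
    def sub(x, y):
        if x == y:
            return 0.0
        # e.g. NOUN_FOOD vs NOUN_OBJECT share the coarse class NOUN
        return 0.5 if x.split("_")[0] == y.split("_")[0] else 1.0

    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0, prev[j - 1] + sub(x, y)))
        prev = cur
    return prev[-1]


# Toy lexicon mapping romanized Hindi words to categories (hypothetical).
TOY_LEXICON = {
    "raam": "NOUN_HUMAN", "siitaa": "NOUN_HUMAN",
    "aam": "NOUN_FOOD", "kitaab": "NOUN_OBJECT",
    "khaataa": "VERB", "paDhtii": "VERB", "hai": "AUX",
}

# Abstracted examples: a category pattern and a target-side ordering template.
EXAMPLE_BASE = [
    (("NOUN_HUMAN", "NOUN_FOOD", "VERB", "AUX"), "SUBJ VERB OBJ"),
    (("NOUN_HUMAN", "VERB", "AUX"), "SUBJ VERB"),
]


def nearest_example(sentence):
    """Abstract the input to categories and return the closest stored example."""
    cats = tuple(TOY_LEXICON[w] for w in sentence.split())
    return min(EXAMPLE_BASE, key=lambda ex: category_distance(cats, ex[0]))


print(nearest_example("siitaa kitaab paDhtii hai"))
```

Storing category patterns instead of raw sentences lets one example cover many inputs, which is why the abstracted example-base stays small.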
Both of these system architectures, AnglaBharti and AnuBharti, have changed considerably from their initial conceptualization. In 2004, phase II of system development was launched, which addresses many of the shortcomings of the earlier architectures. The resulting systems are named AnglaBharti-II and AnuBharti-II. While AnglaBharti-II is primarily a rule-based system and AnuBharti-II uses EBMT as the basic paradigm for translation, both systems are hybridized, combining different paradigms to varying degrees.
AnglaBharti-II uses a generalized example-base (GEB) for hybridization, in addition to a raw example-base (REB). During the development phase, when modifying the rule-base proves difficult and may produce unpredictable results, the example-base is grown interactively by augmenting it. At the time of actual usage, the system first attempts a match in the REB and GEB before invoking the rule-base.
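The lookup order described above (REB, then GEB, then the rule-base) can be sketched as a simple cascade. The data structures, the pattern syntax and the placeholder rule-base below are hypothetical stand-ins chosen for illustration; they do not reflect the actual AnglaBharti-II implementation.

```python
import re

# Raw example-base: exact sentences with stored translations (toy entry).
REB = {"thank you": "\u0927\u0928\u094d\u092f\u0935\u093e\u0926"}

# Generalized example-base: patterns with slots and target templates (toy entry).
GEB = [
    (re.compile(r"^i am (?P<name>\w+)$", re.IGNORECASE),
     "\u092e\u0948\u0902 {name} \u0939\u0942\u0901"),
]


def rule_based(sentence):
    """Placeholder for the rule-base; a real system would analyse and generate."""
    return f"<rule-base output for: {sentence}>"


def translate(sentence):
    """Try REB, then GEB, and fall back to the rule-base."""
    text = sentence.strip()
    if text.lower() in REB:
        return REB[text.lower()], "REB"
    for pattern, template in GEB:
        match = pattern.match(text)
        if match:
            return template.format(**match.groupdict()), "GEB"
    return rule_based(text), "rule-base"


for s in ["Thank you", "I am Ravi", "The book is on the table"]:
    print(translate(s)[1])
```

Placing the example-bases first means that a problematic sentence can be handled by adding an example, without touching rules whose interactions are hard to predict.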
In AnglaBharti-II, we have made provision for automated pre-editing & paraphrasing, generalized & conditional multi-word expressions, recognition of named-entities, domain customization tools and incorporated an error-analysis module and statistical language-model for automated post-editing. The purpose of automatic pre-editing module is to transform/paraphrase the input sentence to a form which is more easily translatable. Automated pre-editing may even fragment an input sentence if the fragments are easily translatable and positioned in the final translation Such fragmentation may be triggered by in case of a failure of translation by the 'failure analysis' module. The failure analysis consists of heuristics on speculating what might have gone wrong. The entire system is pipelined with various sub-modules. All these have contributed significantly to greater accuracy and robustness to the system.
AnuBharti system is designed to translate Hindi to English and other languages. It started with attempt to design MAT for translation from Hindi to English using EBMT paradigm. The strategy has now been generalized in AnuBharti-II to cater to Hindi as source language for translation to any other language, though the generalization of the example-base is dependent upon the target language. The core of AnuBharti-II architecture is a generalized hierarchical example-base. In absence of availability of large parallel corpora, the example-base is augmented interactively during the development phase. Development of such an example-base for Hindi to other Indian languages is lot easier as compared to a dissimilar language. Hindi like all other Indian languages is a relatively free word-group order language. The example-base size grows enormously if all variations are incorporated into it.
The input Hindi sentence is converted into a standardized form to take care of word-order variations. This requires a shallow grammatical analysis of Hindi. This makes the paradigm used in AnuBharti-II a hybrid paradigm. The standardized Hindi sentence is matched with a top level standardized example-base. In case no match is found then a shallow chunker is used to fragment the input sentence into units that are then matched with a hierarchical example-base. The translated chunks are positioned by matching with sentence level example base. Analysis of Hindi sentence is rule and heuristic based and is primarily used for deriving standardized form and for shallow chunking.
The boundary friction problem of composing chunk translations is handled by the sentence-level example-base with chunks. Here the chunk properties are used for distance computation, and the alternative chunk translation that yields the minimum distance is selected. An error-analysis module and a statistical language model along lines similar to AnglaBharti-II are also being incorporated. Human post-editing is performed primarily to introduce determiners, which are either not present or difficult to estimate in Hindi.
Besides these, we are currently developing a translation system for bilingual text in Hinglish (Hindi mixed with English) and a system for speech-to-speech translation.
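A minimal Python sketch of minimum-distance selection among alternative chunk translations is given below. The distance used here, a count of mismatched properties, is an assumption chosen for simplicity, as are the property names; the source does not specify the actual distance measure.

```python
from typing import Dict, List, Tuple

Properties = Dict[str, str]


def property_distance(a: Properties, b: Properties) -> int:
    """Count the properties on which two chunks disagree (assumed measure)."""
    keys = set(a) | set(b)
    return sum(1 for k in keys if a.get(k) != b.get(k))


def pick_translation(
    slot: Properties,
    alternatives: List[Tuple[str, Properties]],
) -> str:
    """Return the alternative whose properties are closest to the slot's."""
    best_text, _ = min(alternatives, key=lambda alt: property_distance(slot, alt[1]))
    return best_text


# Hypothetical slot from a sentence-level example and two candidate translations.
slot = {"number": "plural", "case": "oblique"}
alternatives = [
    ("<chunk translation A>", {"number": "singular", "case": "oblique"}),
    ("<chunk translation B>", {"number": "plural", "case": "oblique"}),
]
print(pick_translation(slot, alternatives))
```

If two alternatives tie, `min` keeps the first one listed, so the order of alternatives acts as a tie-breaker in this sketch.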
Some Relevant Publications
- R. Mahesh K. Sinha, A Journey from Indian Scripts Processing to Indian Language Processing, IEEE Annals of the History of Computing, Jan-March 2009. pp. 2-25.
- R. Mahesh K. Sinha and Anil Thakur, A Study of the Translation Divergence in English and Hindi MT, CSI Journal, to appear.
- Pawan Goyal and R. Mahesh K. Sinha, Divergence, Third Int'l Sanskrit Computational Linguistics Symposium, Jan , 2009, Hyderabad.
- Pawan Goyal and R. Mahesh K. Sinha, A Study towards Design of an English to Sanskrit Machine Translation System, Second Int'l Sanskrit Computational Linguistics Symposium, May 15-17, 2008, Brown University, Providence.
- R. M. K. Sinha, V. N. Shukla and S. S. Agrawal, A Framework for Integrating ASR into a Machine Translation System, Workshop on Technologies and Corpora for Asia-Pacific Speech Translation, 3rd IJCNLP, Hyderabad, Jan 11, 2008.
- R. Mahesh K. Sinha, A hybridized EBMT system for Hindi to English Translation, CSI Journal, volume 37 no. 4, 2007, pp.3-9.
- R.M.K. Sinha and Anil Thakur, Disambiguation of 'kyaa' in Hindi for Hindi to English machine translation, Indian Linguistics journal, Vol. 68 (1-2), 2007, pp. 59-70. {First presented at Sixth International Conference of South Asian Languages (ICOSAL-6), Hyderabad, INDIA, 6-8 January 2005}.
- R. Mahesh K. Sinha, Using rich morphology in resolving certain Hindi-English machine translation divergence, MT Summit XI, 10-14 September 2007, Copenhagen, Denmark, Proceedings, pp. 429-433.
- Mrityunjay Gautam and R.M.K. Sinha, A Hybrid Approach to Sentence Alignment Using Genetic Algorithm, Proceedings of International Conference on Computing: Theory and Applications (ICCTA 2007), IEEE Computer Society Press, March 2007, pp 480-484.
- R.M.K. Sinha, Designing Multi-lingual Machine-Translation System: Some Perspectives, International Workshop on Intelligent Linguistic Technologies (ILINTEC'07), Proceedings of International Conference on Machine Learning: Models, Technologies & Applications (MLMTA 2007), June 25-28, 2007, Las Vegas, pp. 244-249.
- R.M.K. Sinha, On Design Of A Question-Answering Interface For Hindi In A Restricted Domain, ICAI'06 - The 2006 International Conference on Artificial Intelligence, Las Vegas, Nevada, USA, June 26-29, 2006.
- R.M.K. Sinha and A. Thakur, On Translation Of Interrogative Sentences From Hindi To English, MLMTA'06 - The 2006 International Conference on Machine Learning; Models, Technologies and Applications, Las Vegas, Nevada, USA, June 26-29, 2006.
- Gaurav Gupta, Shobhit Niranjan, Ankit Shrivastava and R.M.K. Sinha, Document layout analysis and classification and its application in OCR, IEEE International Workshop on the Electronic Document Management in an Enterprise Computing Environment (IEEE EDM 2006), October 2006, Hong Kong.
- R.M.K. Sinha and Anil Thakur, Machine Translation of Bi-lingual Hindi-English (Hinglish) Text, 10th Machine Translation summit (MT Summit X), Phuket, Thailand, September 13-15, 2005.
- R.M.K. Sinha and Anil Thakur, Dealing with Replicative Words in Hindi for Machine Translation to English, 10th Machine Translation summit (MT Summit X), Phuket, Thailand, September 13-15, 2005.
- R.M.K. Sinha and Anil Thakur, Divergence Patterns in Machine Translation between Hindi and English, 10th Machine Translation summit (MT Summit X), Phuket, Thailand, September 13-15, 2005.
- R.M.K. Sinha and Anil Thakur, Handling ki in Hindi for Hindi-English MT, 10th Machine Translation summit (MT Summit X), Phuket, Thailand, September 13-15, 2005.
- R.M.K. Sinha, Interpreting Unknown Lexicons in Machine Translation from Hindi to English, 4th IASTED International Conference on Computational Intelligence(CI 2005), Calgary, Alberta, Canada, July 4-6, 2005.
- R.M.K. Sinha, Dealing with Mixing of English Verbs in Hindi for Machine Translation, ICAI'05 - The 2005 International Conference on Artificial Intelligence, Las Vegas, Nevada, USA, June 27-30, 2005.
- R.M.K. Sinha, Integrating CAT and MT in AnglaBharti-II Architecture, EAMT 2005, 30th-31st May, 2005, Budapest, Hungary.
- R.M.K. Sinha and Anil Thakur, Translation Divergence in English-Hindi MT, EAMT 2005, 30th-31st May, 2005, Budapest, Hungary.
- R.M.K. Sinha and Anil Thakur, Disambiguation of kyaa in Hindi for Hindi to English machine translation, Sixth International Conference of South Asian Languages (ICOSAL-6), Hyderabad, INDIA, 6-8 January 2005.
- R.M.K. Sinha, An Engineering Perspective of Machine Translation: AnglaBharti-II and AnuBharti-II Architectures, Invited Paper, Proceedings of International Symposium on Machine Translation, NLP and Translation Support System (iSTRANS-2004), November 17-19, 2004, Tata McGraw Hill, New Delhi.
- Anil Thakur and R.M.K. Sinha, Rules for Determining the Head of the Relative Clause Constructions in Hindi for Machine Translation from Hindi to English, 26th All India Conference of Linguists (26th AICL), Nov 29-Dec 1, 2004, Shillong.
- R.M.K. Sinha and Anil Thakur, Pre-/post-positions Selection in Text Generation for Hindi and other Indian Languages for Translation from English, Proceedings of International Symposium on Machine Translation, NLP and Translation Support System (iSTRANS-2004), November 17-19, 2004, Tata McGraw Hill, New Delhi, pp. 40-45.
- R.M.K. Sinha and Anil Thakur, Synthesizing Verb Form in English to Hindi Translation: Case of Mapping Infinitive and Gerund in English to Hindi, Proceedings of International Symposium on Machine Translation, NLP and Translation Support System (iSTRANS-2004), November 17-19, 2004, Tata McGraw Hill, New Delhi, pp. 52-55.
- R.M.K. Sinha and Anil Thakur, Disambiguation and Mapping Strategies for Adverbial Chunks for Machine Translation, Proceedings of International Symposium on Machine Translation, NLP and Translation Support System (iSTRANS-2004), November 17-19, 2004, Tata McGraw Hill, New Delhi, pp. 95-101.
- R.M.K. Sinha and Anil Thakur, Multi-word Expressions in English and Hindi: Problems in Contextualization, Proceedings of International Symposium on Machine Translation, NLP and Translation Support System (iSTRANS-2004), November 17-19, 2004, Tata McGraw Hill, New Delhi, pp. 111-116.
- R.M.K. Sinha and Anil Thakur, Identification of Subject and Object NPs in Hindi, Proceedings of International Symposium on Machine Translation, NLP and Translation Support System (iSTRANS-2004), November 17-19, 2004, Tata McGraw Hill, New Delhi, pp. 166-171.
- R.M.K. Sinha and Anil Thakur, Definiteness Marking Strategies in Hindi and their Mapping to English for Machine Translation, Proceedings of International Symposium on Machine Translation, NLP and Translation Support System (iSTRANS-2004), November 17-19, 2004, Tata McGraw Hill, New Delhi, pp. 178-181.
- R.M.K. Sinha and Anil Thakur, Syntax and Semantics of 'kaa' in Hindi, Proceedings of International Symposium on Machine Translation, NLP and Translation Support System (iSTRANS-2004), November 17-19, 2004, Tata McGraw Hill, New Delhi, pp. 226-229.
- R.M.K. Sinha, Translating News Headings from English to Hindi, 6th IASTED International Conference on Artificial Intelligence and Soft Computing (ASC2002), Banff, Canada, July 17-19, 2002.
- R.M.K. Sinha, Towards Speech to Speech Translation, Key-note presentation at Symposium on Translation Support Systems (STRANS2002), March 15-17, 2002, Kanpur, India.
- Vartika Bhandari, R.M.K. Sinha and Ajai Jain, Disambiguation of Phrasal Verb Occurrence for Machine Translation, Proc. Symposium on Translation Support Systems (STRANS2002), March 15-17, 2002, Kanpur, India.
- Ajai Jain, R.M.K. Sinha and Renu Jain, On Translating Unconstrained Text, Proc. Symposium on Translation Support Systems (STRANS2002), March 15-17, 2002, Kanpur, India.
- R.M.K. Sinha, Multilinguality and Global Digital Divide, Joint IAMCR/ICA International Symposium on the Digital Divide, November 16-17, 2001, Austin, USA.
- R.M.K. Sinha, Dealing with Unknown Lexicons in Machine Translation from English to Hindi, Proc. of IASTED International Conference on Artificial Intelligence and Soft Computing, May 21-24, 2001, Cancun, Mexico, pp 333-336.
- R.M.K. Sinha, Renu Jain and Ajai Jain, Translation from English to Indian Languages: ANGLABHARTI Approach, Proc. Symposium on Translation Support Systems (STRANS2001), February 15-17, 2001, Kanpur, India.
- Renu Jain, R.M.K. Sinha and Ajai Jain, ANUBHARTI: Using Hybrid Example-Based Approach for Machine Translation, Proc. Symposium on Translation Support Systems (STRANS2001), February 15-17, 2001, Kanpur, India.
- R.M.K. Sinha, Hybridizing Rule-Based and Example-Based Approaches in Machine Aided Translation System, 2000 International Conference on Artificial Intelligence (IC-AI’2000), June 26-29, 2000, Las Vegas, USA.
- R. Jain, R.M.K. Sinha and A. Jain, Translation between English and Indian Languages, Journal of Computer Science and Informatics, March 1997, pp 19-25.
- R.M.K. Sinha, Machine Translation, Key-Note Plenary Address at the International Multi-Conference on Systematics, Cybernetics and Informatics (SCI’97) at Caracas, Venezuela, July 7-12, 1997.
- Renu Jain and R.M.K. Sinha, ‘Machine Translation using Examples for Similar and Dissimilar Languages’, International Conference on Information Systems Analysis and Synthesis (ISAS’96), Orlando, 1996.
- R.M.K. Sinha, ‘R & D on Machine Aided Translation at IIT Kanpur: ANGLABHARTI and ANUBHARTI Approaches’, Invited paper at Convention of Computer Society of India, (CSI’96), Bangalore, 1996.
- R.M.K. Sinha, ‘Strategies for Machine Translation for Application in Research, Education and Science Popularization’ Invited paper at INSA National Expository Workshop on Information and Communication Technology (INSA NEW-ICTE’96), New Delhi, 1996.
- R.M.K. Sinha and Ajai Jain, ‘Relevance and Strategies of Machine Translation in Global Environment and an Integrated Approach to MT in Indian Context’ Theme paper at Symposium for Machine Aids for Translation and Communication (SMATAC96), New Delhi 1996.
- Renu Jain, R.M.K. Sinha and others, ‘Some Experiences in Development of ANGLABHARTI and ANUBHARTI Systems’, Symposium for Machine Aids for Translation and Communication (SMATAC96), New Delhi 1996.
- Renu Jain and R.M.K. Sinha, ‘On Multi-lingual Dictionary Design’, Symposium for Machine Aids for Translation and Communication (SMATAC96), New Delhi 1996.
- R.M.K. Sinha and others, ‘ANGLABHARTI: A Multi-lingual Machine Aided Translation Project on Translation from English to Hindi’, 1995 IEEE International Conference on Systems, Man and Cybernetics, Vancouver, Canada, 1995, pp 1609-1614.
- Renu Jain, R.M.K. Sinha and A. Jain, ‘Role of Examples in Machine Translation’ 1995 IEEE International Conference on Systems, Man and Cybernetics, Vancouver, Canada, 1995, pp 1615-1620.
- R.M.K. Sinha, R. Srivastava and A. Agrawal, ‘Designing Hindi Text Generator for Machine Translation’ SNLP’95 - Symposium on Natural Language Processing, Bangkok, Thailand, 1995, pp 286-296.
- Renu Jain, R.M.K. Sinha, A. Jain and R. Srivastava, ‘HFSM: A Finite State Machine for Analyzing Hindi Sentences’ SNLP’95 - Symposium on Natural Language Processing, Bangkok, Thailand, 1995, pp 317-324.
- Renu Jain, R.M.K. Sinha and A. Jain, ‘A Pattern Directed Hybrid Approach to Machine Translation through Examples’ SNLP’95 - Symposium on Natural Language Processing, Bangkok, Thailand, 1995, pp 325-335.
- R.M.K. Sinha, ‘Machine Translation: The Indian Context’, Invited paper at the International Conference on Applications of Information Technology in South Asian Languages, AKSHARA’94, New Delhi 1994, pp 275-284.
- R.M.K. Sinha, ‘Correcting ill-formed Hindi sentences in machine translated output’ Proceedings of Natural Language Processing Pacific Rim Symposium (NLPRS’93), Fukuoka, Japan, 1993, pp 109-119.
- R.M.K. Sinha, ‘A Sanskrit based Word-expert model for machine translation among Indian languages’, Proc. of Workshop on Computer Processing of Asian Languages, Asian Institute of Technology, Bangkok, Thailand, Sept. 26-28, 1989, pp. 82-91.
- R.M.K. Sinha, ‘CALP: Some Perspectives’, Symposium on Computer Aided Language Processing, New Delhi 1987.
- R.M.K. Sinha, ‘Computer processing of Indian languages and scripts - potentialities and problems’, Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 133-49.
- A.K. Bansal and R.M.K. Sinha, ‘Some aspects of pronoun disambiguation using real world knowledge’, Comp. Soc. of India 1984.
- R.M.K. Sinha and G.C. Pathak, ‘A heuristic based question answering system in natural Hindi’, IEEE-SMC International conference, Delhi-Bombay, Dec. 30, 1983-Jan. 7, 1984, pp 1009-13.
- R.M.K. Sinha, ‘Computers for Indian languages’, Annual convention of Computer Society of India (invited paper), 1982, pp 163-174.
- R.M.K. Sinha, ‘Computer processing of Indian languages’, Fourth International Conference on Computer in Humanities, Hanover, NH (USA), Aug. 19-22, 1979.
- R.M.K. Sinha, ‘Some thoughts on computer processing of natural Hindi’, Annual convention of Computer Society of India, 1978, pp 151-165.
- R.M.K. Sinha, K. Sivaraman, Aditi Agrawal, T. Suresh and C. Sanyal, ‘On logical design of multi-lingual lexicon for machine translation’, Technical Report TRCS-93-174, Department of Computer Science and Engineering, IIT Kanpur, 1993.
- R.M.K. Sinha and K. Sivaraman, ‘Ambiguity resolution in ANGLA-BHARTI’, Technical Report TRCS-93-175, Department of Computer Science and Engineering, IIT Kanpur, 1993.
- K. Sivaraman and R.M.K. Sinha, ‘On Tamil text generator’, Technical Report TRCS-93-176, Department of Computer Science and Engineering, IIT Kanpur, 1993.
- R.M.K. Sinha, Aditi Agrawal and C. Sanyal, ‘Morphological Analyzer’, Technical Report TRCS-93-177, Department of Computer Science and Engineering, IIT Kanpur, 1993.
- Aditi Agrawal and R.M.K. Sinha, ‘On Hindi text generator’, Technical Report TRCS-93-178, Department of Computer Science and Engineering, IIT Kanpur, 1993.
- T. Suresh and R.M.K. Sinha, ‘On Telugu text generator’, Technical Report TRCS-93-179, Department of Computer Science and Engineering, IIT Kanpur, 1993.
- T. Suresh and R.M.K. Sinha, ‘On Man-machine interface in ANGLA-BHARTI’, Technical Report TRCS-93-180, Department of Computer Science and Engineering, IIT Kanpur, 1993.
Ph.D. thesis supervision topics:
- Renu Jain, ‘HEBMT: A Hybrid Example-Based Approach for Machine Translation (Design and Implementation for Hindi to English)’.
Speech to Speech Translation
Speech-to-speech (S2S) translation requires a tight coupling of the automatic speech recognition (ASR) module, the MT module, and the target-language text-to-speech (TTS) module.
A mere interfacing of the ASR, MT and TTS modules does not yield an acceptable S2S translation. S2S requires an integration of these modules such that the hypotheses are cross-verified and appropriate parameters are generated.
In our environment, the system has to cater to bilingual (Hindi mixed with English) speech with commonly encountered Indian accent variations.
The MT module also needs to be a chunk translator with multiple translation engines. Our investigations are directed at domain-specific applications in the Indian environment.
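One simple way to picture cross-verification between ASR and MT is to re-rank the recognizer's candidate hypotheses using a confidence score from the translation side. The Python sketch below is an assumption-laden illustration, not the framework used in this project: the linear score combination, the `weight` parameter and the toy confidence values are all hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(
    nbest: List[Tuple[str, float]],
    mt_confidence: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Pick the ASR hypothesis with the best combined ASR and MT score."""

    def combined(hypothesis: Tuple[str, float]) -> float:
        text, asr_score = hypothesis
        # Assumed linear interpolation between the two module scores.
        return weight * asr_score + (1.0 - weight) * mt_confidence(text)

    best_text, _ = max(nbest, key=combined)
    return best_text


# Hypothetical n-best list (text, ASR score) and MT-side confidence values.
nbest = [("hypothesis a", 0.6), ("hypothesis b", 0.5)]
toy_confidence = {"hypothesis a": 0.1, "hypothesis b": 0.9}

print(rerank(nbest, lambda text: toy_confidence.get(text, 0.0)))
```

With `weight=1.0` the MT score is ignored and the result is the plain ASR best hypothesis, which corresponds to the mere interfacing of modules that the text above describes as insufficient.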
Some Relevant Publications
- R. M. K. Sinha, V. N. Shukla and S. S. Agrawal, A Framework for Integrating ASR into a Machine Translation System, Workshop on Technologies and Corpora for Asia-Pacific Speech Translation, 3rd IJCNLP, Hyderabad, Jan 11, 2008.
- R.M.K. Sinha, Towards Speech to Speech Translation, Key-note presentation at Symposium on Translation Support Systems (STRANS2002), March 15-17, 2002, Kanpur, India.
Lexical Knowledge-Base Development
The lexical knowledge base is the fuel for the translation engine. It contains various details for each word in the source language, such as its syntactic categories, possible senses, keys to disambiguate those senses, corresponding words in the target languages, and ontology and wordnet information and linkages.
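The kind of record such a knowledge base holds can be sketched as a small Python data structure. The field names, the keyword-overlap rule for sense selection and the placeholder target words are assumptions made for illustration; they do not reproduce the actual lexicon design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sense:
    gloss: str
    disambiguation_keys: List[str]       # context words that signal this sense
    target_words: Dict[str, str]         # target language code -> word
    ontology_node: Optional[str] = None  # hypothetical ontology linkage
    wordnet_id: Optional[str] = None     # hypothetical wordnet linkage


@dataclass
class LexicalEntry:
    word: str
    categories: List[str]                # syntactic categories
    senses: List[Sense] = field(default_factory=list)

    def choose_sense(self, context_words: List[str]) -> Optional[Sense]:
        """Pick the sense sharing the most keys with the context (assumed rule)."""
        context = set(context_words)
        return max(
            self.senses,
            key=lambda s: len(context & set(s.disambiguation_keys)),
            default=None,
        )


# Placeholder target words are used instead of real translations.
bank = LexicalEntry(
    word="bank",
    categories=["noun"],
    senses=[
        Sense("river edge", ["river", "water"], {"hi": "<hi:river-bank>"}),
        Sense("financial institution", ["money", "loan"], {"hi": "<hi:financial-bank>"}),
    ],
)
print(bank.choose_sense(["loan", "approved"]).target_words["hi"])
```

When no key matches the context, the first listed sense is returned, so the order of senses serves as a default preference in this sketch.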
We are also working towards the development of an Indian language wordnet named ShabdKalpTaru in association with Dr. Om Vikas and Dr. Pushpak Bhattacharyya.
Some Relevant Publications
- Renu Jain and R.M.K. Sinha, ‘On Multi-lingual Dictionary Design’, Symposium for Machine Aids for Translation and Communication (SMATAC96), New Delhi 1996.
- R.M.K. Sinha, K. Sivaraman, Aditi Agrawal, T. Suresh and C. Sanyal, ‘On logical design of multi-lingual lexicon for machine translation’, Technical Report TRCS-93-174, Department of Computer Science and Engineering, IIT Kanpur, 1993.
NLP Course Projects
Titles of some of the projects supervised:
- Dealing with named entities in Hindi
- Text summarization in Hindi
- Domain Identification in Hindi
- Multi-lingual Information Retrieval-Hindi-English
- Children story summarization
- English-Hindi Corpora Alignment
- English-Hindi SMT
- Learning Sense Disambiguation using neural network
- Learning Lexical Choice and User Preferences
- Phrasal Verb Disambiguation
- Hindi-English Code mixing
- Hindi chunk parsing
- Statistical dictionary
- Translation of News Headings
- Spoken Hindi digit recognition
- Parser Implementation