A document is anything that conveys information. Traditionally,
documents meant paper-based written media or historical
palm-leaf and papyrus inscriptions. Today, documents are more
diverse, and many are electronic and exist entirely in digital
form. Document content is also no longer only text; it
includes photographs, drawings, tables and plots. Even video is
sometimes considered a document.
Document Analysis and Recognition (DAR) is a
specialised field concerned with designing and developing
algorithms and techniques that process and extract information
from documents using computers. Documents are typically input as
scanned images to computers. For a long time, DAR was
associated with Optical Character Recognition
(OCR), the task of extracting textual data in digital
form from scanned images of documents. Today, DAR has moved
beyond OCR and researchers are exploring higher levels of
abstraction as well as the inter-relationships between textual
and non-textual components.
In this lab, we investigate different aspects of documents,
with a special focus on Indian documents, which
are multilingual and culturally rich
in content. Indian documents pose a number of challenges
owing to the complexity of the scripts, the quality of the
underlying media and the printing processes. In
particular, the analysis and
recognition of handwritten documents is still in its infancy.
OCR Lab@SCIS has been one of the pioneers in DAR research in
India and is recognised throughout the Indian research
community for its work.
Research
- Deep Image Priors for Binarisation
- Table Question Answering Systems
- Visual Question Answering Systems
- Zero-shot Learning for Handwritten Character Recognition
- Telugu OCR System
- Saara: A Machine Translation System for Kannada
- Software Tools for Forensic Analysis of Documents
Funded Projects:
- Resource Centre for Indian Language Technology Solutions
(Telugu) (2001) - ₹98 L
- Development of Software Tools for Analysing Additions,
Deletions and Alterations in Documents (2002) - ₹20 L
- Development of Robust Document Analysis and Recognition
System for Printed Indian Scripts - Phase I (2007 - 2010) - ₹36 L
- Development of Robust Document Analysis and Recognition
System for Printed Indian Scripts - Phase II (2011 - 2015) - ₹78 L
This list is incomplete, pending updates to this webpage.