Authors:
James Philips and Nasseh Tabrizi
Affiliation:
Department of Computer Science, East Carolina University, Greenville, North Carolina, U.S.A.
Keyword(s):
Historical Document Processing, Archival Data, Handwriting Recognition, Optical Character Recognition, Digital Humanities.
Abstract:
Historical Document Processing (HDP) is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from computer vision, document analysis and recognition, natural language processing, and machine learning to convert images of ancient manuscripts and early printed texts into a digital format usable in data mining and information retrieval systems. As libraries and other cultural heritage institutions have scanned their historical document archives, the need to transcribe the full text from these collections has become acute. Because HDP encompasses multiple sub-domains of computer science, knowledge relevant to its purpose is scattered across numerous journals and conference proceedings. This paper surveys the major phases of HDP, discusses standard algorithms, tools, and datasets, and suggests directions for further research.