Historical Documents and Automatic Text Recognition

Historical Documents and Automatic Text Recognition, special issue of Journal of Data Mining and Digital Humanities (JDMDH)

With this special issue of the Journal of Data Mining and Digital Humanities (JDMDH), we wish to bring together in a single volume several experiments, projects and reflections related to automatic text recognition on historical documents.

Many projects now include automatic text acquisition in their data processing chain. Integrating this technology into increasingly powerful processing chains has automated tasks in ways that affect the role of the researcher in the textual production process. This new data-intensive practice makes it urgent to collect and harmonise the corpora needed to build training sets, and also to make them available for exploitation. This issue is an opportunity to propose articles combining philological and technical questions, in order to make a scientific assessment of the use of automatic text recognition for ancient documents: its results, its contributions and the new practices it induces in the process of editing and exploring texts. We hope that practical aspects will be examined on this occasion, along with the methodological challenges they raise and their impact on research data.

This special issue is the outcome of an event that took place at the Ecole Nationale des Chartes in Paris on June 23 and 24, 2022, which brought together scholars from various backgrounds to discuss the use of HTR and OCR in their research. During these two days, problems of engineering, machine learning and infrastructure were raised. Many technical subjects, such as segmentation or the development of models, were discussed in connection with philological questions. The talks covered a wide range of material: manuscripts, archives, epigraphic materials and other documents, sometimes in languages with their own specificities, such as Hebrew, languages of Vietnam such as Cham, or ancient Greek, from the 11th to the 20th century.

This call is open not only to participants in this event, but to anyone working with HTR or OCR.

To address these issues, we propose the following three axes:

  • Axis 1: Sources, constitution and sharing of training data
  • Axis 2: Machine learning
  • Axis 3: Feedback and data exploitation

This special issue aims to provide an overview of the use of HTR and OCR on historical documents at a time when their uses are multiplying and more and more research projects and cultural heritage institutions are taking an interest in them. Through the Journal of Data Mining and Digital Humanities, we are delighted to offer an opportunity to all those who wish to contribute to the field or to share their experience by presenting their successes, their questions and their difficulties, or even their failures. By publishing this special issue, we hope to present a state of the art of the uses of automatic handwriting recognition today.