Digital Humanities research on pre-contemporary textual data is full of challenges. Automatically extracting knowledge from historical texts requires an interdisciplinary approach: understanding linguistic, cultural and historical phenomena is essential to solving the problems that arise when developing automatic systems.
In the 18th century, Portuguese texts did not yet follow a standardized spelling. For NLP, this is a problem: there are no known tools for 18th-century Portuguese able to deal with significant spelling variation, and this variation reduces the accuracy of results obtained with tools developed for contemporary Portuguese.
For human readers, spelling variation may make the text hard to access for non-linguists, who may find unusual word forms deriving from linguistic or orthographic phenomena, the dialectal region of the writer or, in texts transcribed from handwritten sources, possible mistakes by the transcriber.
This article shows how linguistic annotation and a rule-based approach can describe these phenomena more accurately and contribute to solving the problem of spelling variation, allowing better information retrieval, better processing and a wide range of possibilities for linking past to present.
We describe the development of an annotated corpus built from transcriptions of the handwritten Memórias Paroquiais (1758), a compilation of answers to an inquiry sent by the King of Portugal to all the parishes after the great earthquake in Lisbon. Memórias Paroquiais is a crucial document because, apart from its historical value, it contains answers from priests of different ages and literacy levels, from various regions, making this corpus an undeniably rich sample of 18th-century Portuguese.
We annotated a subcorpus in which 26.1% of tokens have at least one spelling variant relative to the contemporary form, and 5% of them have more than two.
From the corpus, we developed a set of rules that allow the automatic generation of a dictionary of standard and variant forms. The resulting variance dictionary can also serve as input for other tasks, such as named entity recognition, improving the accuracy of their results. The annotated corpus may also be used to train a machine learning model able to automatically identify and standardize variants, applicable to other texts of the same historical and linguistic period.
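As an illustration only, the idea of deriving a dictionary of standard and variant forms from rewrite rules might be sketched in Python as follows. The four rules, the function names `normalize` and `build_variant_dictionary`, and the sample tokens are all hypothetical: they are not the rules developed from the corpus, and the mapping is deliberately simplified (for example, it ignores accentuation).

```python
import re
from collections import defaultdict

# Hypothetical rewrite rules, for illustration only.
# They are NOT the rules derived from the annotated corpus.
RULES = [
    (re.compile(r"ph"), "f"),            # pharmacia -> farmacia
    (re.compile(r"th"), "t"),            # theatro   -> teatro
    (re.compile(r"y"), "i"),             # ley       -> lei
    (re.compile(r"([lnt])\1"), r"\1"),   # villa     -> vila
]


def normalize(token):
    """Apply every rule in order to a lowercased token."""
    form = token.lower()
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def build_variant_dictionary(tokens):
    """Map each normalized form to the sorted list of variants seen for it."""
    variants = defaultdict(set)
    for token in tokens:
        lowered = token.lower()
        standard = normalize(lowered)
        if standard != lowered:  # keep only tokens that a rule actually changed
            variants[standard].add(lowered)
    return {form: sorted(seen) for form, seen in variants.items()}


# Invented sample tokens, not taken from Memórias Paroquiais.
sample = ["pharmacia", "theatro", "villa", "Villa", "ley", "rio"]
dictionary = build_variant_dictionary(sample)
```

A dictionary of this shape could then be used as a lookup table before running tools built for contemporary Portuguese, under the assumption that each variant maps to a single standard form.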