Department of Theoretical and Applied Linguistics
A Historical Corpus of the Welsh Language
AHRB award no. RE11900
Principal investigator: David Willis
Research associate: Ingo Mittendorf
The project aims to produce a historical corpus of Welsh texts from the period 1500-1850 in an electronic format. Although similar corpora exist for such languages as English, French, German and Irish, there is a shortage of such resources in Welsh. This project will deal with the Early Modern Welsh period; a similar project to edit fourteenth-century Welsh manuscripts is currently in progress at the University of Wales.
The corpus will reflect the rich diversity of the texts attested for the period by including texts and samples of texts from different stylistic levels and of varying geographical provenance. The wide stylistic range of the texts will make it particularly useful for the study of language change. The project aims to make these texts accessible in a readily searchable format to researchers in Celtic studies and historical linguistics, allowing scholars to search a substantial body of texts for particular linguistic features or other points of literary or historical interest. Many of the texts proposed for inclusion are not available in modern editions or are available only in modernised form; the project will therefore also broaden access to the texts in question.
The entire corpus will be produced in a format that conforms to the standards of the Text Encoding Initiative (TEI), and it will be annotated within the TEI framework in a number of ways in order to increase its usefulness as a source of linguistic data. The project began in February 2001 and we aim to complete it in January 2004.
The corpus will be arranged into different groups of text types in order to represent the stylistic diversity of the Welsh language, while allowing for differences in the specific range of text types actually available at different periods. The text types under consideration are drama, personal letters, ballads, political (didactic) prose, scripture, historical narrative, narrative prose, and religious prose.
Although not all the text types are attested at all periods of the corpus, three, namely narrative prose, historical narrative and religious prose, are well represented in all periods covered by the corpus, and will therefore allow users to trace over time the development of texts occupying a similar stylistic level.
For each text, either a representative sample of approximately 15,000 words, or the entire text if this is less than 15,000 words, will be included. In design, the corpus will broadly follow the conventions established in the Helsinki Corpus of Historical English texts. In total, we aim to produce a corpus of approximately 500,000 words, making use of around 35 texts. The corpus will contain both texts attested in manuscript only and printed texts.
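The sampling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name select_sample, the choice of a single contiguous excerpt, and the start parameter are hypothetical, since the text does not say how a representative sample is chosen.

```python
SAMPLE_LIMIT = 15_000  # per-text word limit stated in the paragraph above


def select_sample(words, limit=SAMPLE_LIMIT, start=0):
    """Return the whole text if it fits within the limit, else an excerpt.

    Assumption: the excerpt is one contiguous run of `limit` words beginning
    at `start`; the project may choose its samples differently.
    """
    if len(words) <= limit:
        return list(words)
    # Clamp the start so the excerpt always contains exactly `limit` words.
    start = max(0, min(start, len(words) - limit))
    return list(words[start:start + limit])


# Placeholder word lists standing in for a short and a long text.
short_text = ["gair"] * 4_000
long_text = ["gair"] * 40_000

print(len(select_sample(short_text)))               # whole text kept
print(len(select_sample(long_text, start=10_000)))  # capped at the limit
print(35 * SAMPLE_LIMIT)                            # upper bound for 35 texts
```

With 35 texts capped at 15,000 words each, the upper bound is 525,000 words, which is consistent with the target of approximately 500,000 words given that some texts are shorter than the limit.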
A particularly crucial period is the sixteenth and seventeenth centuries. Although some dialectal variation has been demonstrated for Middle Welsh, this is the first period in which large-scale dialect variation is evident, and it is therefore crucial for the study of the emergence of the modern Welsh dialects in their current form and of the formation of the standard language. Nevertheless, there is a particular lack of reliable editions of texts from this period. The project will therefore make a particular contribution to widening access to texts from this period.
The focus on variation naturally brings with it problems of diversity. For research on linguistic change it is necessary that the corpus should fully represent the spelling and graphical variation of the original texts. However, linguistic change and orthographic variation make it difficult to be certain of finding every instance of a given linguistic form. Furthermore, in the case of manuscript texts, the editor is faced with familiar editorial issues. Although, for practical reasons, large-scale linguistic tagging of the corpus will not be possible, we are considering the best ways to help users to find the information that they require.
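One way a user might cope with spelling variation when searching an untagged corpus is to expand a search form into a pattern that accepts alternative spellings. The following is a minimal sketch of that idea, not a description of the project's own search tools. The table VARIANTS, the function variant_pattern, and the test strings are all hypothetical and chosen only for illustration.

```python
import re

# Illustrative grapheme alternations only (an assumption for this sketch);
# this is not the project's list of spelling correspondences.
VARIANTS = {
    "dd": ["dd", "dh", "d"],
    "f": ["f", "v", "u"],
    "c": ["c", "k"],
}


def variant_pattern(form):
    """Build a regex matching `form` under the alternations in VARIANTS."""
    keys = sorted(VARIANTS, key=len, reverse=True)  # try digraphs first
    parts = []
    i = 0
    while i < len(form):
        for key in keys:
            if form.startswith(key, i):
                alternatives = "|".join(re.escape(v) for v in VARIANTS[key])
                parts.append(f"(?:{alternatives})")
                i += len(key)
                break
        else:
            # No alternation known for this character: match it literally.
            parts.append(re.escape(form[i]))
            i += 1
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


pattern = variant_pattern("cyfraith")
# Invented test strings, not forms quoted from the corpus.
text = "y gyfraith, kyvraith a cyfraith"
print(pattern.findall(text))
```

The sketch finds the two spellings that differ only in the listed graphemes, but it misses the form beginning with g. Initial consonant mutation, like the linguistic change mentioned above, is a separate difficulty that a table of spelling alternations does not address.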
We are currently working on a pilot project involving three of the more difficult texts of the corpus: one seventeenth-century drama text ('Y Rhyfel Cartrefol', National Library of Wales Cwrtmawr ms. 42), and two samples of scripture (from Y Testament Newydd (London, 1567), and Y Beibl Cysegr-Lan (London, 1588)). These texts will be available soon.
