A bilingual writer corpus for research on biliteracy

By: David M. Palfreyman, Zayed University, UAE

The language corpus (a large, structured collection of authentic language texts) has become a valuable resource for research on language. Corpora offer large, representative samples of 'real world' language use, which can be searched for examples of words or constructions; they are often enriched with annotations so that they can also be searched for particular ideas, errors or other features. Until very recently, however, language corpora were not designed with biliteracy in mind.

It is estimated that more than half the world's population uses more than one language every day, and many of these people are literate to some degree in more than one language. In Arab countries, for example, education systems tend to aim for a certain level of biliteracy in Arabic and another language; in the Gulf region, literacy rates are high (e.g. over 90% in the UAE), and English as well as Arabic plays a leading role. Language corpora, in contrast, have tended to focus on a single language; even research on learner corpora of writing in English (or in another language) tends to compare this writing with a reference corpus of writing by other, 'native' users of the same language.

Together with Prof. Nizar Habash at New York University Abu Dhabi, I have been preparing a new kind of corpus, which focuses instead on a large set of bilingual writers writing in *both their languages*. Unlike a so‑called 'parallel corpus', which pairs texts in one language with translations of those same texts into another language, the Zayed Arabic-English Bilingual Undergraduate Corpus (ZAEBUC) matches comparable texts in different languages written by the same writer on different occasions. Currently, ZAEBUC comprises short essays written by several hundred (mainly Emirati) incoming university students: 388 English essays (97,500 words) and 215 Arabic essays (34,400 words).

ZAEBUC is provided in uncorrected and corrected versions, so that errors in spelling and basic sentence grammar can be identified and analyzed. Both the Arabic and the English texts are also rated by three assessors using the Common European Framework of Reference (CEFR). Additionally, the corpus is annotated for part of speech, lemmas and other features, following commonly used standards for Arabic and English, to allow its use in computational linguistics as well as cross-linguistic research. For example, we used the Universal Dependencies part-of-speech standards because they are designed to facilitate comparison between languages. Finally, metadata about each writer and text enables researchers to focus on subcorpora, for example comparing texts on one topic with texts on a different topic.
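As a rough illustration of how such metadata might be used to select a subcorpus, here is a minimal Python sketch. The record layout, the field names (`writer_id`, `language`, `topic`, `cefr`) and the placeholder records are all invented for this example; they are not the actual ZAEBUC file format or data.

```python
from dataclasses import dataclass


@dataclass
class EssayRecord:
    """Hypothetical record for one essay; not the real ZAEBUC schema."""
    writer_id: str
    language: str  # "ar" or "en" (assumed coding)
    topic: str
    cefr: str      # e.g. "B1" (assumed to be a single agreed rating)
    text: str


def subcorpus(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Placeholder records, made up purely for illustration.
records = [
    EssayRecord("w001", "en", "topic_a", "B1", "placeholder text one"),
    EssayRecord("w001", "ar", "topic_b", "B2", "placeholder text two"),
    EssayRecord("w002", "en", "topic_b", "A2", "placeholder text three"),
]

# Select the English essays written on one topic.
english_topic_b = subcorpus(records, language="en", topic="topic_b")
```

The same helper could be called twice with different `topic` values to obtain the two sets of texts to be compared.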

ZAEBUC will be an open research resource, aligned with the recent 'multilingual turn' in linguistics. It will enable researchers to investigate a range of questions such as:

  • Do students who use more complex constructions in their first language writing also tend to use more complex constructions in their second language?
  • Do male students use different vocabulary from female students when writing in their first language?  And is a similar pattern evident when they are writing in their second language?
  • To what extent is English a 'second language' for these students? For example, which language is dominant in students who attended an English-medium high school, compared with those who attended an Arabic-medium high school?
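
Because each writer contributes texts in both languages, a question like the first one above could in principle be approached by pairing each writer's Arabic and English scores and correlating them. The sketch below is one possible way to do this in Python, not the project's own method: the per-text "complexity score" is a hypothetical measure left undefined here, and the numbers are invented placeholders, not ZAEBUC results.

```python
from math import sqrt


def paired_scores(scores):
    """Align scores for writers who have both an Arabic and an English text.

    `scores` maps (writer_id, language) -> numeric score, where language
    is "ar" or "en" (an assumed coding for this sketch).
    """
    writers = sorted(
        {w for (w, _lang) in scores
         if (w, "ar") in scores and (w, "en") in scores}
    )
    arabic = [scores[(w, "ar")] for w in writers]
    english = [scores[(w, "en")] for w in writers]
    return arabic, english


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented placeholder scores; writer "w4" has no Arabic text and is skipped.
toy_scores = {
    ("w1", "ar"): 1.0, ("w1", "en"): 2.0,
    ("w2", "ar"): 2.0, ("w2", "en"): 4.0,
    ("w3", "ar"): 3.0, ("w3", "en"): 6.0,
    ("w4", "en"): 5.0,
}

arabic, english = paired_scores(toy_scores)
r = pearson(arabic, english)
```

Writers with a text in only one language are dropped before the correlation is computed, which matters for a corpus like this one, where the numbers of English and Arabic essays differ.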

For updates and information about using ZAEBUC for your research, please see https://www.researchgate.net/project/Zayed-Arabic-English-Bilingual-Undergraduate-Corpus-ZAEBUC or email david "dot" palfreyman "at" zu.ac.ae.