ACAW - Aachen Corpus of Academic Writing


The Aachen Corpus of Academic Writing (ACAW) is a corpus of advanced L2 writing in English. It is designed to represent the register of English academic writing in its narrow sense, a register characterized by its conventionalized and compressed nature (cf. Biber & Gray, 2010; Callies & Zaytseva, 2013). Like the Corpus of Academic Learner English (CALE), ACAW is designed to support research on advanced L2 learning and related fields such as English as a Second Language (ESL), English for Specific Purposes (ESP), and English for Academic Purposes (EAP). ACAW consists of two components: (1) a central component representing non-native (L2) English academic writing by L1 German advanced learners of English and (2) a control component representing native (L1) German academic writing by the same learners. The learners are 2nd or 3rd year undergraduates enrolled in the bachelor's programmes of the English Department at RWTH Aachen University. They had 7-9 years of formal instruction in English before entering university and thus meet the institutional criteria for advanced learner status specified in Callies (2009: 116f.). Because both components are written by the same learners, this design allows comparisons between native and non-native writing at both group and individual level.
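
The group-level and individual-level comparisons mentioned above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout (learner ID, component label, one numeric measure per text), the labels "L1" and "L2", and the function names are hypothetical and do not reflect the actual ACAW data format. The numbers are invented toy values, not corpus results.

```python
from statistics import mean


def group_means(records):
    """Group-level view: mean of the measure per component (L1 vs. L2)."""
    by_component = {"L1": [], "L2": []}
    for _learner_id, component, value in records:
        by_component[component].append(value)
    return {comp: mean(vals) for comp, vals in by_component.items() if vals}


def individual_differences(records):
    """Individual-level view: per-learner L2 mean minus L1 mean.

    Learners who contributed to only one component are skipped,
    because no within-learner comparison is possible for them.
    """
    paired = {}
    for learner_id, component, value in records:
        paired.setdefault(learner_id, {"L1": [], "L2": []})[component].append(value)
    return {
        learner_id: mean(vals["L2"]) - mean(vals["L1"])
        for learner_id, vals in paired.items()
        if vals["L1"] and vals["L2"]
    }


if __name__ == "__main__":
    # Invented toy values for a hypothetical per-text measure
    # (for example, mean sentence length in tokens).
    toy_records = [
        ("s01", "L1", 20.0),
        ("s01", "L2", 16.0),
        ("s02", "L1", 24.0),
        ("s02", "L2", 18.0),
    ]
    print(group_means(toy_records))
    print(individual_differences(toy_records))
```

The point of the sketch is only that the same-learner design makes the second function possible: a corpus with unrelated native and non-native writers would support the group-level comparison alone.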


Corpus Design

Corpus type: Parallel (comparable) corpus
Corpus components: Native and non-native academic writing by university students.
Target language: English
First language: German
Medium: Written
Text type: Academic research writing
Proficiency level: Advanced
Availability: Under development
Size: Expanding; as of October 2015, the L2 component consists of ~240,000 words and the L1 component of ~225,000 words.
Format: Both components of ACAW are stored as XML following the TEI (Text Encoding Initiative) specifications.
Non-Linguistic Annotation: All texts contain metadata on learner variables gathered through a self-report questionnaire: age, gender, knowledge of other foreign languages, reading exposure (i.e. reading of scientific/non-scientific texts in English) and time spent in an English-speaking country.
Linguistic Annotation: All data are annotated using components of the Stanford CoreNLP toolkit (Manning et al., 2014): (1) Tokenization (TokenizerAnnotator), (2) Sentence splitting (WordsToSentencesAnnotator), (3) Part-of-Speech (POS) tagging (POSTaggerAnnotator), (4) Lemmatization (MorphaAnnotator) and (5) Syntactic Parsing (ParserAnnotator).
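
As a rough illustration of the linguistic annotation chain, the five steps correspond to the short annotator names that CoreNLP accepts in its `annotators` property: `tokenize`, `ssplit`, `pos`, `lemma` and `parse`. The Python sketch below only assembles that property line. The step labels and the helper name `annotators_property` are hypothetical, and the actual ACAW pipeline settings (models, language options, output format) are not documented here, so this should be read as one plausible configuration rather than the project's own.

```python
# Short CoreNLP annotator names for the five steps, in pipeline order.
# Order matters: each step consumes the output of the steps before it.
STEP_TO_ANNOTATOR = [
    ("tokenization", "tokenize"),
    ("sentence splitting", "ssplit"),
    ("part-of-speech tagging", "pos"),
    ("lemmatization", "lemma"),
    ("syntactic parsing", "parse"),
]


def annotators_property(steps=STEP_TO_ANNOTATOR):
    """Build the 'annotators' line of a CoreNLP properties file."""
    names = [annotator for _step, annotator in steps]
    return "annotators = " + ",".join(names)


if __name__ == "__main__":
    print(annotators_property())
    # prints: annotators = tokenize,ssplit,pos,lemma,parse
```

A line of this form can be placed in a properties file passed to CoreNLP, or the comma-separated value can be supplied directly as the `annotators` setting when a pipeline is constructed.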