ACAW - Aachen Corpus of Academic Writing

 

The Aachen Corpus of Academic Writing (ACAW) is a corpus of advanced L2 writing in English. It is designed to represent the register of English academic writing in the narrow sense, a register characterized by its conventionalized and compressed nature (cf. Biber & Gray, 2010; Callies & Zaytseva, 2013). Like the Corpus of Academic Learner English (CALE), ACAW is designed to support research on advanced L2 learning and related fields such as English as a Second Language (ESL), English for Specific Purposes (ESP), and English for Academic Purposes (EAP). ACAW consists of two components: (1) a central component representing non-native (L2) English academic writing by L1 German advanced learners of English and (2) a control component representing native German academic writing by the same learners. The learners are second- or third-year undergraduates enrolled in the bachelor programmes of the English Department at RWTH Aachen University. They received 7-9 years of formal instruction in English before entering university and thus meet the institutional criteria for advanced learner status specified in Callies (2009: 116f.). Because both components are written by the same learners, this design allows comparisons between native and non-native writing at both the group and the individual level.

References:

Biber, D. & Gray, B. (2010). Challenging stereotypes about academic writing: complexity, elaboration, explicitness. Journal of English for Academic Purposes 9(1), 2-20.
Biber, D., Gray, B. & Poonpon, K. (2011). Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly 45(1), 5-35.
Callies, M. (2009). Information highlighting in advanced learner English: The syntax-pragmatics interface in second language acquisition. Amsterdam: Benjamins.
Callies, M. & Zaytseva, E. (2013). The Corpus of Academic Learner English (CALE) – A new resource for the assessment of writing proficiency in the academic register. Dutch Journal of Applied Linguistics 2(1), 126-132.

 

Corpus Design

Corpus type: Parallel (comparable) corpus
Corpus components: Native and non-native academic writing by university students.
Target language: English
First language: German
Medium: Written
Text type: Academic research writing
Proficiency level: Advanced
Availability: Under development
Size: Expanding; as of October 2015, the L2 component comprises ~240,000 words and the L1 component ~225,000 words.
Format: Both components of ACAW are stored as XML following the TEI specifications.
Non-Linguistic Annotation: All texts contain metadata on learner variables gathered through a self-report questionnaire: age, gender, knowledge of other foreign languages, reading exposure (i.e. reading of scientific and non-scientific texts in English), and time spent in an English-speaking country.
Linguistic Annotation: All data are annotated using components from the Stanford CoreNLP toolkit (Manning et al., 2014): (1) tokenization (TokenizerAnnotator), (2) sentence splitting (WordsToSentencesAnnotator), (3) part-of-speech (POS) tagging (POSTaggerAnnotator), (4) lemmatization (MorphaAnnotator), and (5) syntactic parsing (ParserAnnotator).
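
For readers who want to run a comparable annotation pipeline on their own data, the five steps listed above correspond to the short annotator names that Stanford CoreNLP accepts in its `annotators` property. The following Python sketch only assembles that property line; it is an illustration, not the actual ACAW configuration, which is not documented here. The names `ANNOTATION_STEPS` and `build_properties` are hypothetical helpers introduced for this example, and the class names are given as they appear in the CoreNLP distribution.

```python
# Each annotation step, the CoreNLP annotator class that performs it, and the
# short name CoreNLP uses for that annotator in its `annotators` property.
# ANNOTATION_STEPS and build_properties are illustrative names, not ACAW code.
ANNOTATION_STEPS = [
    ("Tokenization", "TokenizerAnnotator", "tokenize"),
    ("Sentence splitting", "WordsToSentencesAnnotator", "ssplit"),
    ("Part-of-speech tagging", "POSTaggerAnnotator", "pos"),
    ("Lemmatization", "MorphaAnnotator", "lemma"),
    ("Syntactic parsing", "ParserAnnotator", "parse"),
]


def build_properties(steps=ANNOTATION_STEPS):
    """Return one line of a CoreNLP .properties file for the given steps.

    The order of the steps is preserved, because later annotators depend on
    the output of earlier ones (e.g. POS tagging needs tokens and sentences).
    """
    short_names = ",".join(short for _, _, short in steps)
    return f"annotators = {short_names}\n"


if __name__ == "__main__":
    # Print the property line so it can be saved to a .properties file.
    print(build_properties(), end="")
```

The resulting line can be saved to a properties file and passed to CoreNLP; any further settings (models, output format) would have to be chosen separately, since the corpus description does not specify them.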