The following are ongoing corpus-related research projects conducted by the directors and students of the Second Language Studies Corpus Lab, in some cases with additional collaborators.

Charlene Polio, Jianwu Gao, Quy Pham, Jared Kubokawa

This is an ongoing project examining qualitative and quantitative empirical studies in applied linguistics. Part 1 includes a link to the publication in Research Methods in Applied Linguistics.  Part 2 is being written up.  Part 3 will be a follow-up qualitative study.

Part 1 (published in Research Methods in Applied Linguistics): In a 1.3 million-token corpus of the introductions and literature reviews from 577 full-length articles (quantitative: 316; qualitative: 261) in four leading journals in L2 learning between 2013 and 2019, we performed linear mixed-effects modeling on the frequencies of theor-. Using the logDice statistic and Sketch Engine, we found that (1) the qualitative studies used theories more frequently, and (2) although theories were used by both bodies of research for structuring and grounding, the other roles assigned to theories diverged: quantitative studies constructed theories as both agents and objects of specific positivistic reasoning processes, while qualitative studies either represented theories as dynamic agents with strong subjectivity or adopted a pragmatic, dialogic attitude when representing theories as objects. (Gao, Pham, Polio)
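As a rough illustration of the association measure named above, logDice is commonly defined as 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of a node word and a collocate and f_x and f_y are their individual frequencies. The sketch below assumes that definition; the function name and the counts are hypothetical and are not taken from the study.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of node and collocate
    f_x:  frequency of the node word
    f_y:  frequency of the collocate
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts, for illustration only.
score = log_dice(f_xy=50, f_x=1000, f_y=600)
print(round(score, 2))
```

Because the score depends only on the three frequencies and not on corpus size, values can be compared across corpora of different sizes; the theoretical maximum is 14, reached when the two words only ever occur together.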

Part 2: In a follow-up study, presented at AAAL 2022, we pulled the sentences containing theor- and classified them according to their rhetorical move (based on Swales’s Create a Research Space model). Through this analysis, we have been able to pinpoint the functions of theor- in structuring a literature review. Specifically, the analysis revealed that both quantitative and qualitative research used theories primarily to provide general research background, but qualitative studies also relied on theories more frequently to announce the research purpose, define key terms, and summarize, clarify, and justify the framework. In addition, the analysis has allowed us to look more closely at some of the less frequent forms of theor-, such as the verb form, that were not examined in Part 1. (Gao, Pham, Polio)

Part 3: This qualitative interview-based study is currently being planned. The goal is to triangulate the findings from the two corpus-based studies by interviewing authors whose publications were included in the corpus. (Polio, Kubokawa)

Philip Montgomery

Although advice literature about research writing presents acknowledging limitations and suggesting directions for future research as obligatory moves that demonstrate an author’s critical self-evaluation and authority, there is no research that focuses on their usage in research articles (RAs). Based on two specialized corpora of 100 quantitative and 100 qualitative RAs from four applied linguistics journals, this exploratory mixed-methods study included a) a genre analysis, which highlighted the relative prominence of these moves across methodological approaches and journals; b) a p-frame analysis, which generated a list of linguistic frames with one or more variable slots (e.g., it is important to * (note/mention/realize)); c) questionnaires (n=114); and d) semi-structured interviews (n=21) with applied linguists at varying stages in their careers. Findings revealed a) how scholars define and use these moves, b) their variation across journals and methodological approaches, c) strategies scholars employ when deciding how prominently to include them, and d) variable frames that can support multilingual and emerging scholars. The study has implications for graduate program instruction and genre analysis methodology.
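To make the idea of a p-frame concrete, the following is a minimal sketch of how frames with a single variable slot might be extracted from text. It is not the study's procedure: the function name, the five-word frame length, the choice to allow the slot in any position except the first, and the example sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def extract_p_frames(sentences, n=5, min_fillers=2):
    """Collect n-word frames with one variable slot ('*') and count their fillers.

    Only frames attested with at least `min_fillers` distinct fillers are kept.
    """
    frames = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Assumption: the slot may be any position except the first word.
            for slot in range(1, n):
                frame = " ".join(gram[:slot] + ["*"] + gram[slot + 1:])
                frames[frame][gram[slot]] += 1
    return {
        frame: dict(fillers)
        for frame, fillers in frames.items()
        if len(fillers) >= min_fillers
    }


# Hypothetical sentences, for illustration only.
sample = [
    "it is important to note that the sample was small",
    "it is important to mention that the sample was small",
    "it is important to realize that the data were limited",
]
p_frames = extract_p_frames(sample)
print(p_frames["it is important to *"])
```

In this toy input, the frame "it is important to *" is kept because three different words fill its slot, whereas frames whose slot is always filled by the same word are discarded.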

Charlene Polio, Hyung-Jo Yoon

Part 1 (published in Learner corpora meets second language acquisition): Most second language writing research assesses accuracy through measures that require identifying errors (e.g., error-free unit ratios; number of errors). Although achieving reliability on most measures is possible, the measures are labor-intensive and problematic because of their lack of validation, their lack of theoretical basis, and the questionable separation of lexis and grammar on some of the measures. We therefore explore automated accuracy measures rooted in a usage-based theory of second language acquisition, which views language as a set of constructions or chunks. The proposed set of measures was calculated by comparing two- and three-word combinations (bigrams and trigrams) from learners’ essays to bigrams and trigrams in the Corpus of Contemporary American English. The first step involved a factor analysis with 139 ESL essays. We manually coded for three traditional measures of accuracy and, using automated programs, calculated measures of syntactic complexity and lexical sophistication. Using a set of three corpus-based measures, reduced from six, we found that the measures (proportion of absent trigrams and the mutual information scores of the bigrams and trigrams) grouped with measures of accuracy and not with lexical sophistication or syntactic complexity.
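The two kinds of corpus-based measures named above can be sketched as follows. This is only an illustration under stated assumptions: a one-sentence toy text stands in for the reference corpus, mutual information is computed as pointwise MI with token counts as the denominator, and all function names and data are hypothetical. The study's exact formulas and tools are not described here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def proportion_absent(essay_tokens, reference_ngrams, n=3):
    """Share of the essay's n-grams that never occur in the reference corpus."""
    grams = ngrams(essay_tokens, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in reference_ngrams) / len(grams)


def bigram_mi(bigram, unigram_counts, bigram_counts, total_tokens):
    """Pointwise mutual information: log2(P(w1, w2) / (P(w1) * P(w2)))."""
    w1, w2 = bigram
    p_joint = bigram_counts[bigram] / total_tokens
    p_w1 = unigram_counts[w1] / total_tokens
    p_w2 = unigram_counts[w2] / total_tokens
    return math.log2(p_joint / (p_w1 * p_w2))


# Toy reference text standing in for a large reference corpus (assumption).
reference = "the cat sat on the mat and the dog sat on the rug".split()
ref_unigrams = Counter(reference)
ref_bigrams = Counter(ngrams(reference, 2))
ref_trigrams = set(ngrams(reference, 3))

essay = "the cat sat on the dog".split()
print(proportion_absent(essay, ref_trigrams))
print(bigram_mi(("sat", "on"), ref_unigrams, ref_bigrams, len(reference)))
```

Here one of the essay's four trigrams ("on the dog") is unattested in the toy reference text, so the proportion of absent trigrams is 0.25; a higher proportion would suggest more word combinations that the reference corpus does not license.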

Part 2 (presented at AAAL 2019): As a follow-up validation study, we used 1244 essays from the Cambridge Learner Corpus, which included essays with the errors coded and the same essays with the errors corrected. We calculated corpus-based measures for the essays and found that the same three measures from the first study correlated with the number of errors, with coefficients ranging from .36 to .51. We calculated the change in the corpus-based accuracy measures from the original to the corrected essays and found that they decreased significantly with large effect sizes while the complexity measures did not change. These automated measures hold promise for more theoretically based conceptions of accuracy. We hope to explore these corpus-based error identification measures as a way to give automated feedback. (Polio, Yoon)

Sandra Deshors, Steven Gagnon

This study focuses on the usage patterns of progressive marking (specifically the progressive vs. nonprogressive alternation) in Korean Learner English (KLE) and how those patterns differ from those in native English. Due to formal, morphological, and semantic typological differences between English and Korean, KLE is a promising candidate for unveiling new patterns of progressive marking in learner language, and it can help us understand more deeply the process of native-language interference. Methodologically, the study is based on over 2,600 contextualized occurrences of the (non-)progressive constructions manually annotated for nine co-occurring linguistic factors. Statistically, the study includes a collostructional analysis followed by a GLMM tree analysis. Overall, even though Korean learners of English are able to use the progressive relatively similarly to native speakers, there are nonetheless subtle systematic deviations between the two speaker populations. For example, stative progressives characterize KLE more than native English. Ultimately, our results reveal that, at a fine-grained level, there is a disconnect between, on the one hand, the diversity of the linguistic contexts that characterize progressive marking in native English and, on the other hand, the linguistic contexts that trigger a progressive construction in KLE specifically. Our results bear pedagogical implications supporting the adoption of data-driven learning approaches in the Korean English classroom.

Steven Gagnon

This corpus-based study investigates the uses of phrasal verbs in Korean learner English. A known acquisitional challenge for all English learners, these verbs are particularly problematic for Korean learners of English because they are not available in Korean. We extracted approximately 1,500 occurrences of phrasal verbs in (in)transitive constructions (Verb Particle, Verb Particle Object, and Verb Object Particle) from the written Yonsei English Learner Corpus. Statistically, we conducted a co-varying collexeme analysis and assessed, for each construction, to what extent (i) lexical verbs and particles attract, (ii) phrasal verbs and semantic uses attract, (iii) pairings of lemmas, particles, and semantic uses vary across learner and native English, and (iv) the strength of those pairings varies across speaker populations. Overall, learners’ uses of these constructions are surprisingly varied. In terms of absolute lemma-particle co-occurrences, the data yield strong usage patterns by learners. However, semantically, our results confirm that learners’ difficulties lie at the grammar-lexis interface. Specifically, learners have yet to fully grasp that the syntactic configurations in which lemmas and particles combine vary as a function of individual phrasal verbs and individual semantic uses.
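For readers unfamiliar with how attraction between a verb and a particle can be quantified, the sketch below shows one common approach in collostructional work: a one-tailed Fisher exact test on a 2x2 co-occurrence table, with the p-value log-transformed into an association strength. The association measure actually used in this study is not stated here, so this choice, the function names, and the counts are all assumptions.

```python
from math import comb, log10


def fisher_attraction_p(a, b, c, d):
    """One-tailed Fisher exact p-value for attraction in a 2x2 table.

    a: verb V with particle P          b: verb V with other particles
    c: other verbs with particle P     d: other verbs with other particles
    Returns P(X >= a) under the hypergeometric distribution.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, col1)


def association_strength(a, b, c, d):
    """Negative log10 of the p-value; larger values mean stronger attraction."""
    return -log10(fisher_attraction_p(a, b, c, d))


# Hypothetical counts for one verb-particle pairing, for illustration only.
p_value = fisher_attraction_p(a=3, b=1, c=1, d=3)
print(p_value)
print(association_strength(3, 1, 1, 3))
```

Running the same computation separately on learner and native data would give two strengths per pairing, which is one way the comparison across speaker populations described above could be operationalized.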

Adam Pfau

This study is one part of a larger project that will examine the use of hedging devices by Japanese learners of English when either direct or indirect data-driven learning (DDL) treatments are given to them based on corpus-informed materials. This portion of the study investigates differences in the frequency of hedging between Japanese learners of English and native-speaking writers of English. Samples of argumentative essays from Japanese learners of English at various proficiency levels were collected from the International Corpus Network of Asian Learners of English (ICNALE), along with an equal number from ICNALE’s corpus of essays written by native-speaking English writers. The two corpora are topic-controlled and comparable in length, genre, and prompt. The two corpora were examined for the frequencies of 75 different lexical items commonly associated with hedging (replicated from Hyland & Milton, 1997). All occurrences of the lexical items were manually assessed to ensure that they were being used to express certainty or doubt, following Hyland and Milton’s methodology for searching for hedges within several large corpora. Frequencies for hedging categories based on word class are included, as well as frequencies for hedges categorized by their function. The results from this study will help inform the compilation of corpus-based materials that can then be used for direct and indirect DDL instruction with Japanese learners of English.

Kevin Fedewa

This study explores the uses of the primary Chinese negation words, bu (不) and mei (没), by native English speakers with the aim of assessing (i) to what extent lexical aspect affects the uses of bu and mei in L2 Mandarin and (ii) whether transfer effects from English to L2 Mandarin can be captured. The use of either bu or mei is closely tied to aspect. Specifically, “bu is semantically incompatible with aspect markers denoting realization” (Xu, 1997, as paraphrased by Xiao & McEnery, 2008: 291), and bu does not co-occur with perfective viewpoints (Xiao & McEnery, 2008). Mei, however, is compatible with experiential and actual aspects: positive forms of actual and experiential aspects tend to be more frequent than negative forms (Xiao & McEnery, 2008). Further, native Mandarin speakers rarely negate progressive and durative aspects (Xiao & McEnery, 2008). Altogether, these trends contrast sharply with English, where negating actual, experiential, progressive, and durative aspects is common. These typological differences strongly suggest the existence of negative L1 transfer effects, as L1 English learners would likely not have received input with negation of these aspectual categories. While the difficulties that L2 Mandarin learners have in using bu and mei are well documented, methodologically, existing research remains based on experimental approaches focused on the acquisition of negation and aspect (Yan, 2013), negation and mood (Wang & Chan, 2021), as well as modals and negation (Peng & Zhu, 2017). In this context, the present study builds on existing research by adopting a quantitative corpus-based approach to explore to what extent learners rely on L1 transfer, Mandarin input, or both in their production of negative forms as their interlanguage develops.
Methodologically, bu/mei constructions are investigated in context as extracted from the spoken and written components of the Guangwai-Lancaster Chinese Learner Corpus, manually annotated for actual, experiential, progressive, and durative aspectual categories, and compared to native speakers’ ratios of positive and negative forms. This study is a first step towards more comprehensive, multifactorial analyses of negation in L2 Mandarin integrating aspect, modality, and negation. As such, it has potentially important pedagogical implications that align with Data-Driven Learning approaches to second-language instruction.

Eunmi Kim

This corpus-based study provides a contrastive analysis of the use of well by Asian L2 English learners at low levels of L2 proficiency and by English native speakers, with special reference to the frequency of each function of the discourse marker well and its role in their turn-taking system. The study uses both quantitative and qualitative methods. For the quantitative analysis, it draws on data from English native speakers, beginners at the A2 level, and intermediate learners at the B1 level of the Common European Framework of Reference (CEFR; Council of Europe 2001) from the ICNALE (International Corpus Network of Asian Learners of English) Spoken Dialogue. In the qualitative analysis, each pragmatic function of well is categorized according to a classification of five distinct functions: a filler, a mitigator, a repair marker, a response marker, and a structurer. The results show that (a) English native speakers overused discourse functions of well and L2 learners underused them and (b) more proficient learners show patterns in their use of well similar to those of English native speakers in terms of frequency, functions, and role in turn-taking. Notably, English native speakers had almost the same ratio of filler tokens as L2 learners at the A2 level (~28%), which indicates that the frequent use of fillers should not be regarded as disfluent speech. The study therefore raises critical issues related to hesitations and fillers in the assessment of oral proficiency and discusses pedagogical implications for teaching beginners.