Keynote Speakers

Glenn Fulcher

University of Leicester, United Kingdom

Alternative Validity Worlds


The late 20th century consensus on the meaning of validity, as embodied in Messick and the 1999 Standards, has evaporated. We are left with a multiplicity of validity models that are largely theoretically incongruent. As a result, each model has its ardent supporters and, in equal measure, its strident critics. But in truth the battleground has not shifted very far. Disputes centre on what is to be included in or excluded from a model, the definition of construct, or (still rather shockingly) the extent to which a test provider is responsible for score use. This parochial timidity stands in stark contrast to the expansive world view of visionaries like Lado, whose work depicts the kind of society that assessment can serve. Language testing is both a social science and a social phenomenon. As such it can never be value free. In this talk I consider a number of validity models in current vogue with regard to their implicit value systems, notions of society, and the human condition. I argue for an approach to validity and validation that prioritises a progressive value system, which in turn can motivate “effect-driven” language testing practices.


Elana Shohamy

Tel Aviv University School of Education, Israel

Incorporating expanded dimensions of ‘language’ for increasing construct validity


Critical language testing refers to the continuous need to raise questions about language tests in various domains, especially in terms of fairness and justice. Indeed, most of the research in the past two decades has raised questions about tests’ uses, misuses, injustices, ethicality and learning. It represented, at the time, a shift from judging the quality of tests by their psychometric features to viewing tests in terms of their uses in education and society. It was shown how central agencies – Ministries of Education, testing boards, principals and teachers – misuse tests to perpetuate their agendas, given the enormous power of tests and their unique capacity to determine the future of test takers and educational systems. Calls for democratic assessment, fairness, ethicality and effective learning became part of the need to increase the construct validity of tests, given Messick’s argument that tests’ uses and impact are part of construct validity. The critical testing approach continues to collect evidence and raise questions of justice and fairness across various domains of testing; it is now turning to the examination of ‘what’ is being tested on language tests: the nature of language. Drastic changes have occurred in the past decade in the definitions of language, mostly as a result of sociolinguistics, with regard to the diversity of learners, immigrants, indigenous peoples and others. These new definitions perceive language as multilingual, translingual, fluid, semiotic, multimodal and contextualized in space (linguistic landscape). Still, most language tests remain monolingual, static, formulaic and closed. This paper will focus on a variety of new approaches to the assessment of language in its expanded forms and definitions, and will share research on the advantages of multilingual, multimodal tests. I will show how such tests reflect the nature of learners’ language as we currently understand it. These tests reflect a revised construct validity of language tests in this day and age.




Saturday, April 13


Piibi-Kai Kivik & Elisa Räsänen

Indiana University, USA

Portfolio assessment of learners’ independent language use


The presentation discusses the implementation of a portfolio task in less commonly taught foreign language courses. We report on the introduction of an “independent use portfolio” as part of formative course assessment in our Finnish and Estonian language courses at a North American university across different proficiency levels. The portfolio is the first stage in re-organizing our curricula to correspond better to learners’ real-life L2 needs.

We seek to learn more about the authentic, situated language use of our students and to make it part of course assessment by collecting samples of, and self-reflections on, language use “in the wild”, together with students’ reflective talk about these (Garcia-Cruz & Lilja, 2017; Lilja & Piirainen-Marsh, 2018; Lilja, 2018). In the assignment, students record their efforts to use the target language outside of the classroom. They are asked to record samples and self-report instances of their interactions with native and non-native speakers of Finnish/Estonian in a variety of situations and modalities, as well as their engagement with target-language content. Students then discuss their portfolio entries in small-group conversations, which are recorded. We plan to analyze the video data of the conversations, identifying the interactional practices and multimodal resources in both the interactions reported in student portfolios and the group discussions of these experiences. We will then put forward an improved portfolio task.

The primary goals of the portfolio collection task are to learn about the range and interactional details of the naturally occurring and the semi-authentic (arranged for the portfolio) interactions, as well as to encourage students to engage in them. The discussions of the data serve to raise the learners’ awareness of themselves as L2 users, including in socially situated interactions, and to increase their agency in the learning process by identifying learnables (Eskildsen & Majlesi, 2018).

The portfolios will provide us with a view of the students’ L2 use outside of institutional learning situations (often literally, thanks to the video entries). Analysis of the portfolio entries and the learners’ introspection will result in curricular changes addressing the demonstrated needs. We will incorporate language use in the wild, as well as targeted instruction of interactional competence, into the curriculum, drawing on the conversation-analysis-inspired practices instigated at the Rice University Center for Language and Interaction and detailed in the 2016 CLIC symposium (Skogmyr Marian & Balaman, 2018; Waring, 2018; Salaberry & Kunitz, forthcoming). Our ultimate goal is to develop 1) final course assessment and 2) related teaching interventions and materials that correspond to students’ real-life language use.


Eskildsen, S. W., & Majlesi, A. R. (2018). Learnables and teachables in second language talk: Advancing a social reconceptualization of central SLA tenets. Introduction to the Special Issue. Modern Language Journal, 102 (Supplement 2018), 3–10. https://doi.org/10.1111/modl.12462

Garcia-Cruz, K., & Lilja, N. (2017). Experiential learning in action: L2 learners reflecting on their language use experiences. Paper presented at the Interactional Competence and Practices in a Second Language International Conference (ICOP-L2), University of Neuchatel, Neuchatel, Switzerland.

Lilja, N. (2018). Necessary self-deprecations: Analysing own interactions for second language learning. Paper presented at the International Conference for Conversation Analysis, University of Loughborough.

Lilja, N., & Piirainen-Marsh, A. (2018). Connecting the language classroom and the wild: Re-enactments of language use experiences. Applied Linguistics. https://doi.org/10.1093/applin/amx045

Salaberry, R. & Kunitz, S. (Eds.) (forthcoming). Teaching and testing L2 interactional competence: Bridging theory and practice. New York: Routledge.

Skogmyr Marian, K., & Balaman, U. (2018). Second language interactional competence and its development: An overview of conversation analytic research on interactional change over time. Language and Linguistics Compass, 12, e12285, 1–16. https://doi.org/10.1111/lnc3.12285

Waring, H. Z. (2018). Teaching L2 interactional competence: Problems and possibilities. Classroom Discourse, 9, 57–67. https://doi.org/10.1080/19463014.2018.1434082


Constanza Tolosa

University of Auckland, New Zealand


Challenges of assessing Intercultural Communicative Competence in multicultural school classrooms


With the expansion of the construct of communicative competence to include an intercultural dimension, teachers are now faced with the task of teaching and assessing Intercultural Communicative Competence (ICC). Based on recent empirical evidence in the New Zealand context, this paper presents some of the most pressing challenges reported by those ‘at the chalkface’ in schools where there is an expectation that the foreign languages curriculum develop ICC. Some of these challenges point to issues of validity in assessing students on their development of a multi-faceted construct within complex multilingual and multicultural classrooms.

The relatively recent New Zealand national curriculum document for foreign languages (mandatory only from 2010) declared that the development of ICC “lies at the heart of learning languages” and introduced a three-strand model in which communication is supported by language knowledge and cultural knowledge. In the cultural knowledge strand, students learn about culture and the interrelationship between culture and language. Furthermore, “[a]s they compare and contrast different beliefs and cultural practices, including their own, they understand more about themselves and become more understanding of others” (Ministry of Education, 2009, p. 24). These aims are reflected in achievement objectives used to assess the students’ performance in their foreign language.

In order to support teachers in implementing this new mandate, the ministry commissioned two reports: one focused on the Principles of Instructed Second Language Acquisition (Ellis, 2005), and a second focused on the Principles for Effective Intercultural Language Teaching and Learning (Newton et al., 2010). The stated aim of the Newton et al. principles is to support teachers in integrating language and culture from the beginning of students’ language learning, to encourage and develop a reflective approach to culture and culture-in-language, and to encourage appropriate responses to diverse learners and learning contexts. In practice, many teachers have reported struggling to integrate intercultural principles into their classrooms and to understand what (if anything) needs to be assessed under this new expectation of the foreign language curriculum.

Qualitative data from an online survey (67 valid responses) and interviews with six practicing teachers of Mandarin in New Zealand school classrooms were analysed thematically to identify the teachers’ understanding and practices regarding the inclusion of an intercultural dimension in their teaching. From these responses, the present paper focuses on the challenges that assessing ICC poses for the teachers. These include: the teachers’ own proficiency in Mandarin; their understanding of the ICC construct; the low value placed by different stakeholders on the ICC dimension of language; the risk of their own subjectivities in judging students’ ICC; the inclusion of the students’ own cultures in a holistic view of ICC; and the difficulty of assessing the development of intercultural competence with beginner students who have limited linguistic competence.


Casey Richardson

University of Arizona, USA


Enough is Enough: Using CRT to Demand an End to SEI


Despite the repeal of English-only policies in California and Massachusetts, Arizona continues to place bilingual students in English(-only) immersion classrooms, which the Arizona Department of Education positions as the universal solution to emergent bilinguals’ need to acquire academic English. Previous research (Florez, 2012), however, details the lack of validity surrounding Arizona’s placement of emergent bilinguals with respect to the Primary Home Language other than English (PHLOTE) survey: any language other than, or in addition to, English listed on the home language questionnaire leads the student to be assessed for English language proficiency via the Arizona English Language Learner Assessment (AZELLA). This standardized exam serves as an additional gatekeeper to emergent bilinguals’ access to quality schooling and meaningful participation in society (Lillie, 2014).

The social and political consequences for emergent bilinguals are significant, predominantly because their placement in an English(-only) immersion setting for four hours a day isolates them from their peers and deprives them of academic content. This segregation is claimed to be based on language, resulting in what Combs and colleagues (2014) refer to as linguistic apartheid: “the subjugation of minority language speakers by dominant language groups through cultural genocide or repressive language policies”. However, the negative washback effects of such practices extend beyond language alone. For example, most emergent bilinguals are not reclassified within one year and fall farther behind, widening the achievement gap between emergent bilinguals and native speakers. By contrast, students whose parents have the social capital to write only English on the PHLOTE are often mainstreamed and learn more of the academic content imperative for their graduation and future success. The purpose of this paper, then, is to expand our understanding of the social and ethical consequences of Arizona’s ELL assessment practices through a discussion of the multiple identities of those most targeted by such politicized actions: Latinx (immigrant) emergent bilinguals.

Drawing on CRT and, specifically, Crenshaw’s (1991) construct of intersectionality, this paper will detail the ways in which Arizona’s language assessment of emergent bilinguals is not just a reaction to students’ “limited English proficiency” but also stems from racially motivated political underpinnings (Shohamy, 2001, 2006). As a result, emergent bilinguals of color have been tracked into inequitable discrete-skills classes, which have historically been normalized and upheld by white supremacy. The continuous reproduction of this marginalization via such forms of assessment harms emergent bilinguals of color [e.g., the placement of emergent bilinguals into school contexts where they are stripped of their identities in order to create a unified nation-state, resulting in students’ cultural and linguistic assimilation (Wiley, 2004)]. As such, SEI is one of many gatekeeping tools in education through which the social elite maintain authority in undemocratic ways (Shohamy, 2001). To mitigate this, those most affected by such language assessment must be permitted meaningful participation and citizenship (Lillie, 2014; Ramanathan, 2013) through schooling, providing them the opportunities necessary to take part in the political conversations that determine the well-being of their own communities.



Gordon Blaine West

University of Wisconsin-Madison, USA


“I think my life was ruined”: Consequences of a University High-Stakes English Language Exam


This study examines the impact and consequences of a university English language exam on both instructors and students at a university in South Korea. In order to obtain their diploma and graduate, students at the university were required to achieve a certain score, which differed by major, on an English language exam. The test was developed by the university and marketed to other universities as a tool to ensure that students were achieving a high level of English proficiency during their university-level studies. The stated goals of the exam were to develop better pedagogy by emphasizing a communicative approach to testing, and to combat the negative washback effects of other English language tests focused more narrowly on grammar measures. A small-scale interview study was conducted with three different groups of participants. Interviews were conducted with regular English language instructors (n=7) at the university to identify ways in which the high-stakes testing policy impacted their teaching in the required English language courses. Interviews were also conducted with students and instructors in a special test prep course designed by the university for students who had repeatedly failed the English language exam, to identify the consequences of failure for those students from both the instructors’ perspectives (n=2) and the students’ perspectives (n=4). Interviews were conducted by two investigators and lasted between 30 and 90 minutes. They were then transcribed and coded by both investigators to identify themes across the data. Finally, a narrative analysis (De Fina & Georgakopoulou, 2011) of the accounts (De Fina, 2009) given in interviews, drawing on positioning theory (Bamberg, 1997), examined how students and instructors positioned themselves in relation to the assessment policy not only as passive subjects, but as resistant agents working against the policy in various ways.
Regular English language course instructors reported working against the policy by spending up to 90% of class time in lower-level courses specifically on test prep for the exam. Test prep course instructors also positioned themselves as aligned with the students in working to subvert the test policy through more lenient scoring when assessments were scored locally. Students who failed the exam positioned themselves as victims of an assessment policy with steep consequences, including, in one case, having acceptance to a graduate program rescinded and, in another, having graduation from university delayed by ten years. They also, however, positioned themselves as capable English users, refusing to recognize the validity of the assessment. The results from this exploratory study show the importance of qualitative studies of assessments in revealing the human consequences of testing that often go unreported in quantitative studies. In developing a more ethical framework for assessment, as called for by scholars in the field (Lynch, 2001; Shohamy, 2001/2014), we need qualitative studies that hear the voices of those most severely impacted by assessments in order to fully understand and gauge the impact validity of the assessments.


Alfred Rue Burch

Rice University, USA


Please Read this Out Loud in English:
Task Transition, Framing and Repair in Japanese OPIs


Oral Proficiency Interviews (OPIs), and Language Proficiency Interviews (LPIs, cf. Young & He, 1998; Ross, 2017) more generally, regularly incorporate role play tasks in order to give the interviewee an opportunity to initiate talk that is not merely responsive to the interviewer’s questions (Ross, 2017; Seedhouse & Nakatsuhara, 2018). However, transitioning from the interview to the role play, each with very different turn-taking and participation frameworks regarding who can speak at which points, presents a practical challenge for both the interviewer and the interviewee. The transition entails signaling the end of the interview, introducing the role play task, and providing the appropriate directives for the interviewee to be able to start and complete the new task and thus provide a ratable speech sample. Given the time constraints involved in completing OPIs, it is important that the task be framed clearly and concisely and that the interviewee’s understanding of the new task be confirmed before they begin. Interviewees regularly initiate repair (Schegloff, Jefferson & Sacks, 1977) upon the task framing in order to confirm their understandings and resolve any remaining issues. There are times, however, when this sequence of actions does not proceed smoothly.

This study employs Multimodal Conversation Analysis (Mondada, 2014), including a focus on both verbal and non-verbal behaviors (i.e. gaze direction and manipulation of the role play prompt card), to examine the task framing and task repair sequences in four Japanese OPIs conducted with interviewees at the intermediate level. Of particular interest is the interviewer’s directive, spoken in Japanese, to read the role play prompt card aloud in English. In each case, this directive is met with a repair initiation by the interviewee. The study compares how in three of these cases the repair sequence is minimal, and the interviewee quickly displays their understanding by following the directive, while in the fourth case the repair sequence plays out over many turns, requiring extra effort on the part of both participants to achieve the understanding required to move on to the next activity, and thus straining the time constraints.

The findings illuminate the degree to which interactional competencies (IC, cf. Pekarek Doehler, Wagner & Gonzalez-Martinez, 2018) play a role in the achievement of understanding of the assessment task itself, allowing the interviewer and interviewee to jointly construct the task in pursuit of the institutional goal of obtaining ratable speech samples. In other words, competencies are not only on display in the results of the assessment, but are necessary for the accomplishment of the task-in-process (Breen, 1987). The study thus has implications for how task transitions and task framing are managed in OPIs, particularly in light of the time constraints involved.


Fatima Baig & Katharina Kley

Rice University, USA


How task design impacts test takers’ topic initiations in a classroom paired speaking test


This paper presentation reports on a qualitative study that focuses on the effects of test task design on students’ topic initiation. The test task is part of a low-stakes classroom-based assessment instrument that is intended to assess first-semester German students’ interactional competence, such as repairing a misunderstanding and initiating and expanding on topics (Galaczi & Taylor, 2018). The test takers are peers and know each other from class. They are randomly assigned to pairs.

The task is a conversation task. Participants draw one of three cards. Each card lists three of the following topics addressed in class: personal information, family information, daily routine, and living arrangements. The test takers are expected to talk about these topics for about five minutes. One criterion assessed with the task is topic initiation.

Nine conversations from the fall semester of 2017 were video-recorded and transcribed. Conversation Analysis revealed that the task design has an impact on how students initiate topics (Bachman & Palmer, 1996; McNamara, 2006). Once students draw the topic card, the intention is for the card to be placed somewhere visible to both participants, e.g., on the table. However, the analysis showed that the topic card was actually used in various ways, which in turn affected topic initiation: for example, a student grabbing and holding on to the topic card and dominating the conversation; students discussing the order of topics to be addressed; or a student being interrupted by the tester and forced to address a new topic.

The findings suggest that a number of adjustments could be made to mitigate this design effect, possibly including tester training, adjustment of the scoring rubric, reevaluation of the task design, and rethinking of random student pairing.



Katharina Kley

Rice University, USA

Silvia Kunitz

Stockholm University, Sweden

Meng Yeh

Rice University, USA


L1-L2 speaker interaction: Affordances for assessing repair practices


This paper presentation reports on a case study that investigates the role of the interlocutor in a paired speaking test. Language testers have repeatedly stated that test discourse and assigned scores not only reflect test taker performance but are also influenced by a number of contextual factors (e.g., the test task, the candidate, the rater, the interlocutor) (Bachman, 1990; Bachman & Palmer, 1996; McNamara, 1996). This study focuses on the effect of the interlocutor’s native/nonnative speaker status on the test taker’s production of repair practices.


The study was conducted with a second-semester learner of Chinese participating in two classroom-based speaking tests with different interlocutors, a peer and a native speaker, both of whom are students. For both speaking tests, the test taker and her interlocutor engaged in an open-topic task; that is, they talked for 6 to 7 minutes about a number of topics of their choice. The two interactions were video-recorded and transcribed. The data were analyzed from a conversation analytic perspective with a focus on repair practices. Repair practices such as other-directed word searches and other-initiated repair are taught in second-semester Chinese and were thus among the teaching objectives and learning outcomes of the course. They also constituted one component of the grading criteria for the two speaking tests.


The conversation analysis revealed that, in the interaction with the peer, the test taker did not initiate any repair. Specifically, the test taker did not explicitly orient to issues of non-understanding (i.e., she never asked for clarification), and neither did she engage in other-directed word searches (i.e., enlisting her peer for vocabulary help). We speculate that this might be a safe strategy for the test taker to avoid potentially face-threatening situations, such as displaying non-understanding or asking for help that the peer might not be able to give.


In comparison, in the conversation with the native speaker, the test taker produced a number of different repair practices: other-directed word searches (tentative outcome in Chinese, translation request from English) and other-initiation of repair through partial repeats, through candidate understandings, and by addressing the entire turn. Thus, the interaction with the native speaker seems to provide more affordances for initiating repair in terms of other-directed word-searches and other-initiations of repair in the face of understanding issues. These affordances are related to the linguistic epistemic asymmetry between the two interlocutors, in that the test taker orients to the native speaker as being in K+ position (i.e., more knowledgeable in terms of language proficiency).


Although not all test takers and native speakers react in the same way, the findings of the study indicate that a linguistically asymmetric test setting, created by including a native speaker as interlocutor, may be most fruitful if the testing objective is to elicit repair practices from the test taker. Thus, if a test taker receives a low score on the repair subscale because she does not produce any repair in interaction with a peer, the claim made about the test taker’s inability to initiate repair and to engage in other-directed word searches may be misleading, with consequences not only for the test taker’s grades but also for classroom teaching and testing.



Jayoung Song & Wei-Li Hsu

Rice University, USA


The effects of test takers’ language proficiency on their role-play tasks and face validity in a Virtual Reality interactive speaking assessment


There have been rapid advances in both the capabilities and the cost of virtual reality (VR) in recent years. This new medium has been found to potentially offer extraordinary opportunities for learning through authenticity, embodiment, and immersion (Slater, 2017; Jacobson, 2017). Given that authenticity and the representation of real-life interactions are critical elements in accurately measuring test takers’ language use (Bachman & Palmer, 1996), a VR interactive assessment in which test takers can interact with a native speaker through immersion and co-presence in a simulated setting seems a plausible setting for assessing speaking. However, a viable threat to the score-based inferences yielded by a VR interactive assessment is the test takers’ level of language proficiency. Given that a VR interactive assessment, a novel environment for test takers, could increase task complexity (Brown, 1993; Elder et al., 2002; Skehan & Foster, 1997), test takers with low language proficiency may find the VR mode more cognitively demanding than the face-to-face mode, which could influence their language production during the task. To better understand the relationships among testing mode, test takers’ level of language proficiency, and test-taker attitude, the present study investigates 1) the extent to which test takers’ oral scores are affected by their level of language proficiency, and 2) test takers’ attitudes towards the VR testing mode across levels of proficiency. A total of 78 KFL students (n = 25 at the novice level, n = 53 at the novice-high level) enrolled in a private institution in the southwestern United States participated in the study. The participants took two sets of face-to-face and VR role-play tasks after receiving training in the VR environment.
Data were drawn from students’ test scores, a survey on their attitudes towards the two testing modes, conversation analysis of their speaking test transcripts, and interviews. The preliminary results indicated that the oral scores of novice-level students were more affected by the new testing mode (VR) than those of novice-high students. Novice-level students also showed a higher preference for face-to-face role-play tasks. The findings suggest that ethical considerations should be made when assessing oral skills in a VR environment: the immersive nature of VR assessment could leave beginning KFL learners with fewer cognitive resources for L2 production, making VR assessment more suitable for test takers whose oral proficiency is above the novice level.


Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests (Vol. 1). Oxford: Oxford University Press.

Brown, A. (1993). The role of test-taker feedback in the test development process: Test-takers’ reactions to a tape-mediated test of proficiency in spoken Japanese. Language Testing, 10(3), 277-301.

Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: what does the test-taker have to offer? Language Testing, 19(4), 347-368.

Jacobson, J. (2017). Authenticity in Immersive Design for Education. In Virtual, Augmented, and Mixed Realities in Education (pp. 35-54). Springer, Singapore.

Slater, M. (2017). Implicit Learning Through Embodiment in Immersive Virtual Reality. In Virtual, Augmented, and Mixed Realities in Education (pp. 19-33). Springer, Singapore.


Skehan, P., & Foster, P. (1997). Task type and task processing conditions as influences on foreign language performance. Language Teaching Research, 1(3), 185-211.



Wei-Li Hsu

Rice University, USA


Validity of a Classroom-Based, Democratic Assessment: A Case of Oral Tests


As critical pedagogy proposes, more practitioners are incorporating negotiated syllabi in their teaching and creating a more equal power relationship with students. Additionally, with more power over classroom learning, students would perceive more autonomy, which, based on self-determination theory (Deci & Ryan, 2011), could lead to greater motivation. Democratic assessment (Shohamy, 2001) argues for the need to include examinees’ voices in test development so as to assess language abilities that are useful and meaningful for them. In a similar spirit, many student-generated assessments also include items developed by students. However, these items are mainly multiple-choice grammar items and essay prompts, overlooking the interactive and impromptu nature of language use. Given the importance of assessing interactive and impromptu performance, the current study investigates concurrent validity with regard to test performance and examinees’ perceptions.

Nine heritage speakers of Chinese enrolled in a private university in the southwestern United States participated in the study. The participants took two oral tests; each included one regular role-play (RP), one discussed RP, and a perception survey. The regular RPs required examinees to introduce their background and invite their partners to watch a movie. The discussed RPs required them to disagree with each other and maintain small talk with a proper opening and closing. For the discussed RPs, examinees needed to decide on the topics they were going to disagree about and the contexts in which the small talk would occur. The discussed RP tasks were designed so that test takers could choose topics they were more comfortable disagreeing about and contexts they thought relevant to their L2 use outside of the classroom. Performance was assessed in four categories: pronunciation, language use, content delivery, and turn-taking. The perception survey assessed students’ preferences between the two RP types with regard to self-efficacy (k = 4) and test anxiety (k = 4).

Analyses found significant correlations between the two RP types across the four categories, although turn-taking showed the lowest correlation of the four. Survey results suggest that, regarding self-efficacy, the numbers of students who preferred discussed RPs (n = 3.13, averaged across the two test times and four survey items), those who preferred regular RPs (n = 3.25), and those with no clear preference (n = 2.25) were similar. Regarding test anxiety, however, the number of students who perceived higher anxiety during the discussed RPs (n = 3.75) was the highest, followed by those with no clear preference (n = 2.88) and those who perceived higher anxiety during the regular RPs (n = 1.88). Follow-up interviews suggested that the higher self-efficacy during discussed RPs could be attributed to students being able to choose topics they were comfortable with, whereas the higher anxiety may stem from their lower control over what their partners would produce. The findings support the validity and perceived efficacy of discussed RPs as a way of incorporating students’ voices in test development, although the interactive and impromptu features of discussed RPs and natural conversation may also lead to higher anxiety than regular RPs do.



Hiromi Takayama

Rice University, USA


What factors influence foreign language anxiety in oral assessment?


Although myriad studies have shown that foreign language anxiety impacts learners’ oral performance both in class and in assessment, it has not been investigated in the specific context of speaking assessment. According to Horwitz, Horwitz, and Cope (1986), foreign language anxiety is constructed from three types of performance anxiety: communication apprehension, test anxiety, and fear of negative evaluation, which points to an inevitable relationship between foreign language anxiety and assessment. These researchers conceptualized foreign language anxiety as “a distinct complex of self-perceptions, beliefs, feelings, and behaviors related to classroom language learning arising from the uniqueness of the language learning process” (p. 128). This encompasses not only fears of lacking sufficient language proficiency but also foreign language learners’ ideologies. Reflecting these constructs, this research explored in what ways foreign language anxiety affects learners’ oral performance.

This session analyzes how foreign language anxiety influences learners’ oral assessment through one-on-one speaking sessions with heritage Japanese speakers whose speaking proficiency is at the Distinguished level or above according to the ACTFL proficiency guidelines (2012). Each session was analyzed for the interactional and psychological factors that impacted oral performance. Ten intermediate-mid level Japanese language learners recorded conversation sessions of about 25 minutes. Their task was to talk about any topics of their choice and to write a reflection covering five categories: active participation, grammar accuracy, active listening, fluency and pronunciation, and speech level and politeness. The researcher applied Gee’s seven building tasks for discourse analysis (2014) and Fairclough’s 10 questions for critical discourse analysis (2010) to investigate three questions: (1) what factors influenced the participants’ anxiety levels in the one-on-one setting with interlocutors of higher target language proficiency; (2) how the assessment structure and rubric impacted their foreign language anxiety; and (3) how the participants analyzed their interactions with their interlocutors.

Major findings detailed the sources of foreign language anxiety. First, more than half of the participants felt nervous about the speaking assessment due to the length of the session and concerns about their accuracy. In their reflections, some participants observed that their fluency was hindered as they compensated for accuracy in speaking. However, despite their anxiety about the speaking performance, many of the participants ultimately reflected on their experience positively. Second, the proficiency gap between the participants and interlocutors correlated positively with the participants’ emotional state. Some reported that they co-constructed sentences and topics and appreciated that the interlocutors facilitated their interactions. For this speaking assessment, the proficiency gap positively influenced the participants’ oral performance. Drawing on these findings, the researcher suggests how foreign language teachers can reduce their students’ foreign language anxiety in a conversational oral assessment setting.


Yesenia Chavez

San Jacinto College, USA


A Lexical Exam for Spanish College Classes (3rd Edition): A Developmental Project


Although lexical exams are becoming more numerous, they still tend to be decontextualized, limited, varied, and difficult to create (Chavez, 2017a; Chavez, 2017b; Nizonkiza & Van den Berg, 2014; Rodrigo, 2009). The most common lexical exams are multiple-choice and cloze tests (Chavez, 2017b). However, several variables should be taken into consideration when creating an exam: the economic status of the test takers, their gender, the teaching practices of the faculty, the test takers’ Spanish level, the school, and the mode of administration, among others (Chavez, 2017a; Bailey Victery, 1971). Chavez (2017b) presents tentative guidelines for improving lexical exams in Spanish. The guidelines draw on research by Wood and Peña (2015), Izura et al. (2014), Lafford et al. (2003), and Pearson (1998). Chavez (2017a) began a series of lexical exams in Spanish that was improved in Chavez (2017c). This presentation is a developmental project on the Lexical Exam (3rd Edition), which started with Chavez (2017), a doctoral dissertation from the University of Houston. The exam has been administered to students in Spanish as a Heritage Language (HL) classes and Spanish as a Second Language (L2) classes at the college and high school levels. This presentation previews the updated versions of the multiple-choice lexical exam and the fill-in-the-blank survey (3rd Edition). The Chavez Lexical Exam (3rd Edition) will have 50 multiple-choice questions and a fill-in-the-blank survey of 25 questions; previously, they had 200 and 30 questions, respectively. The lexical surveys have been improved following the recommendations made in Chavez (2017b). There, Izura et al. (2014) recommend avoiding questions that are “too easy” or “too difficult,” including both real and non-real words, and testing at several levels. The Lexical Exam (3rd Edition) also keeps the language survey and assesses productive vocabulary, as recommended by Pearson (1998).
Finally, the Lexical Exam (3rd Edition) avoids cultural preferences and language obstruction, and it is used to assess, teach, and retest, as recommended by Wood and Peña (2015). However, the Lexical Exam (3rd Edition) still falls short of several recommendations in Chavez (2017b): it needs to aim for standardization across languages, involve more communication with other researchers, investigate the importance of input and technology, be used with larger groups, assess productive vocabulary in context, avoid an unequal distribution of difficulty, assess validity, and include questions beyond the ceiling effect. Its contributions to the field include the findings that adding context to a multiple-choice survey may increase test anxiety for students of Spanish as an L2 and that lexical exams should be administered at least twice in order to improve them. The Lexical Exam (3rd Edition) may be implemented in the spring 2019 semester in a college setting.



Mingxia Zhi

University of Texas at San Antonio, USA


The Consequential Validity of Paper- and Computer-Based ESL Writing Tests


Both paper-and-pencil (PB) and computer-based (CB) formats are currently used for standardized and institutional English as a second language (ESL) writing tests (Weir, O’Sullivan, Jin, & Bax, 2007). However, little research has empirically examined the consequences of test interpretation and test use for PB and CB ESL writing tests.

Purpose of the Study

This study examines the value implications and social consequences of CB and PB ESL writing assessment from a test-taker perspective. In Messick’s (1989) definition of the consequential basis of validity, two components should be examined: the consequential basis of test interpretation, which is “the appraisal of the value implications of the construct label, of the theory underlying test interpretation, and the ideologies in which the theory is embedded,” and the consequential basis of test use, which is “the appraisal of both potential and actual social consequences of applied testing” (p. 20).


Guided by Messick’s definition of the consequential basis of validity, this study employed a mixed-methods design to examine the consequences of administering an ESL writing test in CB and PB modes. Fifty-three (N = 53) ESL students were recruited to complete two 20-minute writing tasks (one CB and one PB) with two argumentative essay prompts in a counterbalanced group design. Data collection included a pre-survey (adapted from Petric & Czarl, 2003) to elicit participants’ writing processes under non-testing conditions, two cognitive writing questionnaires (adapted from Chan et al., 2017) administered after each writing task, and a post-survey and post-interview to understand the participants’ perceptions of the tasks. The writing samples were analyzed linguistically and rated by two raters using a holistic and an analytic rubric adapted from IELTS.


Preliminary results showed that computers are available to most participants at home (94.1%) and at school (70.6%). Paired-samples t-test results indicated that writers engaged in significantly more planning and revision during writing, and more revision after writing, in the CB test mode. Bivariate correlation analysis suggested that while most activities in the CB and PB tests correlated with participants’ daily English writing processes, the CB test yielded more significant correlations with daily writing for activities during writing. The quantitative results indicated that the CB test elicited behaviors closer to the construct of process writing, which suggests a stronger consequential basis of validity for test interpretation. Despite the higher preference for the CB test (70.6%), qualitative results revealed that many participants were more comfortable with the PB test because computers were not commonly used in their countries until the last five years, which may introduce a negative bias against test candidates from certain developing countries. PB writing may help writers’ ideas flow more naturally during writing, but it is more time-consuming and does not afford structural revisions. Some writers perceived this limitation as beneficial because it “forced” deliberate planning.


The findings indicate differences in the L2 writing processes generated under the two test modalities. Analyses of the written products will also be presented. We argue for making both formats available to test candidates in high-stakes exams, especially candidates from less developed countries, to ensure fairness and positive social consequences of the test.



Sunday, April 14


Reginald Gentry

University of Fukui, Japan


Predictors of Oral Fluency in Japanese University Students


Language processing is a complex process comprising auditory stimuli (Rost, 2011) and metacognitive comprehension (Andring et al., 2012; Goh, 2008; Martin & Ellis, 2012; McBride, 2011; Rost, 2011; Vandergrift & Tafaghodtari, 2010; Vandergrift, 2005; Vandergrift & Baker, 2015; Yeldham & Gruba, 2014). The combined aspects of listening enable a speaker to produce salient oral responses (Cutler & Clifton, 2000; Field, 2008; Graham & Macaro, 2008; Rost, 2011). However, individuals process an immense amount of information from their environments and therefore require explicit instruction to develop their aural and oral skills. Using explicit and corrective feedback (e.g., interviews, review of journal entries) with relevant explanations enables learners to expand their lexis and apply this knowledge in future listening and speaking contexts (Stafford et al., 2012; Lyster & Saito, 2010). Furthermore, repeated engagement in a specific task frequently yields benefits in developing utterance and cognitive fluency in language learners (Segalowitz, 2010).

This study will examine the effects of teacher feedback, topic familiarity, topic knowledge, and topic difficulty on the oral fluency of first-year Japanese university students (N = 48) in an engineering program during the fall 2018 semester (October to late January). Participants have compulsory English classes that are 90 minutes long and meet twice a week. The medium of instruction is English, which is also the first language of the participants’ instructor, the primary researcher.

Participants will be assigned a weekly speaking topic and will record their responses in English via a digital recording device (e.g., a smartphone) or recording software on a laptop or desktop computer. The topic will be given in the second class meeting each week, and participants will email their recordings to the researcher before the first class of the following week. They will thus have five days to complete the assignment: the time between the last class of the preceding week and the first class of the following week. Upon submitting each assignment, participants will also evaluate the topic using a six-point Likert scale (1 = strongly disagree; 6 = strongly agree) assessment sheet written in Japanese and English. Participants will rate topic familiarity, topic knowledge, and topic difficulty. They will also indicate the extent to which they used notes or a script while speaking, the number of times they practiced before the final recording, and how much time they spent preparing for each assignment.

The researcher will listen to the responses and provide feedback to the participants. The researcher will also record the data from the assessment sheets to ascertain which aspects of the assignments, in conjunction with explicit feedback, might influence oral fluency. Preliminary results of the study will be discussed, as this research is a work in progress that will be completed at the end of January 2019. Comments and feedback from conference attendees are strongly encouraged and warmly welcomed.



Sefa Owusu

University of Education, Ghana

Evaluating the Content Validity of High-Stakes ESL Tests in Ghana


A good test should have content validity; that is, it should reflect the objectives and the content of the curriculum so that the test is representative, relevant, and comprehensive. For a test to promote positive washback, it should reflect the course objectives upon which the test content is based. The high-stakes English language tests in Ghana should therefore reflect the objectives of the English language curriculum. The objective of this paper is to find out whether or not the high-stakes English language tests in Ghana cover the objectives and the content of the English language curriculum. The paper draws on data gathered through questionnaires and document analysis to answer the research question: To what extent are the high-stakes English language tests in Ghana aligned with the English language curriculum? The English language syllabus and past questions from 2010 to 2017 were analysed to establish the relationship between the test items and the prescribed syllabus. In addition, a questionnaire was administered to 24 English language teachers from four junior high schools and eight senior high schools. Analysis of the data revealed that the high-stakes English language tests in Ghana lacked washback validity. That is, the objectives of the English language curriculum were not fully reflected in the tests, since some topics or areas in the English language syllabus were not examined. This gap between the objectives of the English language curriculum and the focus of the high-stakes tests encouraged teachers to teach to the test, thereby concentrating on only the areas examined in the high-stakes tests. The teachers concentrated on grammatical structure, reading comprehension, and essay writing, which were tested in the high-stakes tests.
In effect, the results of this research could have important implications for high-stakes English language test reform and for the roles high-stakes language tests play in shaping ESL classroom practices in Ghanaian schools.



Yangting Wang

The University of Texas at San Antonio, USA

Test washback of TOEFL Preparation Courses at an Intensive English Program


Test washback is considered an important aspect of consequential validity (Messick, 1996). It refers to the effect or impact of a test on language learning and teaching (Abeywickrama & Brown, 2010). With over a million international students studying in the United States, the Test of English as a Foreign Language (TOEFL), as the English proficiency criterion for international admission, has become critical to students’ education. International students are seeking guidance in TOEFL test-taking and test-preparation strategies. Even though an increasing number of TOEFL preparation courses are offered in the United States, little is known about the effectiveness of these courses in helping students improve their TOEFL scores and their language learning (Huang, 2018). This study aims to investigate the washback effects of the TOEFL ITP (Institutional Testing Program) on TOEFL preparation classes, more specifically, its influence on classroom instruction, classroom dynamics, students’ attitudes, and TOEFL ITP score gains.

This study took place at an Intensive English Program (IEP) at a university in the Southwest United States, and data were collected in the 2018 spring and fall semesters. A total of 49 students from TOEFL levels 2 to 5 were involved, and each level’s TOEFL instructors participated in the project. The study follows Leech and Onwuegbuzie’s (2017) partially mixed concurrent dominant status design, in which qualitative and quantitative data are collected independently and the results triangulated after both analyses are completed. Given the research questions and aims, the study placed more weight on qualitative data collection and analysis. Qualitative data included 14 classroom observations and 16 student interviews (at least two observations and two interviews for each TOEFL class level), classroom artifacts, field notes, memos, and one open-ended survey question (n = 49). Qualitative data were analyzed inductively using coding techniques from Saldana (2015). Quantitative data included 15 Likert-scale survey items and TOEFL ITP scores collected four times, before and after each semester (n = 49). A one-way repeated-measures ANOVA and post hoc tests were conducted for the quantitative analysis.

Preliminary qualitative results indicated that teachers’ instruction followed the same “test strategies/skills-practice, skills-practice” delivery structure; there were only a few interactions in class, with the exception of the TOEFL Level 5 class; and students generally believed that a TOEFL class is necessary in an Intensive English Program. However, students had mixed attitudes towards whether the TOEFL class could improve their scores as well as promote their language learning. Initial quantitative results revealed significant differences between the first and third administrations of the TOEFL ITP exam; no significant differences were found among the second, third, and fourth administrations. The findings contribute to validity studies on test washback. Pedagogical implications for TOEFL preparation classes in terms of classroom interactions, classroom activities, and homework are discussed.



Analynn Bustamante

Georgia State University, USA

Exploring raters’ self-monitoring behaviors: an eye-tracking and stimulated recall study


Writing assessment in standardized testing contexts can be considered a political act, as these tests serve a gatekeeping role for academic opportunities. Because written texts must be rated, scores are based on raters’ values regarding essay features and their interpretations of rubric criteria. Therefore, in discussing the social and ethical consequences of assessment, raters ought to receive academic attention. Illuminating the “how and why” of raters’ scoring decisions is a crucial component in understanding the broader social impact of decisions made using test scores.

The purpose of the present analysis was to explore whether specific self-monitoring behaviors (Cumming, Kantor & Powers, 2002) were more common in certain rater subgroups (e.g., L1, expertise level). Fifteen participants, all applied linguistics graduate students with varying levels of professional experience and demographic backgrounds, rated placement-test essays while their eye movements were tracked; a stimulated recall (SR) was then conducted using the eye-movement recording. Participants were trained before each data collection session. As part of a larger mixed-methods study, the data were coded for raters’ self-monitoring behaviors for the present analysis. Qualitative results will be presented.

The eye-movement SR allowed raters to discuss their decision-making without the interference that may occur during think-aloud protocols: raters were able to watch their eye movements and elucidate their real-time thoughts. The data show that, while raters consciously engage in a variety of self-monitoring strategies during essay rating, there may not be subgroup-specific trends. The common themes that emerged (each mentioned by at least five participants) were the treatment of biases and first impressions, envisioning the test context, and rubric use. First, raters had a variety of strategies for dealing with biases. For example, essay length emerged as a common influence on raters’ first impressions, with many raters engaging in self-policing (“let me not score it based simply on the length”) and others justifying their length-based impressions (“you couldn’t explain something… in such short paragraph”). Another common theme was envisioning the larger context of the test and the test taker. A few participants considered “what the test is being used for” in their rating decisions, and many felt empathy for test takers (“it might be hard for them”). The final theme is raters’ rubric use. Participants stated that they primarily used the rubric to break ties or to look for certain criteria, but they often did not consult the rubric at all. Some felt they did not need to refer closely to the rubric immediately after training, while others actively used the rubric for several essays until they felt comfortable. Overall, although the sample size in this study is small, as is the case for many rater studies, there did not seem to be specific self-monitoring strategies associated with subgroups of previously explored rater characteristics such as L1 or expertise level.

Through thick description, the present study hopes to move research on essay rating beyond reliability statistics toward a more holistic approach to rater training and evaluation. Practical implications will be discussed.