UNCE Workshop May 23, 2025

ABSTRACTS

Colloquialization and Conversationalization in Casual Online Written Discourse

Veronika Raušová

The present research explores the convergence of casual online written discourse with features of spoken language, focusing on the processes of colloquialization and conversationalization in online forum comment threads. Colloquialization refers to the increasing informality and spoken-like structure of written texts, while conversationalization denotes the adoption of interactive, dialogic elements characteristic of casual speech in written discourse. A key feature of spoken discourse is the use of discourse-pragmatic (D-P) markers (optional elements such as like, well, actually, now, and you know), which contribute to the expression of speaker stance, the organization of discourse, and interpretive guidance. This study draws on a growing corpus (approx. 100 million tokens) of posts and comment threads from Reddit.com, scraped from eleven subreddits between February and May 2025. The current aim is to use the corpus data to test and refine automatic retrieval methods for as broad a range of discourse markers as possible, using tools such as the UDPipe 2 toolkit and circumscribing the relevant syntactic context for each marker. While some markers are reliably tagged (e.g., D-P like and well as INTJ) or strongly favor one syntactic slot (e.g., sentence-initial/clause-external D-P uses of now), others present challenges due to their multi-word structure (you know, I mean, I guess) or their positional mobility, particularly in the case of adverb-derived D-P markers (e.g., actually, basically). If sufficiently accurate, automatic retrieval of D-P markers could facilitate large-scale analyses of their usage in casual written discourse.
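As a rough illustration of the retrieval idea described above, and not the author's actual pipeline, the following Python sketch applies three simple rules to tagged tokens: an INTJ tag for like and well, sentence-initial now followed by punctuation, and a lookup of two-word sequences. The marker inventories, the rule names, and the hand-written sample in a simplified three-column CoNLL-U-like layout are all assumptions made for illustration; real input would come from a tagger such as UDPipe 2.

```python
# Hypothetical marker inventories; the abstract names these items only as examples.
SINGLE_INTJ = {"like", "well"}
MULTI_WORD = {("you", "know"), ("i", "mean"), ("i", "guess")}

# Hand-written sample (ID, FORM, UPOS only), not real corpus data.
SAMPLE = """\
1\tWell\tINTJ
2\t,\tPUNCT
3\tI\tPRON
4\tlike\tVERB
5\tit\tPRON
6\t,\tPUNCT
7\tyou\tPRON
8\tknow\tVERB
9\t.\tPUNCT

1\tNow\tADV
2\t,\tPUNCT
3\tthat\tPRON
4\tis\tAUX
5\tlike\tINTJ
6\tweird\tADJ
7\t.\tPUNCT
"""


def read_sentences(text):
    """Split the sample into sentences, each a list of (form, upos) pairs."""
    sentences = []
    for block in text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()]
        sentences.append([(form, upos) for _id, form, upos in rows])
    return sentences


def find_candidates(sentence):
    """Return (token index, marker, rule name) tuples for one sentence."""
    hits = []
    forms = [form.lower() for form, _ in sentence]
    for i, (_form, upos) in enumerate(sentence):
        low = forms[i]
        # Rule 1: like/well count only when the tagger labels them INTJ.
        if low in SINGLE_INTJ and upos == "INTJ":
            hits.append((i, low, "intj-tag"))
        # Rule 2: sentence-initial "now" set off by punctuation.
        if low == "now" and i == 0 and len(sentence) > 1 and sentence[1][1] == "PUNCT":
            hits.append((i, low, "initial-now"))
        # Rule 3: two-word sequences matched on lowercased forms only.
        if i + 1 < len(sentence) and (low, forms[i + 1]) in MULTI_WORD:
            hits.append((i, low + " " + forms[i + 1], "multi-word"))
    return hits


results = [find_candidates(s) for s in read_sentences(SAMPLE)]
for hits in results:
    print(hits)
```

The form-only lookup in the third rule would also match literal uses such as "you know the answer", which is one way to see why multi-word and positionally mobile markers need additional syntactic context.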

When svůj becomes můj: personal and reflexive possessive pronouns in Czech

Michal Láznička

When an adnominal possessor is coreferential with the subject of a clause, the Czech reflexive possessive pronoun svůj is typically used. However, personal pronouns may and do also occur in such contexts. I will present an analysis that follows up on Perevozchikova’s [Reflexive or Not? Choosing a Possessive in Bulgarian, Czech, and Russian. Scando-Slavica 69(2): 244-263. 2023] systematic quantitative study of the phenomenon.

Using a sample of more than 2000 occurrences from a corpus of online discussions and posts on social networks, I will show that the use of personal pronouns is associated with the preverbal position, larger structural distance from the predicate, and the presence of other referents that could be (mis)interpreted as the possessor. Furthermore, I will discuss how the use of personal pronouns in contexts favouring the use of svůj may be explained by semantic and pragmatic factors, such as possessum animacy. Finally, using data from two questionnaires (N = 110 and N = 139), I will show how the intuitions of Czech speakers align with the corpus data.

Telicity, Aspect, and the Imperfect Subjunctive in Latin

Martina Vaníková

The Latin corpus is unique: it is limited in size, lacks native speaker data, and often provides little contextual information. In earlier work, I showed that telic verbs in the imperfective indicative forms can convey completed or specific meanings.

Building on this, the talk focuses on the imperfect subjunctive, which appears more frequently with telic verbs than predicted. While present subjunctives are not especially rare, their distribution is less striking. The talk analyses these imperfect subjunctive instances in detail, comparing their use with telic and atelic predicates.

The Czech phonological stereotype of Hungarian

Jiří Januška

The Czech linguistic community living historically in the present-day territory of the Czech Republic seems to have a more or less(!) coherent idea of what "Hungarian sounds like". This talk will present the plans for researching this idea and the preparatory work carried out so far.

Autonomous LOTE Tandem Language Exchange as a Space to Foster Plurilingualism in Higher Education

Silvie Převrátilová

This study investigates the potential of tandem language learning (TLL) to foster plurilingualism in higher education, focusing on languages other than English (LOTEs). Conducted over one semester at a Czech university, the study involved 14 students (seven Czech–international pairs) participating in an autonomous tandem course. Data were collected through learner diaries and three open-ended questionnaires.

Analysis was conducted on two levels. A thematic analysis of the diaries and questionnaires identified how learners engaged in reflexive practices, constructed plurilingual identities, and demonstrated language awareness. Participants reflected on cross-linguistic similarities and differences, drew on prior language knowledge, and positioned themselves as plurilingual speakers. A complementary analysis examined translanguaging practices in the diaries, revealing how learners blended multiple languages—including their L1, the target language, and English—within individual entries to scaffold comprehension, reflect on their learning, and navigate intercultural dialogue.

The findings show that learners naturally activated plurilingual strategies in informal, peer-based settings, even without explicit pedagogical guidance. The study contributes to research on plurilingual education by illustrating how tandem learning supports metalinguistic awareness, identity development, and translanguaging as integral dimensions of language learning. It also advocates for stronger integration of plurilingual practices into language teaching and teacher education.

Typology of corpus-based exercises and their automatic generation using AI

Adrian Zasina

This study presents a typology of corpus exercises that can be generated using generative AI and analyses the feasibility of automatically producing corpus-based exercises. The automated creation of such exercises highlights the new potential of modern technologies in language learning, particularly in data-driven learning. ChatGPT (OpenAI, 2025), employed for this purpose, was able to generate a procedure for designing a corpus-based exercise within seconds, while the Corpus Linguist language model (Milička & Machálek, 2024) facilitated direct interaction with corpus data.

The evaluation of automatically generated corpus exercises revealed several shortcomings, primarily the inconsistent use or non-use of corpus data, the formulation of conclusions about common difficulties faced by foreign learners based on sources other than corpus data, and the inclusion of incorrect examples in sample exercises. In such cases, expert oversight is indispensable, as specialists must not only assess the linguistic aspects of the results but also verify their alignment with corpus data.

Although these findings are promising for future applications, the current state of AI-generated exercises may be more detrimental than beneficial to inexperienced learners. It is therefore essential to monitor further developments, critically evaluate advancements, and, above all, raise awareness of the limitations of such tools. As with the use of corpora in language learning, the integration of generative AI requires a well-informed approach and careful, nuanced interpretation of the information provided. Similar concerns are echoed in international research, where the safety and reliability of AI in computer-assisted language learning remain open questions (Kohnke, 2025).

Kohnke, L. (2025). AI Safety in Computer-Assisted Language Learning. In L. McCallum & D. Tafazoli (Eds.), The Palgrave Encyclopedia of Computer-Assisted Language Learning (pp. 1–5). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-51447-0_95-1

Milička, J., & Machálek, T. (2024). Corpus Linguist [GPT model]. https://chatgpt.com/g/g-pFqRCNeHu-corpus-linguist

OpenAI. (2025). ChatGPT (Version January 17, 2025) [Computer software]. https://chatgpt.com/

Coreference by Competition: Lessons from the CRAC Shared Task

Michal Novák

Coreference resolution research has long been dominated by English-centric evaluation, in part due to the fragmented nature of available resources across languages. To help address this imbalance, my colleagues and I have been developing CorefUD, a growing multilingual collection of datasets annotated with coreference and anaphoric relations.

To promote multilingual evaluation and establish a unified evaluation framework, we initiated the CRAC Shared Task, a competition in which participants build and benchmark coreference resolution systems across languages.

In this talk, I will share our progress over the past four years, discuss the challenges we have encountered, and reflect on how our efforts have shaped the field. I will also touch on our recent adaptations to the Shared Task, designed to explore how well Large Language Models handle coreference resolution.

Live Credible Translation

Dominik Macháček

I will introduce my project "Live Credible Translation." Current automatic Simultaneous Speech Translation (SST) systems usually provide only one translation hypothesis, without any Quality Estimation (QE) score that would indicate how likely the translation is to be correct. I plan to propose SST QE methods for practical applications, such as SST of cross-lingual dialogues and human real-time post-editing of SST.

First, I will introduce my current work: (1) a state-of-the-art simultaneous translation system and (2) a corpus of cross-lingual dialogues. Then I will present my plans and ask for recommendations.