ABSTRACTS
Bootstrapping UMRs from UD for Scalable Multilingual Annotation
Federica Gamba
Uniform Meaning Representation (UMR) offers a cross-linguistically applicable framework for capturing sentence- and document-level semantics, but producing UMR annotations from scratch is a time-intensive process. This talk presents an approach for bootstrapping UMR graphs by leveraging Universal Dependencies (UD), a richly annotated multilingual syntactic resource covering a wide range of language families. It describes how structural correspondences between UD and UMR can be exploited to automatically derive partial UMR graphs from UD trees, providing annotators with an initial representation to refine rather than create from scratch. While UD is not inherently semantic, it encodes syntactic information that maps well onto UMR structures, allowing us to extract meaningful correspondences that simplify annotation. This method not only reduces annotation effort but also facilitates scalable UMR creation across typologically diverse languages, aligning with UMR’s cross-linguistic design goals.
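As a rough illustration of the kind of structural correspondence meant here, the following sketch turns a toy dependency tree into a partial, Penman-style graph. The relation table `DEPREL_TO_ROLE`, the variable naming scheme, and the `:UNRESOLVED` placeholder are assumptions made for this example only; they are not the mapping used in the talk, and a subject, for instance, does not always correspond to `:ARG0`.

```python
# Hypothetical UD-relation -> UMR-style role table (illustrative, not the talk's mapping).
DEPREL_TO_ROLE = {
    "nsubj": ":ARG0",
    "obj": ":ARG1",
    "iobj": ":ARG2",
    "amod": ":mod",
    "advmod": ":manner",
}


def ud_to_partial_graph(tokens):
    """tokens: list of (id, lemma, head, deprel); head 0 marks the root.

    Returns (root_id, triples), where each triple is (head_id, role, dependent_id).
    """
    root = next(tid for tid, _, head, _ in tokens if head == 0)
    triples = []
    for tid, _, head, deprel in tokens:
        if head == 0:
            continue
        # Relations without a known counterpart are left for the annotator.
        role = DEPREL_TO_ROLE.get(deprel, ":UNRESOLVED")
        triples.append((head, role, tid))
    return root, triples


def render(node, lemmas, triples, depth=1):
    """Render the graph rooted at `node` in a Penman-like bracket notation."""
    out = f"(s{node} / {lemmas[node]}"
    for head, role, dep in triples:
        if head == node:
            out += "\n" + "    " * depth + f"{role} " + render(dep, lemmas, triples, depth + 1)
    return out + ")"


if __name__ == "__main__":
    toy = [(1, "dog", 2, "nsubj"), (2, "chase", 0, "root"), (3, "cat", 2, "obj")]
    root, triples = ud_to_partial_graph(toy)
    print(render(root, {tid: lemma for tid, lemma, _, _ in toy}, triples))
```

The output is deliberately partial: anything the table cannot map is flagged rather than guessed, which matches the idea of giving annotators a draft to refine.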
Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
Abishek Stephen
Lexical data collected in language documentation often contain transcription errors and borrowings that can mislead linguistic analysis. We present unsupervised methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using phoneme-level and syllable-level n-gram language models, our approach identifies potential transcription errors and borrowings. We evaluate our methods against a hand-annotated gold standard, assessing the ranking of phonotactic outliers with precision and recall at K. The ranking approach provides field linguists with a method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
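The general idea can be sketched as follows: a phoneme-level bigram model is estimated from the wordlist itself, each entry is scored by its average negative log-probability, and the resulting ranking is evaluated with precision and recall at K. The add-one smoothing, the function names, and the toy wordlist (invented strings, not Kokborok data) are assumptions of this sketch, not details taken from the study.

```python
import math
from collections import Counter


def train_bigram(words):
    """Count phoneme bigrams; each word is a sequence of phoneme symbols."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for w in words:
        seq = ["<s>"] + list(w) + ["</s>"]
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(vocab)


def avg_neg_logprob(word, model):
    """Average negative log-probability per bigram, with add-one smoothing."""
    bigrams, contexts, vocab_size = model
    seq = ["<s>"] + list(word) + ["</s>"]
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab_size)
        total -= math.log(p)
    return total / (len(seq) - 1)


def rank_outliers(words, model):
    """Most surprising (least phonotactically typical) entries first."""
    return sorted(words, key=lambda w: avg_neg_logprob(w, model), reverse=True)


def precision_recall_at_k(ranked, gold, k):
    """Share of the top-k that are gold outliers, and share of gold outliers found."""
    hits = sum(1 for w in ranked[:k] if w in gold)
    return hits / k, hits / len(gold)


if __name__ == "__main__":
    wordlist = ["tama", "mata", "tata", "mama", "kata", "xqz"]  # invented toy data
    model = train_bigram(wordlist)
    ranked = rank_outliers(wordlist, model)
    print(ranked[:2], precision_recall_at_k(ranked, {"xqz"}, k=2))
```

Normalising by length keeps long words from being flagged merely for being long; a syllable-level variant would only change how each word is split into symbols.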
Cross-linguistic statistical patterns in morphologically annotated corpora
Vojtěch John
Research on morphological diversity in typology and contrastive linguistics has traditionally focused on discrete, predominantly inflectional features. However, corpus-based approaches can provide complementary insights into the quantitative and dynamic aspects of morphological systems. While multiple languages have both morphological resources and large parallel corpora, sizeable corpora with detailed morphological annotation – including morphological segmentation and morpheme classification – remain very scarce. As part of a broader effort to address this gap, we present our current work on the detailed automatic annotation of part of the multiparallel corpus Europarl, comprising over 10 million tokens in each of six languages: Czech, English, French, German, Hungarian, and Slovak. The presentation reports preliminary results on quantitative morphological features extracted from these data and their potential to inform further cross-linguistic research. In particular, we discuss observed cross-linguistic regularities in morpheme frequency distributions, relationships among morpheme classes, and their possible connection to word formation strategies.
When Data Meet Tools: Using the Monitor Corpus for the Analysis of Language Development
Klára Pivoňková
The aim of this paper is to introduce an infrastructure developed within the HiČKoK project to enable full-fledged corpus-based diachronic research of Czech. The individual sections of the paper present the components of this infrastructure, which links well-balanced, representative and annotated data with tailor-made tools for diachronic research. The forthcoming monitor corpus, covering the entire period of written Czech, is briefly introduced along with its composition and annotation strategies. In the following sections, the potential of the application and its four modules (simple query, comparison, time-based associations, and diachronic collocations) is demonstrated through mini case studies. Combining large-scale data (as representative as possible) with a tool that enhances standard corpus functionalities, enriches them with a diachronic perspective, and enables result visualization makes diachronic research on language change more accessible and comprehensive.
Assembling a Large Diachronic Corpus of Czech Books (1850–1950)
Michal Olbrich
We assembled a uniquely large diachronic corpus of written Czech based on books published between 1850 and 1950. The corpus was compiled from materials provided by the Moravská zemská knihovna (MZK), which offers extensive digitized collections through its Kramerius digital library. As source data, we used OCR texts derived from scanned book pages made available by MZK.
To collect the data at scale, we developed an automated acquisition pipeline in the form of a web crawler interfacing with the standard Kramerius API. Using this tool, we initially obtained 7,584 books from the target period. The resulting OCR texts, however, exhibited substantial noise typical of historical print, including non-textual pages, segmentation errors, and severely misrecognized characters.
The raw texts therefore underwent several stages of post-processing. We first applied a set of rule-based filters to address recurring and easily detectable issues. On this pre-cleaned data, we then employed a neural-network-based OCR correction model to further reduce systematic recognition errors and improve overall text quality.
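A minimal sketch of what such rule-based pre-cleaning might look like is given below. The specific rules (a minimum page length, a minimum share of alphabetic characters, rejoining words hyphenated across line breaks) and all thresholds are assumptions chosen for illustration; the abstract does not specify the filters actually used.

```python
import re


def letter_ratio(text):
    """Share of alphabetic characters among all non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalpha() for c in chars) / len(chars) if chars else 0.0


def keep_page(text, min_chars=200, min_letter_ratio=0.7):
    """Drop pages that are too short or mostly non-letters (tables, plates, noise).

    Both thresholds are hypothetical and would need tuning on real data.
    """
    stripped = text.strip()
    return len(stripped) >= min_chars and letter_ratio(stripped) >= min_letter_ratio


def clean_page(text):
    """Apply simple, easily detectable repairs to a page of OCR text."""
    # Rejoin words hyphenated across a line break: "kni-\nha" -> "kniha".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    pages = ["Toto je obyčejná strán-\nka   textu.", "| | . . 12 34 --"]
    kept = [clean_page(p) for p in pages if keep_page(p, min_chars=10)]
    print(kept)
```

Filters of this kind only remove the most obvious noise; residual character-level errors are what a learned correction model would then target.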
After filtering, the corpus contains over 323 million words. Apart from the first decade (the 1850s), which includes around 4 million words, every later decade exceeds 10 million words. This scale allows for reliable statistical analyses and makes it feasible to apply more data-intensive methods, including the training of contextual embeddings.
Exploring Register Diversity in Czech Internet Language
Jan Henyš
The research presented here investigates register variation in Czech internet texts through multidimensional analysis (MDA), building on the seminal work of Douglas Biber and Jesse Egbert (2016) and on recent research on Czech (Cvrček et al. 2018a, 2018b, 2020). MDA provides an empirically grounded approach to textual variability by identifying patterns of co-occurring linguistic features and modeling them as latent dimensions of variation. The study applies this method to a diverse corpus of Czech online texts, a language domain characterized by rapid change, hybridity, and the convergence of written and spoken modes.
The research proceeds in two analytical stages. First, an existing MDA model developed for Czech written and spoken registers (Koditex corpus) is applied to internet texts in order to situate them within a broader register continuum and enable direct comparison with traditional registers. Second, a separate MDA is conducted on the same dataset using a revised set of features that includes phenomena specific to online communication. This allows for the extraction and interpretation of dimensions tailored to the distinctive character of internet discourse.
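The first stage, applying an existing model to new texts, can be illustrated with a Biber-style dimension score: feature rates are standardised against the reference corpus, and the standardised values of features with salient loadings are summed, with the sign taken from the loading. Whether the Koditex-based model computes scores in exactly this way is an assumption of this sketch, and the feature names, loadings, reference statistics, and cutoff below are invented for illustration.

```python
def dimension_score(rates, ref_mean, ref_sd, loadings, cutoff=0.3):
    """Score one text on one dimension of an existing MDA model.

    rates:    feature -> normalised frequency in the new text
    ref_mean: feature -> mean rate in the reference corpus
    ref_sd:   feature -> standard deviation in the reference corpus
    loadings: feature -> loading on the dimension (from the existing model)
    cutoff:   loadings smaller than this in absolute value are ignored
    """
    score = 0.0
    for feature, loading in loadings.items():
        if abs(loading) < cutoff:
            continue
        z = (rates[feature] - ref_mean[feature]) / ref_sd[feature]
        score += z if loading > 0 else -z
    return score


if __name__ == "__main__":
    # All values below are hypothetical.
    loadings = {"pron_2p": 0.8, "noun": -0.6, "passive": 0.1}
    ref_mean = {"pron_2p": 20.0, "noun": 250.0, "passive": 5.0}
    ref_sd = {"pron_2p": 5.0, "noun": 25.0, "passive": 2.0}
    forum_post = {"pron_2p": 30.0, "noun": 200.0, "passive": 5.0}
    print(dimension_score(forum_post, ref_mean, ref_sd, loadings))
```

Because the standardisation uses the reference corpus rather than the internet texts themselves, scores obtained this way are directly comparable with those of the traditional registers.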
Enhancing Corpus-Assisted Discourse Studies with Sentiment Analysis
Konstantin Sulimenko
Sentiment analysis, or opinion mining, is a widely used method for automatic evaluation of the tone of textual data. It performs well for texts with explicitly stated opinions (e.g. reviews, customer feedback), but the results become notably worse for complex and implicit forms of writing, such as news articles. Corpus-assisted discourse studies (CADS), on the other hand, specifically focus on news media as their primary object of analysis, yet most methods in this field still rely heavily on the qualitative judgement of the researcher.
In my PhD thesis, I aim to find ways in which sentiment analysis might enhance CADS to improve the quality of computational social science research. Using a combination of large language models, psycholinguistic surveys, linguistic theory, keyword analysis and other statistical methods, I will explore practical and methodological opportunities and risks of using sentiment analysis as a form of automatic framing analysis for news content.
In the presentation, I will give an overview of my methodological vision for the project (which is still very much a work in progress) and discuss the possibilities of this “useful synergy”.
Elaborative Discourse Markers in Georgian Conditional Constructions
Khatia Buskivadze
The journal paper investigates elaborative discourse markers (EDMs) in the Georgian language from the perspective of Construction Grammar (CxG), focusing specifically on their linguistic features and discourse/pragmatic functions within conditional sentences in the chat show genre. The study seeks answers to the following research questions: how EDMs operate in conditional constructs and how they contribute to discourse organization and speakers’ interaction. Construction Grammar is a powerful framework for studying EDMs, as it incorporates cognitive (categorization, schematization, etc.), interactional and social dimensions of language use. In addition, Construction Discourse (CxD) (Fried & Östman, 2004) provides a way to integrate discourse and pragmatic insights into grammatical theory by using “frame constructions” and “discourse constructions” as flexible tools to illustrate a) how semantic meaning is encoded in constructions and syntactic patterns and b) the discourse and pragmatic functions that constructions hold (Östman, 2005).
The study identified the following linguistic features of EDMs in Georgian conditional sentences: 1) contextual sensitivity: EDMs adjust their pragmatic functions when used in an if-clause; 2) syntactic position: clause-initial (if+EDM+clause), post-focal (if+focus+EDM+verb), or clause-final (protasis+EDM+apodosis); 3) prosodic or connective autonomy: EDMs are often set off by a comma or pause; 4) multifunctionality: EDMs serve different discourse and pragmatic functions such as clarification, reformulation, concluding, summarizing, and interactional functions (softeners, register and formality).
This research aims to fill a gap in the literature, given the lack of prior research on discourse markers in Georgian (Buskivadze 2021, 2022) as well as the genre-specific focus on semi-formal spoken discourse. Even though the research is limited by data availability and a lack of quantitative analysis, the findings of the study offer both theoretical implications for pragmatic and construction grammar theories and practical insights into the structure and coherence of Georgian media discourse.
Lexicalized valency alternations occurring with prefixation in Czech
Hana Hledíková
In the valency description of Czech, the so-called lexicalized alternations (Kettnerová & Lopatková 2014) are a well-described phenomenon in which a single verb has multiple different mappings between its valency positions and the semantic roles involved in the situation denoted by the verb, such as in the Czech analogue to the English spray/load alternation (Beavers 2017) shown in (1). The example is annotated both for semantic roles and valency positions based on Vallex (Lopatková 2022).
(1) a. někdo naložil seno na vůz
Agent Theme Goal
ACT PAT DIR3
‘somebody loaded the hay onto the carriage’
Agent Goal Theme
ACT PAT EFF
‘somebody loaded the carriage with hay’
Such alternations can also accompany a morphological process in which a new verb is created from the base verb (cf. e.g. Haspelmath & Sims 2010), such as prefixation – cf. (2):
(2) a. někdo leje vodu na květiny
Agent Theme Goal
ACT PAT DIR3
‘somebody pours water onto the flowers’
Agent Goal Theme
ACT PAT MEANS
‘somebody on-pours the flowers with water’
In such cases, the valency alternation does not happen in a single verb, but rather between two derivationally related verbs. In addition to changing the mapping between the valency positions and the semantic roles, the addition of a prefix can also lead to adding a new semantic role – cf. (3).
(3) a. slunce zářilo
Emitter
ACT
‘the sun was shining’
Emitter Goal
ACT PAT
‘the sun illuminated the building’
We use a sample of pairs of Czech prefixed verbs and their base verbs in combination with SynSemClass (Urešová 2023), a syntactic-semantic lexicon of verbs which includes both valency annotation (based on Vallex) and semantic role annotation similar to the one in FrameNet (Baker & Sato 2003), to investigate the types of lexicalized alternations that occur with prefixation in Czech.
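The comparison underlying this investigation can be illustrated with a small sketch that represents each frame as a mapping from semantic roles to valency positions and reports which roles are remapped, added, or dropped between a base verb and its prefixed counterpart. The frames are taken from examples (1) and (3) above; the dictionary representation and the function name `compare_frames` are assumptions of this sketch and do not reflect the SynSemClass data format.

```python
# Frames from examples (1) and (3): semantic role -> valency position.
FRAME_1A = {"Agent": "ACT", "Theme": "PAT", "Goal": "DIR3"}
FRAME_1B = {"Agent": "ACT", "Goal": "PAT", "Theme": "EFF"}
FRAME_3A = {"Emitter": "ACT"}
FRAME_3B = {"Emitter": "ACT", "Goal": "PAT"}


def compare_frames(base, derived):
    """Describe how the role-to-position mapping differs between two frames."""
    remapped = {
        role: (base[role], derived[role])
        for role in base
        if role in derived and base[role] != derived[role]
    }
    added = {role: pos for role, pos in derived.items() if role not in base}
    dropped = {role: pos for role, pos in base.items() if role not in derived}
    return {"remapped": remapped, "added": added, "dropped": dropped}


if __name__ == "__main__":
    print(compare_frames(FRAME_1A, FRAME_1B))  # roles keep their identity, positions change
    print(compare_frames(FRAME_3A, FRAME_3B))  # a new role appears with the prefix
```

Applied to (1), the comparison shows Theme and Goal exchanging positions while Agent stays in ACT; applied to (3), it shows the added Goal in PAT.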
Arrival in Czech and English: A Holistic Spatial Semantics Analysis
Martin Sedláček
Traditional Motion Event (ME) literature, largely following Talmy, has spent decades focusing on Manner-prominent verbs (run, crawl, utíkat). By contrast, “Manner-less” or venitive verbs like English come and Czech přijít are often treated as mere deictic markers. In this talk, I want to discuss what makes these verbs cognitively complex. Czech, in particular, remains under-researched in this typological niche. My goal is to move beyond the lexical semantics of the individual verb and look at the “holistic” construction.
I am employing Holistic Spatial Semantics (HSS) (Zlatev et al., 2021; Vesnina, 2024) as a more nuanced analytical tool on a sample of 700 random hits (350 from the BNC and 350 from syn2020). Key HSS distinctions I will be discussing:
My preliminary findings suggest that the verb itself is not the sole carrier of the “Path” or “Direction” meaning. The talk will address this constructional meaning, how it shifts the verb’s meaning, and what the implications for typology are. Specifically, English and Czech are often lumped into the same category of Satellite-framed languages despite vast differences ranging from morphology to aspect (e.g. the role of the Czech prefix při-).