Detailed course information


The art of recycling in linguistics: sharing and reusing corpus and experimental data


October 1, 2024

Workshop language: English
Activity organizer

Jérôme Jacquin


Dr. Jérôme Jacquin, Université de Lausanne

Prof. Sandrine Zufferey, Universität Bern


Emma Marsden, University of York

Peter Uhrig, FAU Erlangen-Nürnberg


Issued in 2016, the FAIR principles (i.e. findable, accessible, interoperable, reusable) have dramatically impacted the way scientific publications are managed and shared. Scholars are strongly encouraged to publish in Open Access. The application of such principles to scientific data is becoming more and more urgent, especially because they are now integrated into the guidelines issued by funding agencies (e.g. in the Data Management Plan required by the Swiss National Science Foundation). The general ambition of Open (Research) Data (ORD) is at least twofold: it is not only to contribute to guaranteeing the quality of research, by facilitating the possibility of verifying the reliability of previous analyses, but also to encourage the aggregation, sharing, and thus the reuse of research data generated with public funds. Applied to linguistics, these principles are easier to implement when the data are relatively massive, can be formatted in a tabular form, and when they do not present very complex or specific structural and confidentiality issues (e.g. multimodal corpora consisting of natural recordings in sensitive contexts). Several institutional initiatives also support Swiss researchers interested in this issue (notably CLARIN-CH, LiRi, LaRS & SWISSUbase_linguistics).

Regardless of the type of data analyzed, junior and senior researchers in linguistics repeatedly face similar challenges related to sustainable data management. Examples of such challenges are:

- The diversity/evolution of formats, encodings;
- The diversity/evolution of annotation schemes correlated with specific research objectives;
- The diversity/evolution of metadata required by specific language archives and depositories;
- Named entity recognition (NER) and more generally the management of confidentiality;
- Statistical replication involving the availability of open-source code (e.g. R scripts).

To address these challenges, two invited speakers who are actively engaged in initiatives applying the FAIR principles will discuss how they put them into practice in their own research.

Emma Marsden (University of York) is a specialist in research into second language acquisition and applied linguistics. She has been actively engaged in several projects supporting open research practices. Emma is a founder of the IRIS database, containing free searchable resources (instruments, materials, data, code, and postprints) for research related to languages, including the domains of learning, use, processing, education. The IRIS project and other open projects such as 'OASIS' (Open Accessible Summaries of Research in Language Studies) have been a catalyst for promoting reflections about open scholarship, such as the challenges and benefits of materials transparency for experimental research and reproducibility. Marsden, with Morgan-Short, used the OSF platform to run a large-scale multi-site pre-registered replication study. In her capacity as associate editor and then journal editor for Language Learning (2015-2022), she worked with colleagues to establish the article type Registered Reports at the journal. In her lecture, she will describe some of these projects and discuss lessons learnt and ways forward to promote an equitable and sustainable open research culture. 

Peter Uhrig, professor of digital linguistics with a focus on big data at FAU Erlangen-Nürnberg, is a cognitive and corpus linguist. He is developing infrastructures and procedures aimed at promoting data access and open datasets in linguistics, using data science methods and machine learning. His online projects include E-VIEW-alation, which allows large-scale analysis and exploration of collocations; the Erlangen Valency Patternbank, based on the Valency Dictionary of English by Herbst et al.; and the Project, an online interface that enables exploration of dependency structures in large corpora. He is currently very interested in the multimodal annotation of large corpora, some of which he makes available online. Peter is ICAME technical secretary and very active in the Distributed Little Red Hen Lab.

The doctoral school thus aims to raise doctoral students' awareness about the medium- and long-term management of their data. They will also be encouraged to present their doctoral projects and get feedback about the methodological aspects of their research.

The doctoral school is planned in collaboration with the Swissuniversities-funded projects "CHORD-talk-in-interaction - Data-sharing skills in corpus-based research on talk-in-interaction" and "Swiss-AL: Linguistic ORD Practices for Applied Sciences".


Université de Lausanne



Registration deadline: 23.09.2024