Detailed course information

Title

The art of recycling in linguistics: sharing and reusing corpus and experimental data

Dates

October 1, 2024

Workshop language: English

Activity coordinator

Jérôme Jacquin

Organizer(s)

Dr. Jérôme Jacquin, Université de Lausanne

Prof. Sandrine Zufferey, Universität Bern

Speakers

Emma Marsden, University of York

Peter Uhrig, FAU Erlangen-Nürnberg

Description

The doctoral school aims to raise doctoral students' awareness of the medium- and long-term management of their data.

Issued in 2016, the FAIR principles (i.e. findable, accessible, interoperable, reusable) have dramatically impacted the way scientific publications are managed and shared. Scholars are strongly encouraged to publish in Open Access. Applying these principles to scientific data is becoming increasingly urgent, especially because they are now integrated into the guidelines issued by funding agencies (e.g. in the Data Management Plan required by the Swiss National Science Foundation). The general ambition of Open (Research) Data (ORD) is at least twofold: to help guarantee the quality of research by making it easier to verify the reliability of previous analyses, and to encourage the aggregation, sharing, and thus the reuse of research data generated with public funds. Applied to linguistics, these principles are easier to implement when the data are relatively massive, can be formatted in tabular form, and do not present very complex or specific structural and confidentiality issues (as do, e.g., multimodal corpora consisting of natural recordings in sensitive contexts). Several institutional initiatives also support Swiss researchers interested in this issue (notably CLARIN-CH, LiRi, LaRS & SWISSUbase_linguistics).

Regardless of the type of data analyzed, junior and senior researchers in linguistics repeatedly face similar challenges related to sustainable data management. Examples of such challenges are:

- The diversity/evolution of formats and encodings;
- The diversity/evolution of annotation schemes correlated with specific research objectives;
- The diversity/evolution of metadata required by specific language archives and depositories;
- Named entity recognition (NER) and, more generally, the management of confidentiality;
- Statistical replication involving the availability of open-source code (e.g. R scripts).

To tackle these challenges, two invited speakers who are actively engaged in initiatives for applying the FAIR principles will discuss how they apply them in their own research.

Emma Marsden (University of York) is a specialist in research into second language acquisition and applied linguistics. She has been actively engaged in several projects supporting open research practices. Emma is a founder of the IRIS database, containing free searchable resources (instruments, materials, data, code, and postprints) for research related to languages, including the domains of learning, use, processing, and education. The IRIS project and other open projects such as 'OASIS' (Open Accessible Summaries of Research in Language Studies) have been catalysts for reflection on open scholarship, such as the challenges and benefits of materials transparency for experimental research and reproducibility. Marsden, with Morgan-Short, used the OSF platform to run a large-scale multi-site pre-registered replication study. In her capacity as associate editor and then journal editor for Language Learning (2015-2022), she worked with colleagues to establish the article type Registered Reports at the journal. In her lecture, she will describe some of these projects and discuss lessons learnt and ways forward to promote an equitable and sustainable open research culture.

Peter Uhrig, professor of digital linguistics with a focus on big data at FAU Erlangen-Nürnberg, is a cognitive and corpus linguist. He develops infrastructures and procedures aimed at promoting data access and open datasets in linguistics, using data science methods and machine learning. His online projects include E-VIEW-alation, which allows large-scale analysis and exploration of collocations; the Erlangen Valency Patternbank, based on the Valency Dictionary of English by Herbst et al.; and the Treebank.info Project, an online interface that enables exploration of dependency structures in large corpora. He is currently very interested in the multimodal annotation of large corpora, some of which he makes available at https://multimodalcorpora.org/web/. Peter is ICAME technical secretary and very active in the Distributed Little Red Hen Lab.

 

In the morning, the invited speakers will give the following presentations:

1. What could open research practices do for replication research? (Emma Marsden)

2. Sharing and reusing corpus data – documentation, annotation, data formats, and software (Peter Uhrig)

In the afternoon, two successive workshops are planned:

1. Sharing and reusing corpus data – hands-on session (Peter Uhrig)

2. Initial steps towards enhancing our culture of replication research (Emma Marsden)

 

Description of Emma Marsden's workshop:

Replication research brings many benefits but also challenges. The workshop will consider some of these challenges and then give participants the opportunity to discuss and plan their own replication study. This could be planned as a close or partial replication, and could be a single-site or a multi-site endeavour. Key questions will include: (i) motivating the need for the replication; (ii) which variables and methods to hold constant, which to change, and why; (iii) practical steps in setting up and conducting the study; (iv) what to make open; and (v) challenges specific to multi-site endeavours. Based on the interests they state when they register, participants will be invited to choose one study from a short list of articles that will be provided to them in August. In preparation for the workshop, they will read this article and give some preliminary consideration to key issues that might arise when trying to replicate the study.

 

Description of Peter Uhrig's workshop:

Data collection is seen as a necessary evil in many PhD projects and thus does not receive much love. However, neglecting the data can have unwanted side effects, both for the PhD researcher and for anyone interested in replicating the research. In the morning talk, we will take a look at data collection, storage and annotation processes, including questions of encodings, formats, and relevant software. We will discuss simple yet effective approaches to producing reusable data. In the afternoon workshop, we will work hands-on with real-world data. Participants are encouraged to submit sample data and related questions before the workshop, so that a selection of these can be addressed. 

Location

[ONLINE]

Information
Places

16

Registration deadline: September 23, 2024