Using character-level models for efficient abbreviation and long-form detection

Zilio L, Qian S, Kanojia D, Orăsan C (2024)


Publication Language: English

Publication Type: Conference contribution

Publication year: 2024

Publisher: European Language Resources Association (ELRA)

Pages Range: 3028-3037

Conference Proceedings Title: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings

Event location: Torino, Italy (hybrid)

ISBN: 9782493814104

URI: https://aclanthology.org/2024.lrec-main.270

Abstract

Abbreviations and their associated long forms are important textual elements that are present in almost every scientific communication, and having information about these forms can help improve several NLP tasks. In this paper, our aim is to fine-tune language models for automatically identifying abbreviations and long forms. We used existing datasets which are annotated with abbreviations and long forms to train and test several language models, including transformer models, character-level language models, stacking of different embeddings, and ensemble methods. Our experiments showed that it was possible to achieve state-of-the-art results by stacking RoBERTa embeddings with domain-specific embeddings. However, the analysis of our first run showed that one of the datasets had issues in the BIO annotation, which led us to propose a revised dataset. After re-training selected models on the revised dataset, results show that character-level models achieve comparable results, especially when detecting abbreviations, but both RoBERTa-large and the stacking of embeddings presented better results on biomedical data. When tested on a different subdomain (segments extracted from computer science texts), an ensemble method proved to yield the best results for the detection of long forms, and a character-level model had the best performance in detecting abbreviations.
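The BIO annotation mentioned in the abstract labels each token as the Beginning of a span, Inside a span, or Outside any span. As a rough illustration (the label names ABBR and LONG are hypothetical, not necessarily the datasets' actual tag set), a long form and its abbreviation might be tagged as follows:

```python
# Minimal sketch of BIO tagging for abbreviation / long-form spans.
# Label names (ABBR, LONG) are illustrative assumptions, not the
# datasets' actual annotation vocabulary.

def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans into per-token BIO tags.

    `start` is inclusive, `end` is exclusive, as in Python slicing.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = ["Natural", "Language", "Processing", "(", "NLP", ")", "is", "fun"]
spans = [(0, 3, "LONG"), (4, 5, "ABBR")]
print(spans_to_bio(tokens, spans))
# ['B-LONG', 'I-LONG', 'I-LONG', 'O', 'B-ABBR', 'O', 'O', 'O']
```

Inconsistencies in such tags (e.g. an `I-` tag with no preceding `B-` of the same label) are the kind of annotation issue that can motivate a revised dataset.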

How to cite

APA:

Zilio, L., Qian, S., Kanojia, D., & Orăsan, C. (2024). Using character-level models for efficient abbreviation and long-form detection. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings (pp. 3028-3037). European Language Resources Association (ELRA).

MLA:

Zilio, Leonardo, et al. "Using Character-Level Models for Efficient Abbreviation and Long-Form Detection." Proceedings of the Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, edited by Nicoletta Calzolari et al., European Language Resources Association (ELRA), 2024, pp. 3028-3037.
