Data augmentation via dependency tree morphing for low-resource languages

Şahin GG, Steedman M (2018)

Publication Type: Conference contribution

Publication year: 2018

Publisher: Association for Computational Linguistics

Page range: 5004-5009

Conference Proceedings Title: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018

Event location: Brussels, BEL

ISBN: 9781948087841

Abstract

Neural NLP systems achieve high scores in the presence of sizable training datasets. The lack of such datasets leads to poor system performance in the case of low-resource languages. We present two simple text augmentation techniques using dependency trees, inspired by image processing. We “crop” sentences by removing dependency links, and we “rotate” sentences by moving the tree fragments around the root. We apply these techniques to augment the training sets of low-resource languages in the Universal Dependencies project. We implement a character-level sequence tagging model and evaluate the augmented datasets on the part-of-speech tagging task. We show that crop and rotate provide improvements over the models trained with non-augmented data for the majority of the languages, especially for languages with rich case marking systems.
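To make the two operations in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a simplified token record (index, form, head index, dependency label) loosely modeled on Universal Dependencies annotations. The names `Token`, `subtree`, `crop`, `rotate`, and `SENT`, as well as the example sentence and its labels, are hypothetical and chosen only for illustration. Here "crop" keeps the root and the fragments attached to it by one chosen label, and "rotate" reorders the fragments attached to the root.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    form: str    # surface word
    head: int    # index of the head token, 0 for the root
    deprel: str  # dependency label on the link to the head


def subtree(tokens, start):
    """Return the indices of `start` and all of its descendants."""
    children = defaultdict(list)
    for t in tokens:
        children[t.head].append(t.idx)
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children[node])
    return seen


def root_of(tokens):
    """Return the token whose head is 0."""
    return next(t for t in tokens if t.head == 0)


def crop(tokens, keep_deprel):
    """Keep the root plus the fragments linked to it by `keep_deprel`."""
    root = root_of(tokens)
    keep = {root.idx}
    for t in tokens:
        if t.head == root.idx and t.deprel == keep_deprel:
            keep |= subtree(tokens, t.idx)
    # Original word order is preserved among the kept tokens.
    return [t.form for t in tokens if t.idx in keep]


def rotate(tokens, order):
    """Reorder the root and its fragments; `order` lists labels, with 'root' for the root word."""
    root = root_of(tokens)
    fragments = defaultdict(list)
    fragments["root"] = [root.form]
    for t in tokens:
        if t.head == root.idx:
            ids = subtree(tokens, t.idx)
            # Fragments sharing a label are concatenated in sentence order.
            fragments[t.deprel].extend(u.form for u in tokens if u.idx in ids)
    return [w for label in order for w in fragments[label]]


# Hypothetical example: "She wrote a book yesterday"
SENT = [
    Token(1, "She", 2, "nsubj"),
    Token(2, "wrote", 0, "root"),
    Token(3, "a", 4, "det"),
    Token(4, "book", 2, "obj"),
    Token(5, "yesterday", 2, "obl"),
]

print(crop(SENT, "obj"))                              # ['wrote', 'a', 'book']
print(rotate(SENT, ["obj", "root", "nsubj", "obl"]))  # ['a', 'book', 'wrote', 'She', 'yesterday']
```

Both functions return plain word lists. A real augmentation pipeline would also have to re-index the heads and carry the part-of-speech tags along with the words, which this sketch leaves out.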

Involved external institutions

Technische Universität Darmstadt, Germany (DE)

University of Edinburgh, United Kingdom (GB)

How to cite

APA:

Şahin, G.G., & Steedman, M. (2018). Data augmentation via dependency tree morphing for low-resource languages. In Ellen Riloff, David Chiang, Julia Hockenmaier, Jun'ichi Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 (pp. 5004-5009). Brussels, BEL: Association for Computational Linguistics.

MLA:

Şahin, Gözde Gül, and Mark Steedman. "Data augmentation via dependency tree morphing for low-resource languages." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, Brussels, BEL. Ed. Ellen Riloff, David Chiang, Julia Hockenmaier, Jun'ichi Tsujii, Association for Computational Linguistics, 2018. 5004-5009.
