Editors:

Inguna Skadiņa (Tilde, Riga, Latvia)
Robert Gaizauskas (Department of Computer Science, University of Sheffield, Sheffield, UK)
Bogdan Babych (School of Modern Languages & Cultures, University of Leeds, Leeds, UK)
Nikola Ljubešić (Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia)
Dan Tufiş (Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania)
Andrejs Vasiļjevs (Tilde, Riga, Latvia)

Describes a step-by-step method for collecting comparable corpora and processing them for use in machine translation
Demonstrates how data from comparable corpora can improve the quality of machine translation
Proposes novel methods for measuring the comparability of multilingual corpora
Describes algorithms and techniques for alignment and extraction of lexical and terminological data from comparable corpora in order to provide training and customization data for machine translation

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

3083 Accesses
7 Citations

Table of contents (8 chapters)

Front Matter
Pages i-vi
Introduction
- Inguna Skadiņa, Robert Gaizauskas, Andrejs Vasiļjevs, Monica Lestari Paramita
Pages 1-11
Cross-Language Comparability and Its Applications for MT
- Bogdan Babych, Fangzhong Su, Anthony Hartley, Ahmet Aker, Monica Lestari Paramita, Paul Clough et al.
Pages 13-53
Collecting Comparable Corpora
- Monica Lestari Paramita, Ahmet Aker, Paul Clough, Robert Gaizauskas, Nikos Glaros, Nikos Mastropavlos et al.
Pages 55-87
Extracting Data from Comparable Corpora
- Mārcis Pinnis, Nikola Ljubešić, Dan Ştefănescu, Inguna Skadiņa, Marko Tadić, Tatjana Gornostaja et al.
Pages 89-139
Mapping and Aligning Units from Comparable Corpora
- Ahmet Aker, Alexandru Ceaușu, Yang Feng, Robert Gaizauskas, Sabine Hunsicker, Radu Ion et al.
Pages 141-188
Training, Enhancing, Evaluating and Using MT Systems with Comparable Data
- Bogdan Babych, Yu Chen, Andreas Eisele, Sabine Hunsicker, Mārcis Pinnis, Inguna Skadiņa et al.
Pages 189-254
New Areas of Application of Comparable Corpora
- Reinhard Rapp, Vivian Xu, Michael Zock, Serge Sharoff, Richard Forsyth, Bogdan Babych et al.
Pages 255-290
Appendices
- Ahmet Aker, Radu Ion, Nikos Mastropavlos, Monica Paramita, Mārcis Pinnis, Dan Ştefănescu et al.
Pages 291-323

About this book

This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents a wealth of methods and open tools for building comparable corpora from the Web, evaluating comparability and extracting parallel data for use in machine translation. The book is divided into several sections, each covering a specific task such as building, processing or using comparable corpora, with a particular focus on under-resourced language pairs and domains.

The book is intended for anyone interested in data-driven machine translation for under-resourced languages and domains, especially for developers of machine translation systems, computational linguists and language workers. It offers a valuable resource for specialists and students in natural language processing, machine translation, corpus linguistics and computer-assisted translation, and promotes the broader use of comparable corpora in natural language processing and computational linguistics.

About the editors

Prof. Inguna Skadiņa has been working on language technologies for over 25 years. Her research interests are in machine translation, human-computer interaction, and language resources and tools for under-resourced languages. She has coordinated and participated in many national and international projects related to human language technologies, and has authored or co-authored more than 60 peer-reviewed research papers.

Bogdan Babych is an Associate Professor of Translation Studies at the University of Leeds, UK. He holds a PhD in machine translation and in Ukrainian linguistics. Dr. Babych was a coordinator of the EU FP7 Marie Curie project HyghTra, and received a Leverhulme Early Career Fellowship for his project Translation Strategies in Comparable Corpora. He previously worked as a computational linguist at L&H Speech Products, Belgium.

Robert Gaizauskas is a Professor of Computer Science and head of the Natural Language Processing group, Department of Computer Science, University of Sheffield, UK. His research interests are in computational semantics, information extraction, text summarization and machine translation. He holds a DPhil from the University of Sussex, UK (1992), and has published more than 150 papers in peer-reviewed journals and conference proceedings.

Nikola Ljubešić is an Assistant Professor at the Department of Information Science, University of Zagreb, Croatia, and a researcher at the "Jožef Stefan" Institute in Ljubljana, Slovenia. His main research interests are in language technologies for South Slavic languages, linguistic processing of non-standard texts, author profiling and social media analytics.

Prof. Dan Tufiș, director of RACAI and full member of the Romanian Academy, has been active in computational and corpus linguistics for more than 30 years. His expertise is in tagging, word alignment, multilingual WSD, SMT, QA in open domains, lexical ontologies, language resource annotation and encoding. He has authored or co-authored more than 250 peer-reviewed papers, book chapters and books.

Andrejs Vasiļjevs is a co-founder and chairman of the board of Tilde, a leading European language technology and localization company. His expertise is in terminology management, machine translation and human-computer interaction. He initiated and coordinated the ACCURAT project as well as several other international research and innovation projects. He holds a PhD in computer sciences from the University of Latvia and a Dr.h. from the Latvian Academy of Sciences.

Bibliographic Information

Book Title: Using Comparable Corpora for Under-Resourced Areas of Machine Translation
Editors: Inguna Skadiņa, Robert Gaizauskas, Bogdan Babych, Nikola Ljubešić, Dan Tufiş, Andrejs Vasiļjevs
Series Title: Theory and Applications of Natural Language Processing
DOI: https://doi.org/10.1007/978-3-319-99004-0
Publisher: Springer Cham
eBook Packages: Computer Science, Computer Science (R0)
Copyright Information: Springer Nature Switzerland AG 2019
Hardcover ISBN: 978-3-319-99003-3 (published 22 February 2019)
eBook ISBN: 978-3-319-99004-0 (published 06 February 2019)
Series ISSN: 2192-032X
Series E-ISSN: 2192-0338
Edition Number: 1
Number of Pages: VI, 323
Number of Illustrations: 24 b/w illustrations, 39 illustrations in colour
Topics: Natural Language Processing (NLP), Computational Linguistics, Data Mining and Knowledge Discovery
