About PaGeS


PaGeS, the German/Spanish parallel corpus, is a bilingual corpus composed of two major parts: the core corpus and the supplements.

The core corpus comprises original texts in German and Spanish with their respective translations, plus a small share (around 8%) of German and Spanish texts translated from a third language. It includes 169 works: fiction (novels and short stories), which makes up around 80%, and non-fiction (essays and popular science texts). Each selected work is represented by samples rather than by its full text, which allows for a better cross-section of texts.

This part of PaGeS (see the statistics below) contains some 36 million tokens and 1,055,685 bisegments, i.e. pairs of aligned text chunks (sentences or subsentential units).

To guarantee overall quality, the texts have been manually verified at different levels, and the automatic alignment of the bisegments has been fully reviewed. For each occurrence, the original source is provided, including the author, title, year of first publication and, if applicable, the edition used and the part or chapter of the work to which the occurrence belongs. The complete bibliographic data of the works included in PaGeS can be found here.
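As an illustration only, a bisegment and its source information could be modeled along the following lines. This is a minimal sketch: the class names, field names and the sample values are hypothetical and do not reflect the actual PaGeS data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Bibliographic details of an occurrence (hypothetical field names)."""
    author: str
    title: str
    first_published: int
    edition: Optional[str] = None  # given only if applicable
    part: Optional[str] = None     # part or chapter within the work


@dataclass(frozen=True)
class Bisegment:
    """A pair of aligned text chunks plus its source (hypothetical layout)."""
    segment_id: str  # stands in for the ID number shown with each occurrence
    german: str
    spanish: str
    source: Source


# Invented sample data, not taken from the corpus.
example = Bisegment(
    segment_id="demo-0001",
    german="Das ist ein Beispiel.",
    spanish="Esto es un ejemplo.",
    source=Source(author="Example Author", title="Example Title", first_published=2000),
)
print(example.source.title, "|", example.german, "|", example.spanish)
```

Keeping the source record separate from the text pair means that many bisegments from the same work can share one bibliographic entry.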

So far, the supplements include Europarl v7, a corpus of the proceedings of the European Parliament from 1996 to 2011, totaling more than 70 million words. Segments longer than 80 words (in Spanish and/or German) were excluded. New collections of bilingual texts of diverse origin are expected to be added in the near future.
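The length filter applied to the Europarl segments can be sketched as follows. This is an assumption-laden illustration, not the project's actual procedure: the function names are hypothetical, and words are counted by splitting on whitespace, which may differ from the word count used for PaGeS.

```python
def word_count(segment: str) -> int:
    # Assumption: a "word" is any whitespace-delimited string.
    return len(segment.split())


def keep_pair(german: str, spanish: str, max_words: int = 80) -> bool:
    """Keep a pair only if neither side is longer than max_words."""
    return word_count(german) <= max_words and word_count(spanish) <= max_words


# Invented sample pairs: the second has an 81-word German side.
pairs = [
    ("Guten Morgen.", "Buenos días."),
    ("Wort " * 81, "Palabra."),
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept))
```

A pair is dropped as soon as one of its two sides exceeds the limit, which matches the "Spanish and/or German" wording above.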

Although PaGeS was initially created to provide a broad source of data for contrastive linguistic research, its excellent reception among very diverse users has motivated us to strive for greater interoperability and standardization, so that PaGeS becomes a multifunctional resource able to meet our users' different needs. We aim to build a representative language resource for the language pair German/Spanish that can be used for multiple purposes, such as general research in contrastive linguistics, linguistic typology, translation studies and bilingual lexicography, as well as the supply of training data for machine translation systems. PaGeS has also proven very useful to, and is widely used by, translators and learners of German or Spanish as a foreign language at intermediate and advanced levels, who consult it for a multitude of human-made translation suggestions presented within examples of real language use.

For more detailed information about PaGeS, see the publications webpage.

Despite our best efforts, some mistakes have undoubtedly slipped through. If you come across any, please let us know by email at corpuspages@usc.es, indicating the ID number given in the source information of the occurrence where the mistake was found.

Notice:

If you use PaGeS in your work, please acknowledge it and let us know at corpuspages@usc.es. In this way you contribute to the sustainability of the project.

PaGeS Statistics

Core Corpus

LANGUAGE                            CHARACTERS       WORDS      TOKENS    TYPES  BISEGMENTS  WORKS
German Original                     34,020,927   6,280,994   7,402,194  188,540     461,768     81
Spanish Translation                 32,520,223   6,781,481   7,823,016  109,921
Spanish Original                    33,403,052   7,010,327   8,026,021  119,501     442,623     70
German Translation                  37,212,971   6,924,157   8,114,878  162,972
German Translation <3rd language    11,247,605   2,084,860   2,464,909   74,244     151,294     18
Spanish Translation <3rd language   10,337,022   2,149,204   2,483,199   57,389
Total                              158,741,830  31,267,205  36,314,217            1,055,685    169

Supplements: Europarl v7

LANGUAGE   CHARACTERS       WORDS      TOKENS  BISEGMENTS
German    219,099,293  35,222,373  39,726,336   1,586,374
Spanish   205,008,875  39,664,923  43,662,223
Total     424,108,168  74,887,296  83,388,559   1,586,374


(Release: 01/05/2020)

                                                              
PaGeS Vers. 2.0
Last updated: 20.05.2020
ISSN 2605-5228 ©SpatiAlEs
Creative Commons License
University of Santiago de Compostela
This project is funded by the State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (FFI2017-85938-R) and by the Department of Economy and Industry of the Galician Government (2017-PG023).