Corpora
Free multilingual corpora suitable for research and teaching in comparative discourse studies (excluding parallel corpora):
- Swiss-AL, a multilingual Swiss web corpus for Applied Linguistics. Swiss-AL addresses four major challenges: a) multilingual discourse (in the major Swiss national languages German, French, and Italian); b) representing the Swiss federal structure in the selection of corpus sources; c) following the perspective of applied discourse analysis; d) responding to the growing use of machine learning and deep learning algorithms with a linguistically annotated corpus that supports basic corpus-linguistic methods as well as the adaptation of such innovative NLP methods.
- PolMine, a political-science-oriented project that processes its data extensively in linguistic terms. The project uses tools from corpus linguistics, computational linguistics, and information science. It contains data from the German Bundestag, Germany’s regional parliaments (Landtage), the French Assemblée nationale, the Austrian Nationalrat, and the United Nations General Assembly.
- Korpora zur Fußballlinguistik, a multilingual text corpus on the topic of football (soccer), accessible via CQPweb. Includes German, English, Dutch, French, Italian, Spanish, Portuguese, Norwegian, Polish, Czech, Russian, Hungarian, and Greek.
- Wikipedia Corpora, articles and discussions up to 2019 in English, French, Hungarian, Norwegian, Spanish, Croatian, Italian, and Polish. Access via COSMAS II.
- GerBosAC (German-Bosnian Advertisements Corpus), a corpus of advertising and promotional packaging texts with rich manual annotation and multimodal resources. It contains 1289 advertisements in total, in Bosnian and German. Access via the in-house online tool AdveRtis (Advertisements Management, Search and Analysis System).
Free multilingual parallel corpora that may be suitable for research and teaching in comparative discourse studies:
- OPUS, a large text collection of parallel corpora with more than 3,800 language pairs.
- European Parliament Proceedings Parallel Corpus (EUROPARL), extracted from the proceedings of the European Parliament and used for machine translation. Includes versions in 21 European languages.
- An annotated corpus of argumentative microtexts, originally written in German and translated into English; access via GitHub.
- ParaSol – A Parallel Corpus of Slavic and other languages, currently available via SketchEngine. Includes Bulgarian, Belarusian, Czech, Croatian, Macedonian, Polish, Russian, Slovak, Slovenian, Upper Sorbian, Serbian, Ukrainian, Danish, German, Estonian, Modern Greek, English, Esperanto, Spanish, Finnish, French, Hungarian, Armenian, Italian, Latvian, Lithuanian, Dutch, Norwegian, Portuguese, Romanian, and Swedish.