Slovak Language Resources

January 26, 2021 Daniel Hladek

POS

Multext East: George Orwell's novel 1984, annotated in 15 European languages

NER

  • Learning multilingual named entity recognition from Wikipedia - WikiNER?
  • Cross-lingual Name Tagging and Linking for 282 Languages - NER annotation of Wikipedia, including the Slovak edition, based on the English one
    • https://drive.google.com/drive/folders/1bkK6ly_awxe9IgAKL16VVvCtjcYcDSw8
    • https://elisa-ie.github.io/wikiann/

Parsing-POS

Slovak Dependency Treebank

https://github.com/UniversalDependencies/UD_Slovak-SNK
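Universal Dependencies treebanks such as UD_Slovak-SNK are distributed as CoNLL-U files: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. The following is a minimal sketch of reading that format with the Python standard library only. The function name `read_conllu` and the sample sentence are illustrative assumptions (the sample is hand-made, not taken from the treebank), and multiword-token and empty-node lines are passed through without special handling.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in file order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


# Hand-made sample (an assumption for illustration, not treebank data).
_ROWS = [
    ["1", "Peter", "Peter", "PROPN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "číta", "čítať", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "knihu", "kniha", "NOUN", "_", "_", "2", "obj", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# text = Peter číta knihu.\n" + "\n".join("\t".join(r) for r in _ROWS) + "\n"

if __name__ == "__main__":
    for tokens in read_conllu(SAMPLE):
        print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
```

For real work a dedicated CoNLL-U library is probably a better choice; the sketch only shows how the columns map to POS tags (`upos`) and dependency relations (`head`, `deprel`).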

Artificial Treebank with Ellipsis

Wordnet

Slovak Word Net

Parallel Corpus

European Parliament

Czech-Slovak Parallel Corpus

English-Slovak Parallel Corpus

Multext East

Sentiment

Twitter sentiment for 15 European languages

Web

  • Aranea
  • SkTenTen: automatically POS-annotated, accessible through a web interface
  • CommonCrawl: does it also contain Slovak data?
  • Oscar: classification and deduplication of CommonCrawl data, also for Slovak (4.5 GB deduplicated, 665M words deduplicated)

Wikipedia

Wikipedia in JSON Elasticsearch Bulk format
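The Elasticsearch Bulk format is newline-delimited JSON in which each action line (for example `{"index": {...}}`) is followed by a line holding the document itself. The sketch below shows one way to read such a dump without an Elasticsearch server. The function name `iter_bulk_documents`, the sample lines, and the field names `title` and `text` are assumptions for illustration; the actual fields in the Wikipedia dump may differ.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_bulk_documents(
    lines: Iterable[str],
) -> Iterator[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Yield (action metadata, document) pairs from Bulk-format lines."""
    pending_meta = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        if pending_meta is None:
            # Action line: a single key naming the operation.
            operation, meta = next(iter(record.items()))
            if operation == "delete":
                # Delete actions are not followed by a document line.
                continue
            pending_meta = meta
        else:
            # Document line that belongs to the preceding action.
            yield pending_meta, record
            pending_meta = None


# Hypothetical sample; field names are assumptions, not the real dump schema.
SAMPLE_LINES = [
    '{"index": {"_id": "1"}}',
    '{"title": "Bratislava", "text": "Bratislava je hlavné mesto Slovenska."}',
    '{"index": {"_id": "2"}}',
    '{"title": "Košice", "text": "Košice sú mesto na východe Slovenska."}',
]

if __name__ == "__main__":
    for meta, doc in iter_bulk_documents(SAMPLE_LINES):
        print(meta["_id"], doc["title"])
```

Because the function accepts any iterable of lines, an open file handle (possibly wrapped with `gzip.open(path, "rt", encoding="utf-8")` for compressed dumps) can be passed in directly, so the file is streamed rather than loaded into memory.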

Word Embedding

Resource Databases

https://github.com/slovak-nlp/resources

https://www.clarin.eu/portal

https://www.clarin.eu/resource-families/manually-annotated-corpora

http://www.meta-share.org/

https://korpus.sk/res.html

Slovak Stemming https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Slovak_Stemmer_Analysis

Tools