Slovak language resources
POS
MULTEXT-East: George Orwell's novel 1984, annotated, in 15 European languages
NER
- Learning multilingual named entity recognition from Wikipedia (WikiNER?)
- Cross-lingual Name Tagging and Linking for 282 Languages: NER annotation of Wikipedia, including the Slovak edition, derived from the English one
- https://drive.google.com/drive/folders/1bkK6ly_awxe9IgAKL16VVvCtjcYcDSw8
- https://elisa-ie.github.io/wikiann/
Parsing-POS
https://github.com/UniversalDependencies/UD_Slovak-SNK
Artificial Treebank with Ellipsis
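Universal Dependencies treebanks such as UD_Slovak-SNK are distributed in the CoNLL-U format, where each token line has ten tab-separated fields. The following is a minimal sketch of reading such data with the Python standard library only. The sample sentence and its annotation are a made-up illustration, not taken from the treebank, and the function name `read_conllu` is hypothetical.

```python
from collections import Counter

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines):
    """Yield sentences as lists of token dicts from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # not a well-formed token line
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Illustrative sample only (hypothetical sentence, simplified annotation).
SAMPLE = (
    "# text = Mama varí obed.\n"
    "1\tMama\tmama\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tvarí\tvariť\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tobed\tobed\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
```

To read a real treebank file, the same function can be called on an open file handle, since it only iterates over lines.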
Wordnet
Parallel Corpus
Europarl (European Parliament proceedings)
English-Slovak Parallel Corpus
Sentiment
Twitter sentiment for 15 European languages
Web
- Aranea
- skTenTen: automatically POS-annotated, accessible through a web interface
- CommonCrawl: does it also contain Slovak data?
- OSCAR: classification and deduplication of CommonCrawl data, also available for Slovak (4.5 GB, 665M words after deduplication)
Wikipedia
Wikipedia in Elasticsearch Bulk JSON format
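The Elasticsearch Bulk format is newline-delimited JSON in which an action line (for example `{"index": {...}}`) is followed by a line holding the document itself. The sketch below shows one way to iterate over the documents using only the standard library. The field names `title` and `text` and the sample records are assumptions for illustration; the actual dump may use different fields.

```python
import json


def iter_bulk_documents(lines):
    """Yield (action, document) pairs from Elasticsearch Bulk NDJSON lines."""
    pending_action = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if pending_action is None:
            # Action line: a single-key object such as {"index": {...}}.
            action_name = next(iter(obj))
            if action_name == "delete":
                continue  # delete actions are not followed by a document line
            pending_action = obj
        else:
            # Document line belonging to the preceding action.
            yield pending_action, obj
            pending_action = None


# Hypothetical sample records; field names are assumed, not taken from the dump.
SAMPLE = [
    '{"index": {"_id": "1"}}',
    '{"title": "Bratislava", "text": "Bratislava je hlavné mesto Slovenska."}',
    '{"index": {"_id": "2"}}',
    '{"title": "Dunaj", "text": "Dunaj je rieka v Európe."}',
]

documents = [doc for _, doc in iter_bulk_documents(SAMPLE)]
titles = [doc["title"] for doc in documents]
```

Because the function consumes lines lazily, it can be applied to a large dump file without loading it into memory.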
Word Embedding
Resource databases
https://github.com/slovak-nlp/resources
https://www.clarin.eu/portal
https://www.clarin.eu/resource-families/manually-annotated-corpora
http://www.meta-share.org/
https://korpus.sk/res.html
Slovak Stemming https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Slovak_Stemmer_Analysis
Tools
- spaCy: tokenizer, stopwords, custom model
- Slovak Lexer / tokenizer
- Slovak Elasticsearch - stopwords, stemmer
- Slovak Hunspell - stemmer, spelling