Mirror of https://github.com/csunny/DB-GPT.git (synced 2025-07-31 15:47:05 +00:00)
update ignore file
This commit is contained in:
parent 63586fc6a3
commit 2252b9b6ba
1  .gitignore (vendored)
@@ -131,3 +131,4 @@ dmypy.json
 .pyre/
 .DS_Store
 logs
+.vectordb
98  pilot/nltk_data/tokenizers/punkt/PY3/README  Normal file
@@ -0,0 +1,98 @@
Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)

Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.

For information about how to use these models, please consult the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
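As a quick orientation, a minimal usage sketch, assuming this nltk_data directory is on
NLTK's data search path (the sample sentence is only illustrative):

# Load a pretrained Punkt model and split text into sentences
import nltk.data
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
print(tokenizer.tokenize("Dr. Smith arrived at 9 a.m. He left early."))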
There are pretrained tokenizers for the following languages:

File                Language      Source                              Contents                      Size of training corpus (in tokens)  Model contributed by
========================================================================================================================================================================
czech.pickle        Czech         Multilingual Corpus 1 (ECI)         Lidove Noviny                 ~345,000                             Jan Strunk / Tibor Kiss
                                                                      Literarni Noviny
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
danish.pickle       Danish        Avisdata CD-Rom Ver. 1.1. 1995      Berlingske Tidende            ~550,000                             Jan Strunk / Tibor Kiss
                                  (Berlingske Avisdata, Copenhagen)   Weekend Avisen
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dutch.pickle        Dutch         Multilingual Corpus 1 (ECI)         De Limburger                  ~340,000                             Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
english.pickle      English       Penn Treebank (LDC)                 Wall Street Journal           ~469,000                             Jan Strunk / Tibor Kiss
                    (American)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
estonian.pickle     Estonian      University of Tartu, Estonia        Eesti Ekspress                ~359,000                             Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
finnish.pickle      Finnish       Finnish Parole Corpus, Finnish      Books and major national      ~364,000                             Jan Strunk / Tibor Kiss
                                  Text Bank (Suomen Kielen            newspapers
                                  Tekstipankki)
                                  Finnish Center for IT Science
                                  (CSC)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
french.pickle       French        Multilingual Corpus 1 (ECI)         Le Monde                      ~370,000                             Jan Strunk / Tibor Kiss
                    (European)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
german.pickle       German        Neue Zürcher Zeitung AG             Neue Zürcher Zeitung          ~847,000                             Jan Strunk / Tibor Kiss
                    (Switzerland)                                     CD-ROM
                                                                      (Uses "ss"
                                                                      instead of "ß")
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
greek.pickle        Greek         Efstathios Stamatatos               To Vima (TO BHMA)             ~227,000                             Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
italian.pickle      Italian       Multilingual Corpus 1 (ECI)         La Stampa, Il Mattino         ~312,000                             Jan Strunk / Tibor Kiss
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
norwegian.pickle    Norwegian     Centre for Humanities               Bergens Tidende               ~479,000                             Jan Strunk / Tibor Kiss
                    (Bokmål and   Information Technologies,
                    Nynorsk)      Bergen
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
polish.pickle       Polish        Polish National Corpus              Literature, newspapers, etc.  ~1,000,000                           Krzysztof Langner
                                  (http://www.nkjp.pl/)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
portuguese.pickle   Portuguese    CETENFolha Corpus                   Folha de São Paulo            ~321,000                             Jan Strunk / Tibor Kiss
                    (Brazilian)   (Linguateca)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
slovene.pickle      Slovene       TRACTOR                             Delo                          ~354,000                             Jan Strunk / Tibor Kiss
                                  Slovene Academy for Arts
                                  and Sciences
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
spanish.pickle      Spanish       Multilingual Corpus 1 (ECI)         Sur                           ~353,000                             Jan Strunk / Tibor Kiss
                    (European)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
swedish.pickle      Swedish       Multilingual Corpus 1 (ECI)         Dagens Nyheter                ~339,000                             Jan Strunk / Tibor Kiss
                                                                      (and some other texts)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
turkish.pickle      Turkish       METU Turkish Corpus                 Milliyet                      ~333,000                             Jan Strunk / Tibor Kiss
                                  (Türkçe Derlem Projesi)
                                  University of Ankara
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.

---- Training Code ----

# Imports: punkt itself, codecs for decoding the corpus, pickle for saving
import codecs
import pickle

import nltk.tokenize.punkt

# Make a new tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in training corpus (one example: Slovene); plain "r" replaces the
# obsolete "Ur" universal-newline mode
text = codecs.open("slovene.plain", "r", "iso-8859-2").read()

# Train tokenizer on the raw text
tokenizer.train(text)

# Dump pickled tokenizer
out = open("slovene.pickle", "wb")
pickle.dump(tokenizer, out)
out.close()
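
To sanity-check the result, the dumped model can be reloaded and applied; a minimal
sketch, assuming slovene.pickle was written as above (the sample text is hypothetical):

# Reload the pickled tokenizer and segment some text
import pickle
with open("slovene.pickle", "rb") as f:
    tokenizer = pickle.load(f)
print(tokenizer.tokenize("Prvi stavek. Drugi stavek."))  # hypothetical sample sentences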

---------
BIN  pilot/nltk_data/tokenizers/punkt/PY3/dutch.pickle       Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/english.pickle     Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/french.pickle      Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/italian.pickle     Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/malayalam.pickle   Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/portuguese.pickle  Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/russian.pickle     Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/slovene.pickle     Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/spanish.pickle     Normal file  (binary file not shown)
BIN  pilot/nltk_data/tokenizers/punkt/PY3/swedish.pickle     Normal file  (binary file not shown)
98  pilot/nltk_data/tokenizers/punkt/README  Normal file
@@ -0,0 +1,98 @@
(Content identical to pilot/nltk_data/tokenizers/punkt/PY3/README above.)
97138   pilot/nltk_data/tokenizers/punkt/dutch.pickle       Normal file  (file diff suppressed because it is too large)
61702   pilot/nltk_data/tokenizers/punkt/english.pickle     Normal file  (file diff suppressed because it is too large)
80529   pilot/nltk_data/tokenizers/punkt/french.pickle      Normal file  (file diff suppressed because it is too large)
90202   pilot/nltk_data/tokenizers/punkt/italian.pickle     Normal file  (file diff suppressed because it is too large)
BIN     pilot/nltk_data/tokenizers/punkt/malayalam.pickle   Normal file  (binary file not shown)
90795   pilot/nltk_data/tokenizers/punkt/portuguese.pickle  Normal file  (file diff suppressed because it is too large)
BIN     pilot/nltk_data/tokenizers/punkt/russian.pickle     Normal file  (binary file not shown)
106925  pilot/nltk_data/tokenizers/punkt/slovene.pickle     Normal file  (file diff suppressed because it is too large)
82636   pilot/nltk_data/tokenizers/punkt/spanish.pickle     Normal file  (file diff suppressed because it is too large)
133719  pilot/nltk_data/tokenizers/punkt/swedish.pickle     Normal file  (file diff suppressed because it is too large)
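
Since the punkt pickles are vendored under pilot/nltk_data rather than fetched with
nltk.download(), the application presumably has to register that directory with NLTK;
a hedged sketch of the usual way (the path reflects this commit's layout, not a
confirmed code change in DB-GPT):

# Make NLTK look for tokenizer data in the vendored directory
import nltk
nltk.data.path.append("pilot/nltk_data")  # assumed location, per this commit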