Python: How do I download NLTK data?

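Updated answer: NLTK works with Python 2.7. I had 3.2. I uninstalled 3.2 and installed 2.7. Now it works!!

I have installed NLTK and tried to download the NLTK data. What I did was follow the instructions on this site: http://www.nltk.org/data.html

I downloaded NLTK, installed it, and then tried to run the following code: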

>>> import nltk

>>> nltk.download()

It gave me the following error message:

Traceback (most recent call last):

File "<pyshell#6>", line 1, in <module>

nltk.download()

AttributeError: 'module' object has no attribute 'download'

Directory of C:\Python32\Lib\site-packages

I tried both nltk.download() and nltk.downloader(); both gave me error messages.

Then I used help(nltk) to pull up the package information, which shows the following:

NAME

nltk

PACKAGE CONTENTS

align

app (package)

book

ccg (package)

chat (package)

chunk (package)

classify (package)

cluster (package)

collocations

corpus (package)

data

decorators

downloader

draw (package)

examples (package)

featstruct

grammar

help

inference (package)

internals

lazyimport

metrics (package)

misc (package)

model (package)

parse (package)

probability

sem (package)

sourcedstring

stem (package)

tag (package)

test (package)

text

tokenize (package)

toolbox

tree

treetransforms

util

yamltags

FILE

c:\python32\lib\site-packages\nltk

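I do see downloader listed there, so I am not sure why it does not work. Python 3.2.2, Windows Vista.

Answer:

To download a specific dataset or model, use the nltk.download() function. For example, to download the punkt sentence tokenizer: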

$ python3

>>> import nltk

>>> nltk.download('punkt')
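
In a script, it can be convenient to download a package only when it is missing. The following is a minimal sketch of that idea, not an official NLTK recipe: ensure_resource is a hypothetical helper name, and it relies on nltk.data.find() raising LookupError for a missing resource. The find and download parameters exist only so the logic can be tried without NLTK or a network connection.

```python
def ensure_resource(resource_path, package_id, find=None, download=None):
    """Download package_id only if resource_path cannot be found locally.

    find and download default to nltk.data.find and nltk.download.
    Returns True if a download was attempted, False if the resource
    was already present.
    """
    if find is None or download is None:
        # Imported lazily so the helper has no hard dependency on NLTK
        # when both callables are supplied by the caller.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)
        return False
    except LookupError:
        download(package_id)
        return True


# Typical use (needs NLTK installed and, on first run, network access):
# ensure_resource("tokenizers/punkt", "punkt")
```

The resource path ("tokenizers/punkt") and the package id ("punkt") are two different strings: the first is where the data lives under nltk_data, the second is the name the downloader knows it by.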

If you are not sure which data or models you need, you can start with this basic list of data and models:

>>> import nltk

>>> nltk.download('popular')

This downloads a list of "popular" resources, which includes:

<collection id="popular" name="Popular packages">

<item ref="cmudict" />

<item ref="gazetteers" />

<item ref="genesis" />

<item ref="gutenberg" />

<item ref="inaugural" />

<item ref="movie_reviews" />

<item ref="names" />

<item ref="shakespeare" />

<item ref="stopwords" />

<item ref="treebank" />

<item ref="twitter_samples" />

<item ref="omw" />

<item ref="wordnet" />

<item ref="wordnet_ic" />

<item ref="words" />

<item ref="maxent_ne_chunker" />

<item ref="punkt" />

<item ref="snowball_data" />

<item ref="averaged_perceptron_tagger" />

</collection>
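
Each ref in the collection is a package id in its own right, so any of them can also be passed to nltk.download() individually. As a small illustration, here is a sketch that reads the ids out of a collection entry like the one above using only the standard library. COLLECTION_XML is an abbreviated copy of that entry, and collection_members is a hypothetical helper, not part of NLTK.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the collection entry quoted above.
COLLECTION_XML = """
<collection id="popular" name="Popular packages">
  <item ref="punkt" />
  <item ref="stopwords" />
  <item ref="wordnet" />
</collection>
"""


def collection_members(xml_text):
    """Return (collection id, [package ids]) for one <collection> element."""
    root = ET.fromstring(xml_text)
    return root.get("id"), [item.get("ref") for item in root.findall("item")]


collection_id, packages = collection_members(COLLECTION_XML)
print(collection_id, packages)
```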

EDITED

In case anyone runs into errors when downloading the larger datasets from nltk, the following workaround is from https://stackoverflow.com/a/38135306/610569:

$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip

$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite

$ python

>>> import nltk

>>> dler = nltk.downloader.Downloader()

>>> dler._update_index()

>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as if it's already installed.

>>> dler.download('popular')

UPDATED

Since v3.2.5, NLTK gives a more informative error message when a resource cannot be found in nltk_data, e.g.:

>>> from nltk import word_tokenize

>>> word_tokenize('x')

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize

sentences = [text] if preserve_line else sent_tokenize(text, language)

File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize

tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))

File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load

opened_resource = _open(resource_url)

File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open

return find(path_, path + ['']).open()

File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find

raise LookupError(resource_not_found)

LookupError:

**********************************************************************

Resource punkt not found.

Please use the NLTK Downloader to obtain the resource:

>>> import nltk

>>> nltk.download('punkt')

Searched in:

- '/Users/alvas/nltk_data'

- '/usr/share/nltk_data'

- '/usr/local/share/nltk_data'

- '/usr/lib/nltk_data'

- '/usr/local/lib/nltk_data'

- ''
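
The "Searched in" entries come from nltk.data.path, which is a plain Python list. If the data was downloaded to some other location (for example via the download_dir argument of nltk.download(), or a directory named in the NLTK_DATA environment variable), that directory has to appear in the list. Below is a rough sketch for checking which entries of such a list actually exist and for adding one more. existing_data_dirs and add_data_dir are hypothetical helpers, and the demo uses a temporary directory instead of a real nltk_data folder so it runs without NLTK installed.

```python
import os
import tempfile


def existing_data_dirs(search_path):
    """Return the entries of search_path that exist as directories."""
    return [p for p in search_path if p and os.path.isdir(p)]


def add_data_dir(search_path, directory):
    """Append directory to search_path (in place) unless it is already listed."""
    if directory not in search_path:
        search_path.append(directory)
    return search_path


# Demo with a stand-in list; with NLTK installed, pass nltk.data.path instead.
demo_dir = tempfile.mkdtemp()
paths = ["/nonexistent/nltk_data", ""]
add_data_dir(paths, demo_dir)
print(existing_data_dirs(paths))
```

Because add_data_dir modifies the list in place, calling it with nltk.data.path affects later lookups in the same Python session only; it does not persist between runs.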
