word_tokenize TypeError: expected string or buffer
Problem description
Calling word_tokenize raises the following error:
File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322,
in _slices_from_text for match in
self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
I have a large text file (1500.txt) from which I want to remove stop words. My code is as follows:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as File_1500:
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(File_1500)  # File_1500 is a file object, not a string
    filtered_sentence = [w for w in words if not w in stop_words]
    print(filtered_sentence)
Recommended answer
The input to word_tokenize is a single sentence string, e.g. "this is sentence 1." or "that's sentence 2!"; to process a whole document you split it into a list of such sentence strings and tokenize each one.
File_1500 is a file object, not a string, which is why word_tokenize fails with TypeError: expected string or buffer.
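A quick check illustrates the difference (the exact token list can vary slightly with the NLTK version; the file path is the one from the question):

from nltk.tokenize import word_tokenize

print(word_tokenize("this is sentence 1."))   # a plain string works, roughly ['this', 'is', 'sentence', '1', '.']

fin = open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1')
print(word_tokenize(fin))                     # a file object raises the TypeError from the question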
To get a list of sentence strings, first read the whole file into a string with fin.read(), then use sent_tokenize to split it into sentences (I'm assuming your input file is not already sentence-tokenized, just a raw text file).
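For example, on a small made-up string (sentence boundaries come from NLTK's Punkt model, so the exact split can vary):

from nltk.tokenize import sent_tokenize

text = "This is sentence 1. And that's sentence 2!"
print(sent_tokenize(text))   # typically ['This is sentence 1.', "And that's sentence 2!"]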
Also, it's better / more idiomatic to tokenize a file this way with NLTK:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words("english"))

with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    # Read the whole file into one string, split it into sentences,
    # then tokenize and filter each sentence separately.
    for sent in sent_tokenize(fin.read()):
        words = word_tokenize(sent)
        filtered_sentence = [w for w in words if w not in stop_words]
        print(filtered_sentence)
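If you would rather collect all filtered tokens from the file into one list instead of printing them sentence by sentence, a small variation (same imports and stop_words as above) is:

filtered_words = []
with open('E:\\Book\\1500.txt', "r", encoding='ISO-8859-1') as fin:
    for sent in sent_tokenize(fin.read()):
        # keep only the tokens that are not stop words
        filtered_words.extend(w for w in word_tokenize(sent) if w not in stop_words)
print(filtered_words)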