Kaggle word2vec competition, part 2


Problem description

My code is from https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors. I read the data successfully. Here BeautifulSoup and nltk are used to clean the text and remove all non-letter characters (including numbers).

import re
from bs4 import BeautifulSoup

def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review).get_text()
    #
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Return a list of words
    return(words)
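
For reference, running the function on a made-up review string (not from the Kaggle data) gives something like:

print review_to_wordlist("<p>This movie was GREAT -- 10/10!</p>")
# ['this', 'movie', 'was', 'great']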

But when I continue to the next step, the code fails and I cannot go further.

sentences = []  # Initialize an empty list of sentences

print "Parsing sentences from training set"
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

**Error: what does this mean? The code before this runs fine (I have tried it three times), but when execution reaches this point the following error appears.**
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 6, in review_to_sentences
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
    for el in it:
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 15: ordinal not in range(128)
>>> 

Recommended answer

This is a UnicodeDecodeError, which occurs when your data is not of the proper type (it should be 'unicode' rather than a byte 'str'). Changing the call to the following may help:

`sentences += review_to_sentences(review.decode("utf8"), tokenizer)`
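
To see why the decode helps: in Python 2 a plain str holds raw bytes, and NLTK's punkt tokenizer ends up decoding them implicitly with the default 'ascii' codec, which fails on any multi-byte UTF-8 sequence (the 0xc2 in the traceback is the lead byte of a two-byte UTF-8 character). A minimal sketch of the difference, assuming the reviews are UTF-8 encoded:

# -*- coding: utf-8 -*-
raw = "caf\xc3\xa9"            # UTF-8 bytes for "café" (a Python 2 str)
# raw.decode("ascii")          # would raise UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3
text = raw.decode("utf8")      # a unicode object, which punkt can tokenize safely
print type(raw), type(text)    # <type 'str'> <type 'unicode'>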

But that may take time. Another way is to specify the 'utf-8' encoding at the start, when you read the input data:

`pd.read_csv("input_file", encoding="utf-8")`
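
Applied to this tutorial's data, that would look roughly like the sketch below. The file name, tab delimiter, and quoting option are assumptions based on the part-2 tutorial setup, so adjust them to your own paths; with an explicit encoding, pandas should hand back the review column as unicode rather than raw bytes.

import pandas as pd

# Read the labeled training data, decoding every field as UTF-8 up front
# (file name and parser options are assumed from the Kaggle tutorial).
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t",
                    quoting=3, encoding="utf-8")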
