Kaggle word2vec competition, part 2
Problem description
My code is from https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors. I read the data successfully. BeautifulSoup and nltk are used here to clean the text by removing non-letter characters, including digits.
def review_to_wordlist(review, remove_stopwords=False):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words. Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review).get_text()
    #
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Return a list of words
    return words
But when I get to the following step, I cannot proceed.
sentences = []  # Initialize an empty list of sentences
print "Parsing sentences from training set"
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)
**Error: what does this mean? The preceding code runs fine. I have tried it 3 times, and each time the code reaches this point the following error appears.**
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "<stdin>", line 6, in review_to_sentences
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
for el in it:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
prev = next(it)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
for aug_tok in tokens:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 15: ordinal not in range(128)
>>>
Recommended answer
This is a UnicodeDecodeError, which occurs when your data is not of the proper type (it should be 'unicode' instead of 'str'). In Python 2, a 'str' holding non-ASCII bytes is implicitly decoded as ASCII when it is mixed with unicode, and that fails on bytes such as 0xc2. Changing the line to this may help:
`sentences += review_to_sentences(review.decode("utf8"), tokenizer)`
However, decoding every review this way may take some time. Another option is to specify the 'utf8' encoding up front, when you read the input data:
`pd.read_csv("input_file", encoding="utf-8")`
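To make the two suggestions concrete, here is a minimal sketch that uses only the standard library. It is an illustration under assumptions, not the tutorial's code: the bytes in `raw_review` are a made-up sample, and `io.open(..., encoding="utf-8")` stands in for the `pd.read_csv(..., encoding="utf-8")` call to show the same idea of decoding once at read time.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample: UTF-8 bytes containing a non-breaking space
# (0xc2 0xa0), the kind of byte the traceback complains about.
raw_review = b"A great movie.\xc2\xa0Really worth watching."

# Decoding with the ASCII codec fails on byte 0xc2.
try:
    raw_review.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Option 1: decode each byte string explicitly as UTF-8.
text_review = raw_review.decode("utf8")

# Option 2: decode once at read time by opening the file with an encoding.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as f:
        f.write(raw_review)
    with io.open(path, "r", encoding="utf-8") as f:
        text_from_file = f.read()
finally:
    os.remove(path)

print(ascii_failed)
print(text_review == text_from_file)
```

Both options end with the same decoded text. The second does the decoding in one place rather than once per review inside the loop.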