TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document


Question

I'm using TfidfVectorizer from scikit-learn to extract features from text data. I have a CSV file with a Score (either +1 or -1) and a Review (text). I loaded this data into a DataFrame so I can run the vectorizer on it.

Here is my code:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("train_new.csv",
             names = ['Score', 'Review'], sep=',')

# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()

v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])

This is the traceback for the error I get:

Traceback (most recent call last):
  File "/home/PycharmProjects/Review/src/feature_extraction.py", line 16, in <module>
    x = v.fit_transform(df['Review'])
  File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.

I checked the CSV file and the DataFrame for anything that is being read as NaN, but I can't find anything. There are 18000 rows, and none of them returns True for isnan.

This is what df['Review'].head() looks like:

  0    This book is such a life saver.  It has been s...
  1    I bought this a few times for my older son and...
  2    This is great for basics, but I wish the space...
  3    This book is perfect!  I'm a first time new mo...
  4    During your postpartum stay at the hospital th...
  Name: Review, dtype: object

Recommended answer

You need to convert the column from dtype object to unicode strings, as the traceback states. The cast also turns any missing value (NaN) into the literal string 'nan', which the vectorizer accepts.

x = v.fit_transform(df['Review'].values.astype('U'))  ## Even astype(str) would work
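A minimal sketch of why the cast helps, assuming the column holds at least one missing value that pandas read as NaN. The toy DataFrame is made up for illustration and is not the asker's data. It also shows that a comparison such as df['Review'] == np.nan is always False, because NaN never compares equal to itself, so isna() is the check that actually finds such rows. The fillna('') variant is an alternative that is not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for train_new.csv: one review is missing.
df = pd.DataFrame({
    "Score": [1, -1, 1],
    "Review": pd.Series(
        ["This book is perfect", np.nan, "Great for basics"], dtype=object
    ),
})

# Comparing with == never matches NaN; isna() does.
found_by_eq = int((df["Review"] == np.nan).sum())
found_by_isna = int(df["Review"].isna().sum())

v = TfidfVectorizer(decode_error="replace", encoding="utf-8")

# Passing the raw column reproduces the error from the question.
try:
    v.fit_transform(df["Review"])
    raised = False
except ValueError:
    raised = True

# Option 1 (the answer's fix): cast to unicode; NaN becomes the string 'nan'.
docs_cast = df["Review"].values.astype("U")
x_cast = v.fit_transform(docs_cast)

# Option 2 (an alternative, not from the answer): use empty strings instead.
docs_filled = df["Review"].fillna("")
x_filled = TfidfVectorizer().fit_transform(docs_filled)
```

With the cast, the token 'nan' ends up in the vocabulary, which may or may not be wanted; filling with an empty string (or dropping the rows) avoids that.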

From the documentation page of TfidfVectorizer:

fit_transform(raw_documents, y=None)

Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
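To see which rows break that contract, one option is to test each element's type before vectorizing. This is only a sketch: the Series below is a hypothetical stand-in for df['Review'], and the names is_text, bad_rows and clean are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column; an object-dtype Series can mix strings and floats.
reviews = pd.Series(["good book", np.nan, "too short"], dtype=object)

# True where the element is a text string the vectorizer can accept.
is_text = reviews.map(lambda doc: isinstance(doc, str))

# Index labels of the offending rows (here, the NaN at label 1).
bad_rows = reviews.index[~is_text].tolist()

# Keep only valid documents; the Score column would need the same mask.
clean = reviews[is_text]
```

Inspecting bad_rows against the CSV can show whether the missing entries are empty fields or parsing problems before deciding to cast, fill, or drop them.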
