Memory Error when attempting to apply 'fit_transform()' on TFidfVectorizer containing Pandas Dataframe column (containing strings)

Problem Description

I'm attempting a similar operation as shown here. I begin by reading in two columns from a CSV file that contains 2405 rows, in the format: Year, e.g. "1995", and cleaned, e.g. ["this", "is", "exemplar", "document", "contents"]; both columns utilise strings as data types.

    df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])

I have already pre-cleaned the data, and below shows the format of the top 4 rows:

    [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

    [IN] df['cleaned'].head()

   [OUT] 0    acquaint hous receiv follow letter clerk crown...
         1    ask secretari state war whether issu statement...
         2    i beg present petit sign upward motor car driv...
         3    i desir ask secretari state war second lieuten...
         4    ask secretari state war whether would introduc...
         Name: cleaned, dtype: object

Then I initialise the TfidfVectorizer:

    [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8')

Following this, calling upon the below line results in:

    [IN] x = v.fit_transform(df['cleaned'])
   [OUT] ValueError: np.nan is an invalid document, expected byte or unicode string.
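A quick way to confirm that the NaN is the culprit (a minimal sketch; the frame contents are illustrative, and I'm assuming a reasonably recent scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy frame with one missing document -- read_csv leaves NaN in rows
# where the 'cleaned' cell is empty (contents are illustrative)
df = pd.DataFrame({"cleaned": ["ask secretari state war", np.nan]})
print(df["cleaned"].isna().sum())  # 1 missing document

v = TfidfVectorizer(decode_error="replace", encoding="utf-8")
try:
    v.fit_transform(df["cleaned"])
except ValueError as err:
    print(err)  # np.nan is an invalid document, ...
```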

I used the above; however, this resulted in a Memory Error (Full Traceback).

I've attempted to look up storage using Pickle to circumvent mass-memory usage, but I'm not sure how to filter it in this scenario. Any tips would be much appreciated, and thanks for reading.

[UPDATE]

@pittsburgh137 posted a solution to a similar problem involving fitting data here, in which the training data is generated using pandas.get_dummies(). What I've done with this is:

    [IN] train_X = pandas.get_dummies(df['cleaned'])
    [IN] train_X.shape
   [OUT] (2405, 2380)

    [IN] x = v.fit_transform(train_X)
    [IN] type(x)
   [OUT] scipy.sparse.csr.csr_matrix

I thought I should update any readers while I see what I can do with this development. If there are any predicted pitfalls with this method, I'd love to hear them.
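One pitfall I can already sketch with toy documents (not the real data): pd.get_dummies() treats each whole document string as a single category, so train_X is just a one-hot matrix with no term information. Worse, iterating a DataFrame yields its column labels, so fit_transform ends up fitting on the unique document strings rather than the rows, silently dropping duplicates and breaking alignment with the Year column:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Three documents, two of them identical (toy data)
docs = pd.Series([
    "ask secretari state war",
    "acquaint hous receiv",
    "ask secretari state war",
])

dummies = pd.get_dummies(docs)
print(dummies.shape)  # (3, 2): one column per *unique document*, one-hot

# Iterating a DataFrame yields its column labels, so the vectorizer
# is fit on the 2 unique document strings, not on the 3 rows:
v = TfidfVectorizer()
x = v.fit_transform(dummies)
print(x.shape)  # (2, 7) -- no longer aligned with the original rows
```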

Answer

I believe it's the conversion to dtype('<Unn') that might be giving you trouble. Check out the size of the array on a relative basis, using just the first few documents plus a NaN:

>>> df['cleaned'].values
array(['acquaint hous receiv follow letter clerk crown',
       'ask secretari state war whether issu statement',
       'i beg present petit sign upward motor car driv',
       'i desir ask secretari state war second lieuten',
       'ask secretari state war whether would introduc', nan],
      dtype=object)

>>> df['cleaned'].values.astype('U').nbytes
1104

>>> df['cleaned'].values.nbytes
48

It seems like it would make sense to drop the NaN values first (df.dropna(inplace=True)). Then, it should be pretty efficient to call v.fit_transform(df['cleaned'].tolist()).
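Putting the answer together as a runnable sketch (the frame contents are illustrative; np.nan stands in for the missing rows that read_csv produces):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "Year": ["1909", "1909", "1909"],
    "cleaned": ["acquaint hous receiv follow letter",
                np.nan,
                "ask secretari state war"],
})

df.dropna(inplace=True)  # discard rows with missing documents first
v = TfidfVectorizer(decode_error="replace", encoding="utf-8")

# .tolist() hands the vectorizer plain Python strings, avoiding the
# fixed-width dtype('<U..') copy of the whole column
x = v.fit_transform(df["cleaned"].tolist())
print(x.shape)  # (2, number of vocabulary terms)
```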
