Memory usage in creating Term Density Matrix from pandas dataFrame


Question


I have a DataFrame which I save to/read from a csv file, and I want to create a Term Density Matrix DataFrame from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped that code in a function:

    import numpy as np
    import pandas as pd
    from scipy.sparse import coo_matrix, csc_matrix, hstack
    from sklearn.feature_extraction.text import CountVectorizer

    countvec = CountVectorizer()

    def df2tdm(df, titleColumn, placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
        of the words appearing in the titleColumn.

        Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn

        Credits:
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        '''
        tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(),
                              columns=countvec.get_feature_names())
        tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
        return tdm_df


Which returns the TDM as a DataFrame, for example:

    df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ', 'Potato salad', 'Split orange','Something else'], 'page':[1, 1, 2, 3, 4]})
    print df.head()
    tdm_df = df2tdm(df,'title','page')
    tdm_df.head()

       boiled  delicious  egg  else  fried  orange  potato  salad  something  \
    0       1          1    1     0      0       0       0      0          0   
    1       0          0    1     0      1       0       0      0          0   
    2       0          0    0     0      0       0       1      1          0   
    3       0          0    0     0      0       1       0      0          0   
    4       0          0    0     1      0       0       0      0          1   

       split  page  
    0      0     1  
    1      0     1  
    2      0     2  
    3      1     3  
    4      0     4  


This implementation suffers from bad memory scaling: When I use a DataFrame which occupies 190 kB saved as utf8, the function uses ~200 MB to create the TDM dataframe. When the csv file is 600 kB, the function uses 700 MB, and when the csv is 3.8 MB the function uses up all of my memory and swap file (8 GB) and crashes.
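The scaling is what you would expect from the `.toarray()` call: it materializes an n_docs × n_vocab dense matrix, so memory grows with the product of corpus size and vocabulary size even though almost all counts are zero. A back-of-the-envelope sketch (the numbers below are hypothetical, chosen only to illustrate the order of magnitude):

```python
# Rough estimate of why .toarray() blows up: a dense count matrix
# stores n_docs * n_vocab cells regardless of how sparse the counts are.
n_docs, n_vocab = 50_000, 60_000      # hypothetical corpus/vocabulary sizes
bytes_per_cell = 8                    # int64, NumPy's default integer dtype
dense_bytes = n_docs * n_vocab * bytes_per_cell
print(dense_bytes / 1024**3, "GiB")   # tens of GiB for the dense matrix alone
```

A sparse representation, by contrast, only stores the nonzero entries, which for short titles is a handful of words per row.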


I also made an implementation using sparse matrices and sparse DataFrames (below), but the memory usage is pretty much the same, only it is considerably slower:

    def df2tdm_sparse(df, titleColumn, placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
        of the words appearing in the titleColumn. This implementation uses sparse DataFrames.

        Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn

        Credits:
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
        https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
        '''
        pm = df[[placementColumn]].values
        tm = countvec.fit_transform(df[titleColumn])
        m = csc_matrix(hstack([pm, tm]))
        dfout = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel())
                                    for i in np.arange(m.shape[0])])
        dfout.columns = [placementColumn] + countvec.get_feature_names()
        return dfout


Any suggestions on how to improve memory usage? I wonder if this is related to the memory issues of scikit-learn, e.g. here.

Answer


I also think that the problem might be with the conversion from sparse matrix to sparse data frame.


Try this function (or something similar):

    def SparseMatrixToSparseDF(xSparseMatrix):
        import numpy as np
        import pandas as pd

        def ElementsToNA(x):
            # Replace zeros with NaN so pandas does not store them explicitly
            # (cast to float first, since NaN cannot live in an integer array)
            x = x.astype(float)
            x[x == 0] = np.nan
            return x

        xdf1 = pd.SparseDataFrame([pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
                                   for i in np.arange(xSparseMatrix.shape[0])])
        return xdf1


You can see that it reduces the size by checking the density attribute:

 df1.density
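(Editor's note: the `density` attribute above belonged to the old `SparseDataFrame`, which was removed in pandas 1.0. With sparse-dtype columns the equivalent check lives on the `.sparse` accessor; a small sketch with made-up data:)

```python
import pandas as pd

# A mostly-zero frame converted to sparse columns (SparseDataFrame's replacement)
dense = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [0, 2, 0, 0]})
sparse = dense.astype(pd.SparseDtype('int64', fill_value=0))
print(sparse.sparse.density)  # fraction of explicitly stored values
```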

I hope it helps.

