Memory usage in creating Term Density Matrix from pandas dataFrame


Question


I have a DataFrame which I save to/read from a csv file, and I want to create a Term Density Matrix DataFrame from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped that code in a function:

    import numpy as np
    import pandas as pd
    from scipy.sparse import coo_matrix, csc_matrix, hstack
    from sklearn.feature_extraction.text import CountVectorizer

    countvec = CountVectorizer()

    def df2tdm(df, titleColumn, placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
        of the words appearing in the titleColumn.

        Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn

        Credits:
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        '''
        tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(),
                              columns=countvec.get_feature_names())
        tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
        return tdm_df


Which returns the TDM as a DataFrame, for example:

    df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ', 'Potato salad', 'Split orange','Something else'], 'page':[1, 1, 2, 3, 4]})
    print df.head()
    tdm_df = df2tdm(df,'title','page')
    tdm_df.head()

       boiled  delicious  egg  else  fried  orange  potato  salad  something  \
    0       1          1    1     0      0       0       0      0          0   
    1       0          0    1     0      1       0       0      0          0   
    2       0          0    0     0      0       0       1      1          0   
    3       0          0    0     0      0       1       0      0          0   
    4       0          0    0     1      0       0       0      0          1   

       split  page  
    0      0     1  
    1      0     1  
    2      0     2  
    3      1     3  
    4      0     4  


This implementation suffers from bad memory scaling: When I use a DataFrame which occupies 190 kB saved as utf8, the function uses ~200 MB to create the TDM dataframe. When the csv file is 600 kB, the function uses 700 MB, and when the csv is 3.8 MB the function uses up all of my memory and swap file (8 GB) and crashes.
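The scaling is what you would expect from the `.toarray()` call: it materializes an n_docs × n_vocab dense matrix, so memory grows with the product of corpus size and vocabulary size even though almost all counts are zero. A back-of-the-envelope sketch (the numbers below are hypothetical, chosen only to illustrate the order of magnitude):

```python
# Rough estimate of why .toarray() blows up: a dense count matrix
# stores n_docs * n_vocab cells regardless of how sparse the counts are.
n_docs, n_vocab = 50_000, 60_000      # hypothetical corpus/vocabulary sizes
bytes_per_cell = 8                    # int64, NumPy's default integer dtype
dense_bytes = n_docs * n_vocab * bytes_per_cell
print(dense_bytes / 1024**3, "GiB")   # tens of GiB for the dense matrix alone
```

A sparse representation, by contrast, only stores the nonzero entries, which for short titles is a handful of words per row.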


I also made an implementation using sparse matrices and sparse DataFrames (below), but the memory usage is pretty much the same, only it is considerably slower:

    def df2tdm_sparse(df, titleColumn, placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
        of the words appearing in the titleColumn. This implementation uses sparse DataFrames.

        Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn

        Credits:
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
        https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
        '''
        pm = df[[placementColumn]].values
        tm = countvec.fit_transform(df[titleColumn])
        m = csc_matrix(hstack([pm, tm]))
        dfout = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel())
                                    for i in np.arange(m.shape[0])])
        dfout.columns = [placementColumn] + countvec.get_feature_names()
        return dfout


Any suggestions on how to improve memory usage? I wonder if this is related to the memory issues of scikit-learn, e.g. here.

Answer


I also think that the problem might be with the conversion from sparse matrix to sparse data frame.


Try this function (or something similar):

    def SparseMatrixToSparseDF(xSparseMatrix):
        import numpy as np
        import pandas as pd

        def ElementsToNA(x):
            # Replace zeros with NaN so pandas does not store them explicitly
            # (cast to float first, since NaN cannot live in an integer array)
            x = x.astype(float)
            x[x == 0] = np.nan
            return x

        xdf1 = pd.SparseDataFrame([pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
                                   for i in np.arange(xSparseMatrix.shape[0])])
        return xdf1


You can see that it reduces the size by checking the density attribute:

 df1.density
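(Editor's note: the `density` attribute above belonged to the old `SparseDataFrame`, which was removed in pandas 1.0. With sparse-dtype columns the equivalent check lives on the `.sparse` accessor; a small sketch with made-up data:)

```python
import pandas as pd

# A mostly-zero frame converted to sparse columns (SparseDataFrame's replacement)
dense = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [0, 2, 0, 0]})
sparse = dense.astype(pd.SparseDtype('int64', fill_value=0))
print(sparse.sparse.density)  # fraction of explicitly stored values
```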

I hope it helps.

