Memory usage in creating Term Density Matrix from pandas DataFrame
Question
I have a DataFrame that I save to and read from a CSV file, and I want to create a Term Density Matrix (TDM) DataFrame from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped that code in a function:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, csc_matrix, hstack

countvec = CountVectorizer()

def df2tdm(df, titleColumn, placementColumn):
    '''
    Takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
    of the words appearing in the titleColumn
    Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
    Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn
    Credits:
    https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
    '''
    # .toarray() materializes the full dense documents-by-vocabulary matrix
    # (scikit-learn >= 1.0: use get_feature_names_out() instead of get_feature_names())
    tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(), columns=countvec.get_feature_names())
    tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
    return tdm_df
This returns the TDM as a DataFrame, for example:
df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ', 'Potato salad', 'Split orange','Something else'], 'page':[1, 1, 2, 3, 4]})
print(df.head())
tdm_df = df2tdm(df,'title','page')
tdm_df.head()
boiled delicious egg else fried orange potato salad something \
0 1 1 1 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
2 0 0 0 0 0 0 1 1 0
3 0 0 0 0 0 1 0 0 0
4 0 0 0 1 0 0 0 0 1
split page
0 0 1
1 0 1
2 0 2
3 1 3
4 0 4
This implementation scales badly in memory: when I use a DataFrame that occupies 190 kB saved as UTF-8, the function uses ~200 MB to create the TDM DataFrame. When the CSV file is 600 kB, the function uses 700 MB, and when the CSV is 3.8 MB the function uses up all of my memory and swap file (8 GB) and crashes.
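To see why the dense version grows so quickly, here is a rough back-of-the-envelope sketch. The row count, vocabulary size, and words-per-title figures are hypothetical and not taken from the data above; the helper names are made up for illustration. It assumes 8-byte integer counts (the default dtype of CountVectorizer) and, for the sparse estimate, a CSR layout with 4-byte indices.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to store every cell of a dense rows-by-columns matrix."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=8, index_size=4):
    """Approximate bytes for a CSR matrix: values + column indices + row pointers."""
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


# Hypothetical corpus: 50,000 short titles, 30,000 distinct words,
# about 5 distinct words per title.
n_titles, vocab_size, words_per_title = 50_000, 30_000, 5
nnz = n_titles * words_per_title  # number of non-zero counts

print("dense :", dense_bytes(n_titles, vocab_size) / 1e9, "GB")
print("sparse:", csr_bytes(n_titles, nnz) / 1e6, "MB")
```

The dense size grows with rows times vocabulary, and the vocabulary itself grows with the corpus, so memory rises much faster than the CSV size. The sparse size grows only with the number of non-zero counts.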
I also wrote an implementation using sparse matrices and sparse DataFrames (below), but the memory usage is pretty much the same, and it is considerably slower:
import numpy as np

def df2tdm_sparse(df, titleColumn, placementColumn):
    '''
    Takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
    of the words appearing in the titleColumn. This implementation uses sparse DataFrames.
    Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
    Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn
    Credits:
    https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
    https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
    https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
    '''
    pm = df[[placementColumn]].values
    tm = countvec.fit_transform(df[titleColumn])  # kept sparse, no .toarray()
    m = csc_matrix(hstack([pm, tm]))
    # each row is converted to a dense array before being wrapped in a SparseSeries
    # (pd.SparseDataFrame and pd.SparseSeries exist only in pandas < 1.0)
    dfout = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel()) for i in np.arange(m.shape[0])])
    dfout.columns = [placementColumn] + countvec.get_feature_names()
    return dfout
Any suggestions on how to reduce the memory usage? I wonder if this is related to the memory issues of scikit, e.g. here.
Recommended answer
I also think that the problem might be in the conversion from the sparse matrix to the sparse DataFrame.
Try this function (or something similar):
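def SparseMatrixToSparseDF(xSparseMatrix):
    import numpy as np
    import pandas as pd

    def ElementsToNA(x):
        # cast to float first: NaN cannot be stored in an integer array
        x = x.astype(float)
        x[x == 0] = np.nan
        return x

    # NaN is the default fill value of SparseSeries, so zeros replaced by NaN are not stored
    xdf1 = pd.SparseDataFrame([pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
                               for i in np.arange(xSparseMatrix.shape[0])])
    return xdf1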
You can see that it reduces the size by checking the density attribute of the result:
df1 = SparseMatrixToSparseDF(xSparseMatrix)
df1.density
I hope this helps.