Huge sparse dataframe to scipy sparse matrix without dense transform
Question
Have data with more than 1 million rows and 30 columns; one of the columns is user_id (more than 1500 different users). I want to one-hot-encode this column and use the data in ML algorithms (xgboost, FFM, scikit). But due to the huge number of rows and unique user values the matrix will be ~1 million x 1500, so this needs to be done in sparse format (otherwise the data kills all RAM).
For me a convenient way to work with the data is through a pandas DataFrame, which now also supports sparse format:
df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
This works pretty fast and has a small size in RAM. But to work with scikit algos and xgboost, it's necessary to transform the dataframe to a sparse matrix.
Is there any way to do this other than iterating through the columns and hstacking them into one scipy sparse matrix? I tried df.as_matrix() and df.values, but both first transform the data to dense, which raises a MemoryError :(
P.S. The same applies to getting a DMatrix for xgboost
Update:
So I came up with the following solution (I will be thankful for optimisation suggestions):
def sparse_df_to_sparse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    sparse_matrix = None

    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        if sparse_matrix is not None:
            sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
        else:
            sparse_matrix = curr_sps_column
        matrix_columns.extend(cols)
    return sparse_matrix, index_list, matrix_columns
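The core of the function above is repeated scipy.sparse.hstack calls over per-column sparse vectors. A tiny self-contained illustration of that pattern (toy data, not the original dataframe), plus the one optimisation that matters: hstacking pairwise inside the loop is quadratic in the number of columns, while a single hstack over a collected list is linear:

```python
import numpy as np
from scipy import sparse

# Two sparse column vectors (shape n x 1), like .to_coo() yields per column.
col_a = sparse.coo_matrix(np.array([[1.0], [0.0], [3.0]]))
col_b = sparse.coo_matrix(np.array([[0.0], [2.0], [0.0]]))

# hstack keeps everything sparse; no dense intermediate is materialised.
# Collecting columns in a list and calling hstack ONCE avoids re-copying
# the accumulated matrix on every iteration.
combined = sparse.hstack([col_a, col_b])

print(combined.shape)  # (3, 2)
print(combined.toarray())
```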
And the following code allows getting a sparse dataframe:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)
I have created a sparse matrix of 1.1 million rows x 1150 columns. But during creation it still uses a significant amount of RAM (~10 Gb, on the edge with my 12 Gb).
I don't know why, because the resulting sparse matrix uses only 300 Mb (after loading from HDD). Any ideas?
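One way to see why the final matrix is so small compared to the peak: the real footprint of a COO matrix is just its three underlying arrays (data, row, col), independent of the dense shape, and it can be measured directly. The peak RAM during creation is plausibly intermediate copies (stacked Series, per-iteration hstack results), not the result itself. A sketch with synthetic data of roughly the size discussed (the exact figure depends on the index dtype scipy chooses):

```python
import numpy as np
from scipy import sparse

# Synthetic matrix near the size in question: 1.1M x 1150 with one
# stored value per row, as in pure one-hot columns.
n_rows, n_cols = 1_100_000, 1150
rows = np.arange(n_rows)
cols = np.random.randint(0, n_cols, size=n_rows)
data = np.ones(n_rows, dtype=np.float64)
m = sparse.coo_matrix((data, (rows, cols)), shape=(n_rows, n_cols))

# Footprint = the three stored arrays only; the 1.1M x 1150 dense shape
# never costs anything by itself.
footprint_mb = (m.data.nbytes + m.row.nbytes + m.col.nbytes) / 1e6
print(round(footprint_mb))  # tens of MB, not GB
```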
Answer
You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()
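As a side note for readers on pandas >= 1.0: SparseSeries and .to_sparse() were removed there, and the equivalent conversion is the .sparse accessor on a DataFrame whose columns all have a sparse dtype (which get_dummies(..., sparse=True) produces). A sketch, assuming a newer pandas; the column values are illustrative:

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'user_id': ['a', 'b', 'a', 'c'],
                   'type': ['x', 'y', 'x', 'x']})

# With sparse=True every resulting dummy column gets a SparseDtype.
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)

# The .sparse accessor converts the whole frame without densifying.
coo = one_hot_df.sparse.to_coo()
print(coo.shape)             # (4, 5): 3 user_id dummies + 2 type dummies
print(sparse.issparse(coo))  # True
```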
This method, instead of taking a DataFrame (rows / columns), takes a Series with rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even if it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
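That last point can be checked directly on a tiny frame: stacking drops the NaN cells, so the resulting MultiIndex Series holds only the real values (a sketch; the explicit .dropna() makes the behaviour version-independent, since very recent pandas may keep NaNs in stack()):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 2.0]})

# Flatten to a (row, column) MultiIndex Series and drop the NaN cells:
# only the two non-null elements remain out of four cells.
stacked = df.stack().dropna()
print(len(stacked))            # 2
print(stacked.index.tolist())  # [(0, 'a'), (1, 'b')]
```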