Handling a very big dataframe
Question
Right now I'm having trouble processing my data and transforming it into a dataframe. Basically what I'm trying to do is to read the data first
data = pd.read_csv(querylog, sep=" ", header=None)
then group it
query_group = data.groupby('Query')
ip_group = data.groupby('IP')
and lastly create a blank dataframe to map their values
df = pd.DataFrame(columns=query_group.groups, index=range(0, len(ip_group.groups)))
index = 0
for name, group in ip_group:
    df.set_value(index, 'IP', name)
    index += 1
df = df.set_index('IP')
for index, row in data.iterrows():
    df.set_value(row['IP'], row['Query'], 1)
    print(index)
df = df.fillna(0)
So my problem is that ip_group can reach a size of 6000 and query_group a size of 400000, which results in a very big blank dataframe that does not fit in my memory. Can anyone help me solve this issue? Any help is appreciated.
A sample dataframe of the data would look like this
data = pd.DataFrame( { "Query" : ["google.com", "youtube.com", "facebook.com"],
                       "IP" : ["192.168.0.104", "192.168.0.103", "192.168.0.104"] } )
and my expected output would look like this
               google.com  youtube.com  facebook.com
IP
192.168.0.104           1            0             1
192.168.0.103           0            1             0
Answer
IIUC you can use get_dummies, but without the full data it's hard to find the best solution:
df = pd.get_dummies(data.set_index('IP')['Query'])
print(df.groupby(df.index).sum())
Sample:
import pandas as pd

data = pd.DataFrame( { "Query" : ["a", "b", "c", "d", "a", "b"],
                       "IP" : [1,5,4,8,3,4] } )
print(data)
   IP Query
0   1     a
1   5     b
2   4     c
3   8     d
4   3     a
5   4     b

#set index from column IP
data = data.set_index('IP')
#get dummies from column Query
df = pd.get_dummies(data['Query'])
print(df)
    a  b  c  d
IP
1   1  0  0  0
5   0  1  0  0
4   0  0  1  0
8   0  0  0  1
3   1  0  0  0
4   0  1  0  0

#groupby by index and sum columns
print(df.groupby(df.index).sum())
    a  b  c  d
IP
1   1  0  0  0
3   1  0  0  0
4   0  1  1  0
5   0  1  0  0
8   0  0  0  1
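On newer pandas versions the same IP-by-Query count matrix can also be built in one step with pd.crosstab, which never materializes a blank dataframe; a minimal sketch using the question's sample data:

```python
import pandas as pd

# Sample data from the question
data = pd.DataFrame({"Query": ["google.com", "youtube.com", "facebook.com"],
                     "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"]})

# crosstab counts occurrences of each (IP, Query) pair directly,
# producing the desired IP x Query matrix of 0/1 counts
counts = pd.crosstab(data["IP"], data["Query"])
print(counts)
```

Note that crosstab returns counts, so an IP that repeats the same query would show values greater than 1; clip or compare to 0 if a strict 0/1 indicator is needed.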
Try converting the dtype to int8 with astype to use about three times less memory:
import numpy as np

pd.get_dummies(data['Query']).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes

pd.get_dummies(data['Query']).astype(np.int8).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null int8
google.com      3 non-null int8
youtube.com     3 non-null int8
dtypes: int8(3)
memory usage: 33.0+ bytes
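On pandas 0.23 and later the separate astype step can be skipped by passing dtype directly to get_dummies, so the small integer columns are created from the start; a short sketch (the series values here are just the question's sample queries):

```python
import numpy as np
import pandas as pd

s = pd.Series(["google.com", "youtube.com", "facebook.com"])

# dtype= makes get_dummies emit int8 columns immediately,
# avoiding the intermediate wider-dtype frame entirely
df = pd.get_dummies(s, dtype=np.int8)
print(df.dtypes)
```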
pd.get_dummies(data['Query'], sparse=True).info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
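SparseDataFrame has since been removed from pandas; on current versions sparse=True instead returns an ordinary DataFrame whose columns use a Sparse dtype, which still keeps memory low when the dummies are mostly zero. A sketch of what that looks like now:

```python
import pandas as pd

s = pd.Series(["google.com", "youtube.com", "facebook.com"])

# each column gets a Sparse dtype (e.g. Sparse[bool, False])
# rather than the old SparseDataFrame class
sparse_df = pd.get_dummies(s, sparse=True)
print(sparse_df.dtypes)
```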