Handling a very big dataframe


Problem Description

Right now I'm having trouble on how to process my data and transform it into a dataframe. Basically what I'm trying to do is to read the data first

data = pd.read_csv(querylog, sep=" ", header=None)

and then group it

query_group = data.groupby('Query')
ip_group = data.groupby('IP')

and lastly create a blank dataframe to map their values

df = pd.DataFrame(columns=query_group.groups, index=range(0, len(ip_group.groups)))

index = 0
for name, group in ip_group:
    df.at[index, 'IP'] = name   # .at replaces the removed DataFrame.set_value
    index += 1
df = df.set_index('IP')

for index, row in data.iterrows():
    df.at[row['IP'], row['Query']] = 1
    print(index)
df = df.fillna(0)

So my problem is that the ip_group can go up to a size of 6000 and query_group up to 400000 which would result in making a very big blank dataframe that my memory cannot handle. Can anyone help me on how to solve this issue? Any help is appreciated.
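The memory pressure is easy to estimate before ever running the code: a dense 6000 x 400000 frame of pandas' default float64 cells needs roughly 19 GB. A back-of-the-envelope sketch:

```python
# Rough size of the blank dense frame the question's code allocates:
# 6000 IPs x 400000 queries, 8 bytes per float64 cell.
rows = 6_000
cols = 400_000
bytes_per_cell = 8  # float64
total_bytes = rows * cols * bytes_per_cell
print(total_bytes / 1e9)  # ~19.2 GB, before any values are written
```

which is why the blank-then-fill approach exhausts memory even though the actual data is tiny and mostly zeros.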

Sample dataframe of the data would look like this

data = pd.DataFrame( { "Query" : ["google.com", "youtube.com", "facebook.com"],
     "IP" : ["192.168.0.104", "192.168.0.103","192.168.0.104"] } )

and my expected output would look like this

                google.com youtube.com  facebook.com
IP            
192.168.0.104   1          0             1
192.168.0.103   0          1             0

Recommended Answer

IIUC you can use get_dummies, but without the real data it is hard to pick the best solution:

df = pd.get_dummies(data.set_index('IP')['Query'])
print(df.groupby(df.index).sum())
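An alternative not in the original answer: pd.crosstab builds the IP x Query table in one step, without the intermediate dummy frame. It produces counts per pair, so clip(upper=1) turns them into the 0/1 presence flags the question asks for. A sketch using the question's sample data:

```python
import pandas as pd

data = pd.DataFrame({
    "Query": ["google.com", "youtube.com", "facebook.com"],
    "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"],
})

# crosstab builds the IP x Query contingency table directly,
# never materializing a blank 6000 x 400000 frame
table = pd.crosstab(data["IP"], data["Query"]).clip(upper=1)
print(table)
```

Note that crosstab sorts both the index and the columns, so the row/column order may differ from the expected output shown above.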

Sample:

import numpy as np
import pandas as pd

data = pd.DataFrame( { "Query" : ["a", "b", "c", "d", "a" , "b"],
     "IP" : [1,5,4,8,3,4] } )
print(data)
   IP Query
0   1     a
1   5     b
2   4     c
3   8     d
4   3     a
5   4     b

#set index from column data
data = data.set_index('IP')

#get dummies from column Query
df = pd.get_dummies(data['Query'])
print(df)
    a  b  c  d
IP            
1   1  0  0  0
5   0  1  0  0
4   0  0  1  0
8   0  0  0  1
3   1  0  0  0
4   0  1  0  0

#groupby by index and sum columns
print(df.groupby(df.index).sum())
    a  b  c  d
IP            
1   1  0  0  0
3   1  0  0  0
4   0  1  1  0
5   0  1  0  0
8   0  0  0  1

Try converting to int8 with astype, which uses about a third of the memory:

pd.get_dummies(data['Query']).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes

pd.get_dummies(data['Query']).astype(np.int8).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null int8
google.com      3 non-null int8
youtube.com     3 non-null int8
dtypes: int8(3)
memory usage: 33.0+ bytes
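The same comparison can be made programmatically with memory_usage, which is handy when sizing the full 6000 x 400000 case. A sketch on the question's sample data (exact byte counts depend on the pandas version; the 8:1 ratio between float64 and int8 cells does not):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Query": ["google.com", "youtube.com", "facebook.com"],
    "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"],
}).set_index("IP")

dense = pd.get_dummies(data["Query"]).astype(np.float64)
small = pd.get_dummies(data["Query"]).astype(np.int8)

# int8 cells are 1 byte each vs 8 bytes for float64
print(dense.memory_usage(index=False).sum())
print(small.memory_usage(index=False).sum())
```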

pd.get_dummies(data['Query'], sparse=True).info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
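Note that SparseDataFrame was removed in newer pandas; there, get_dummies(..., sparse=True) returns an ordinary DataFrame whose columns use a SparseDtype, so zeros cost almost no memory. A sketch to verify this (the exact backing dtype, e.g. Sparse[uint8] vs Sparse[bool], varies by pandas version):

```python
import pandas as pd

data = pd.DataFrame({
    "Query": ["google.com", "youtube.com", "facebook.com"],
    "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"],
}).set_index("IP")

dummies = pd.get_dummies(data["Query"], sparse=True)

# every column should report a sparse dtype
print(all(isinstance(dt, pd.SparseDtype) for dt in dummies.dtypes))
```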

