Memory-efficient Python (pandas) aggregates of categories from one CSV file per period


Problem description

I am trying to avoid a segmentation fault with either pandas or IOPro (still investigating), so I am looking for alternative solutions, esp. more efficient ones. The code below runs fine with small data but crashed reading in 90 monthly panels of a few GBs on a Linux server with 256 GB RAM, versions pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0.

What I do here is that I aggregate accounts of drug purchase records (cost in TKOST) for each person (LopNr) and month, while also separating the drugs into categories using their ATC codes.

So while the original data would look like this, in monthly csv files (say July 2006 here, with many other columns in the csv I don't need):

LopNr TKOST ATC
1         5 N01
1        11 N01
1         6 N15

etc.

I wanted aggregate panels, with rows like

LopNr TKOST year month
1        22 2006     7

either separately for a few categories (e.g. neuro for ATCs starting with N here), or with separate summaries for these categories in a single datafile (so with a neuro column etc.).
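As a minimal illustration of the target output (a sketch added for clarity, not part of the original code), the three sample rows above can be summed per person and month in plain Python. The function name `aggregate` and the tuple layout of `rows` are assumptions made only for this example.

```python
from collections import defaultdict

# Sample rows from the July 2006 file: (LopNr, TKOST, ATC).
rows = [(1, 5, "N01"), (1, 11, "N01"), (1, 6, "N15")]


def aggregate(rows, year, month, prefix=""):
    """Sum TKOST per (LopNr, year, month) for ATC codes starting with prefix."""
    totals = defaultdict(int)
    for lopnr, tkost, atc in rows:
        if atc.startswith(prefix):
            totals[(lopnr, year, month)] += tkost
    return dict(totals)


# All three sample rows have ATC codes starting with "N" (the neuro category).
neuro_july_2006 = aggregate(rows, 2006, 7, prefix="N")
```

Here `neuro_july_2006` holds a single entry for person 1 in July 2006, matching the aggregate row shown above.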

I opted for IOPro rather than plain pandas to be more memory-efficient, but now I am getting a segmentation fault.

# -*- coding: utf-8 -*-
import iopro
from pandas import *

neuro   = DataFrame()
cardio  = DataFrame()
cancer  = DataFrame()
addiction  = DataFrame()
Adrugs  = DataFrame()
Mdrugs  = DataFrame()
Vdrugs  = DataFrame()
all_drugs  = DataFrame()

for year in xrange(2005,2013):
    for month in xrange(1,13):
        if year == 2005 and month < 7:
            continue
        filename = 'PATH/lmed_' + str(year) + '_mon'+ str(month) +'.txt'
        adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
        monthly = adapter[['LopNr','ATC','TKOST']][:]
        monthly['year']=year
        monthly['month']=month
        neuro = neuro.append(monthly[(monthly.ATC.str.startswith('N')) & (~(monthly.TKOST.isnull()))])
        cardio = cardio.append(monthly[(monthly.ATC.str.startswith('C')) & (~(monthly.TKOST.isnull()))])
        cancer = cancer.append(monthly[(monthly.ATC.str.startswith('L')) & (~(monthly.TKOST.isnull()))])
        addiction = addiction.append(monthly[(monthly.ATC.str.startswith('N07')) & (~(monthly.TKOST.isnull()))])
        Adrugs = Adrugs.append(monthly[(monthly.ATC.str.startswith('A')) & (~(monthly.TKOST.isnull()))])
        Mdrugs = Mdrugs.append(monthly[(monthly.ATC.str.startswith('M')) & (~(monthly.TKOST.isnull()))])
        Vdrugs = Vdrugs.append(monthly[(monthly.ATC.str.startswith('V')) & (~(monthly.TKOST.isnull()))])
        all_drugs = all_drugs.append(monthly[(~(monthly.TKOST.isnull()))])
        del monthly

all_drugs = all_drugs.groupby(['LopNr','year','month']).sum()
all_drugs = all_drugs.astype(int,copy=False)
all_drugs.to_csv('PATH/monthly_all_drugs_costs.csv')
del all_drugs

neuro = neuro.groupby(['LopNr','year','month']).sum()
neuro = neuro.astype(int,copy=False)
neuro.to_csv('PATH/monthly_neuro_costs.csv')
del neuro

cardio = cardio.groupby(['LopNr','year','month']).sum()
cardio = cardio.astype(int,copy=False)
cardio.to_csv('PATH/monthly_cardio_costs.csv')
del cardio

cancer = cancer.groupby(['LopNr','year','month']).sum()
cancer = cancer.astype(int,copy=False)
cancer.to_csv('PATH/monthly_cancer_costs.csv')
del cancer

addiction = addiction.groupby(['LopNr','year','month']).sum()
addiction = addiction.astype(int,copy=False)
addiction.to_csv('PATH/monthly_addiction_costs.csv')
del addiction

Adrugs = Adrugs.groupby(['LopNr','year','month']).sum()
Adrugs = Adrugs.astype(int,copy=False)
Adrugs.to_csv('PATH/monthly_Adrugs_costs.csv')
del Adrugs

Mdrugs = Mdrugs.groupby(['LopNr','year','month']).sum()
Mdrugs = Mdrugs.astype(int,copy=False)
Mdrugs.to_csv('PATH/monthly_Mdrugs_costs.csv')
del Mdrugs

Vdrugs = Vdrugs.groupby(['LopNr','year','month']).sum()
Vdrugs = Vdrugs.astype(int,copy=False)
Vdrugs.to_csv('PATH/monthly_Vdrugs_costs.csv')
del Vdrugs

Solution

Your code is quite repetitive and could be simplified with dictionary and list comprehensions. This solution should eliminate your memory issues, as you only process one month's data at a time (although you keep a growing list of monthly summaries, which I don't believe will use much memory).

I can't test this, but I believe it will do everything in your code above.

import pandas as pd
import iopro

items = {'neuro': 'N', 
         'cardio': 'C', 
         'cancer': 'L', 
         'addiction': 'N07', 
         'Adrugs': 'A', 
         'Mdrugs': 'M', 
         'Vdrugs': 'V', 
         'all_drugs': ''}

# 1. Create data container using dictionary comprehension.
monthly_summaries = {item: list() for item in items.keys()}

# 2. Perform monthly groupby operations.
for year in xrange(2005, 2013):
    for month in xrange(1, 13):
        if year == 2005 and month < 7:
            continue
        filename = 'PATH/lmed_' + str(year) + '_mon'+ str(month) +'.txt'
        adapter = iopro.text_adapter(filename,
                                     parser='csv', 
                                     field_names=True, 
                                     output='dataframe', 
                                     delimiter='\t')
        monthly = adapter[['LopNr','ATC','TKOST']][:]
        monthly['year'] = year
        monthly['month'] = month
        dfs = {name: monthly[(monthly.ATC.str.startswith('{0}'.format(code))) 
                             & (~(monthly.TKOST.isnull()))]
                     for name, code in items.iteritems()}
        [monthly_summaries[name].append(dfs[name].groupby(['LopNr','year','month']).sum()
                                        .astype(int, copy=False)) 
         for name in items.keys()]

# 3. Now concatenate all of the monthly summaries into separate DataFrames.
# (pd.concat takes the list itself; the (LopNr, year, month) index is kept for step 4.)
dfs = {name: pd.concat(monthly_summaries[name])
       for name in items.keys()}

# 4. Now regroup the aggregate monthly summaries.
monthly_summaries = {name: dfs[name].reset_index().groupby(['LopNr','year','month']).sum()
                    for name in items.keys()}

# 5. Finally, save the aggregated results to files.
[monthly_summaries[name].to_csv('PATH/monthly_{0}_costs.csv'.format(name))
 for name in items.keys()]
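
If the crash comes from loading a whole month into a DataFrame, one possible alternative is to stream each file row by row and keep only the running totals. The sketch below uses just the standard library and is not the approach from the answer above. It assumes tab-separated files with a header row containing `LopNr`, `TKOST` and `ATC`, and the helper name `aggregate_month` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Category name -> ATC prefix, as in the answer above ('' matches every code).
ITEMS = {"neuro": "N", "cardio": "C", "cancer": "L", "addiction": "N07",
         "Adrugs": "A", "Mdrugs": "M", "Vdrugs": "V", "all_drugs": ""}


def aggregate_month(lines, year, month, totals):
    """Add one month's costs to totals[name][(LopNr, year, month)].

    `lines` is any iterable of tab-separated text lines with a header row,
    for example an open file object.
    """
    for row in csv.DictReader(lines, delimiter="\t"):
        cost = (row.get("TKOST") or "").strip()
        if not cost:  # skip rows with a missing cost
            continue
        atc = row.get("ATC") or ""
        key = (int(row["LopNr"]), year, month)
        for name, prefix in ITEMS.items():
            if atc.startswith(prefix):
                totals[name][key] += float(cost)
    return totals


# Running totals per category; only these dictionaries stay in memory.
totals = {name: defaultdict(float) for name in ITEMS}

# Stand-in for one monthly file (assumed layout), including a row without a cost.
sample = io.StringIO(
    "LopNr\tTKOST\tATC\n1\t5\tN01\n1\t11\tN01\n1\t6\tN15\n2\t\tC01\n")
aggregate_month(sample, 2006, 7, totals)
```

In real use, `sample` would be replaced by an open file for each month inside the year/month loop. Memory then grows with the number of distinct (person, month) keys per category rather than with the raw file size.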
