Memory-efficient Python (pandas) aggregates of categories from one CSV file per period

Views: 160  Posted: 2017/2/26 16:54:30  Tags: python csv pandas segmentation-fault


Problem description

I am trying to avoid a segmentation fault with either pandas or IOPro (still investigating), so I am looking for alternative solutions, especially more efficient ones. The code below runs fine on small data but crashes when reading in 90 monthly panels of a few GBs on a Linux server with 256 GB RAM. Versions: pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0.

What I do here is that I aggregate accounts of drug purchase records (cost in TKOST) for each person (LopNr) and month, while also separating the drugs into categories using their ATC codes.

So while the original data would look like this, in monthly csv files (say July 2006 here, with many other columns in the csv I don't need):
LopNr TKOST ATC
1         5 N01
1        11 N01
1         6 N15
etc.

I wanted aggregate panels, with rows like 
LopNr TKOST year month
1        22 2006     7
either separately for a few categories (e.g. neuro for ATCs starting with N here), or with separate summaries for these categories in a single datafile (so with a neuro column etc.).
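For reference, the target output can be reproduced on the three sample rows with plain pandas. This is only a minimal sketch of the desired aggregation, not the code from the question: the `neuro` column name is taken from the description above, and the year and month are filled in by hand because they come from the file name rather than the file contents.

```python
import pandas as pd

# The three sample purchase rows from the July 2006 file shown above.
raw = pd.DataFrame({
    "LopNr": [1, 1, 1],
    "TKOST": [5, 11, 6],
    "ATC": ["N01", "N01", "N15"],
})
raw["year"] = 2006
raw["month"] = 7

# Variant 1: one cost total per person and month.
panel = raw.groupby(["LopNr", "year", "month"], as_index=False)["TKOST"].sum()

# Variant 2: the same panel plus a per-category column.
# Rows whose ATC code does not start with "N" contribute 0 to "neuro".
raw["neuro"] = raw["TKOST"].where(raw["ATC"].str.startswith("N"), 0)
wide = raw.groupby(["LopNr", "year", "month"], as_index=False)[["TKOST", "neuro"]].sum()
```

Here `panel` holds the single row `1, 2006, 7, 22` from the example, and `wide` carries the same total alongside the neuro subtotal.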

I opted for IOPro rather than plain pandas to be more memory-efficient, but now I am getting a segmentation fault.
# -*- coding: utf-8 -*-
import iopro
from pandas import *

neuro   = DataFrame()
cardio  = DataFrame()
cancer  = DataFrame()
addiction  = DataFrame()
Adrugs  = DataFrame()
Mdrugs  = DataFrame()
Vdrugs  = DataFrame()
all_drugs  = DataFrame()

for year in xrange(2005,2013):
    for month in xrange(1,13):
        if year == 2005 and month < 7:
            continue
        filename = 'PATH/lmed_' + str(year) + '_mon'+ str(month) +'.txt'
        adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
        monthly = adapter[['LopNr','ATC','TKOST']][:]
        monthly['year']=year
        monthly['month']=month
        neuro = neuro.append(monthly[(monthly.ATC.str.startswith('N')) & (~(monthly.TKOST.isnull()))])
        cardio = cardio.append(monthly[(monthly.ATC.str.startswith('C')) & (~(monthly.TKOST.isnull()))])
        cancer = cancer.append(monthly[(monthly.ATC.str.startswith('L')) & (~(monthly.TKOST.isnull()))])
        addiction = addiction.append(monthly[(monthly.ATC.str.startswith('N07')) & (~(monthly.TKOST.isnull()))])
        Adrugs = Adrugs.append(monthly[(monthly.ATC.str.startswith('A')) & (~(monthly.TKOST.isnull()))])
        Mdrugs = Mdrugs.append(monthly[(monthly.ATC.str.startswith('M')) & (~(monthly.TKOST.isnull()))])
        Vdrugs = Vdrugs.append(monthly[(monthly.ATC.str.startswith('V')) & (~(monthly.TKOST.isnull()))])
        all_drugs = all_drugs.append(monthly[(~(monthly.TKOST.isnull()))])
        del monthly

all_drugs = all_drugs.groupby(['LopNr','year','month']).sum()
all_drugs = all_drugs.astype(int,copy=False)
all_drugs.to_csv('PATH/monthly_all_drugs_costs.csv')
del all_drugs

neuro = neuro.groupby(['LopNr','year','month']).sum()
neuro = neuro.astype(int,copy=False)
neuro.to_csv('PATH/monthly_neuro_costs.csv')
del neuro

cardio = cardio.groupby(['LopNr','year','month']).sum()
cardio = cardio.astype(int,copy=False)
cardio.to_csv('PATH/monthly_cardio_costs.csv')
del cardio

cancer = cancer.groupby(['LopNr','year','month']).sum()
cancer = cancer.astype(int,copy=False)
cancer.to_csv('PATH/monthly_cancer_costs.csv')
del cancer

addiction = addiction.groupby(['LopNr','year','month']).sum()
addiction = addiction.astype(int,copy=False)
addiction.to_csv('PATH/monthly_addiction_costs.csv')
del addiction

Adrugs = Adrugs.groupby(['LopNr','year','month']).sum()
Adrugs = Adrugs.astype(int,copy=False)
Adrugs.to_csv('PATH/monthly_Adrugs_costs.csv')
del Adrugs

Mdrugs = Mdrugs.groupby(['LopNr','year','month']).sum()
Mdrugs = Mdrugs.astype(int,copy=False)
Mdrugs.to_csv('PATH/monthly_Mdrugs_costs.csv')
del Mdrugs

Vdrugs = Vdrugs.groupby(['LopNr','year','month']).sum()
Vdrugs = Vdrugs.astype(int,copy=False)
Vdrugs.to_csv('PATH/monthly_Vdrugs_costs.csv')
del Vdrugs

Solution
Your code is quite repetitive and could be simplified with dictionary and list comprehensions.  This solution should eliminate your memory issues, as you only process one month's data at a time (although you have a growing list of monthly summaries which I don't believe will use much memory).

I can't test this, but I believe it will do everything in your code above.
import pandas as pd
import iopro

items = {'neuro': 'N', 
         'cardio': 'C', 
         'cancer': 'L', 
         'addiction': 'N07', 
         'Adrugs': 'A', 
         'Mdrugs': 'M', 
         'Vdrugs': 'V', 
         'all_drugs': ''}

# 1. Create data container using dictionary comprehension.
monthly_summaries = {item: list() for item in items.keys()}

# 2. Perform monthly groupby operations.
for year in xrange(2005, 2013):
    for month in xrange(1, 13):
        if year == 2005 and month < 7:
            continue
        filename = 'PATH/lmed_' + str(year) + '_mon'+ str(month) +'.txt'
        adapter = iopro.text_adapter(filename,
                                     parser='csv', 
                                     field_names=True, 
                                     output='dataframe', 
                                     delimiter='\t')
        monthly = adapter[['LopNr','ATC','TKOST']][:]
        monthly['year'] = year
        monthly['month'] = month
        dfs = {name: monthly[(monthly.ATC.str.startswith('{0}'.format(code))) 
                             & (~(monthly.TKOST.isnull()))]
                     for name, code in items.iteritems()}
        [monthly_summaries[name].append(dfs[name].groupby(['LopNr','year','month']).sum()
                                        .astype(int, copy=False)) 
         for name in items.keys()]

# 3. Now concatenate all of the monthly summaries into separate DataFrames.
dfs = {name: pd.concat(monthly_summaries[name])
       for name in items.keys()}

# 4. Now regroup the aggregate monthly summaries.
monthly_summaries = {name: dfs[name].reset_index().groupby(['LopNr','year','month']).sum()
                    for name in items.keys()}

# 5. Finally, save the aggregated results to files.
[monthly_summaries[name].to_csv('PATH/monthly_{0}_costs.csv'.format(name))
 for name in items.keys()]
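The core idea, reducing each month to per-person totals before anything is accumulated, can also be sketched without IOPro. The version below is an untested-at-scale illustration under some assumptions: it swaps `iopro.text_adapter` for `pandas.read_csv` with `usecols`, the helper name `summarize_month` is made up for this example, and the in-memory sample stands in for one tab-separated monthly file.

```python
import io
import pandas as pd

# Category name -> ATC prefix, a subset of the dictionary above.
# The empty prefix matches every ATC code.
PREFIXES = {"neuro": "N", "addiction": "N07", "all_drugs": ""}

def summarize_month(source, year, month, prefixes=PREFIXES):
    """Read one monthly file and return {category: per-person cost totals}."""
    # usecols keeps the unneeded columns out of memory from the start.
    monthly = pd.read_csv(source, sep="\t",
                          usecols=["LopNr", "ATC", "TKOST"],
                          dtype={"ATC": str})
    monthly = monthly.dropna(subset=["TKOST"])
    out = {}
    for name, prefix in prefixes.items():
        # na=False treats a missing ATC code as "no match".
        mask = monthly["ATC"].str.startswith(prefix, na=False)
        totals = monthly.loc[mask].groupby("LopNr")["TKOST"].sum().reset_index()
        totals["year"] = year
        totals["month"] = month
        out[name] = totals
    return out

# Hypothetical stand-in for one monthly file, with an extra unused column.
sample = io.StringIO(
    "LopNr\tTKOST\tATC\tOTHER\n"
    "1\t5\tN01\tx\n"
    "1\t11\tN01\ty\n"
    "1\t6\tN15\tz\n"
    "2\t4\tC03\tw\n"
)
result = summarize_month(sample, 2006, 7)
```

In a full run, each call's results would be appended to a per-category list and combined once at the end with `pd.concat(..., ignore_index=True)`. Assuming every person-month appears in only one file, the monthly totals are already final, so a second groupby after concatenation should not be needed.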


                        