improve upon mapped lambdas in Python (pandas)
Question
I am digesting several csv files (each with one or more years of data) to categorize medical treatments into broad categories, while keeping only a subset of the original information and aggregating up to a monthly number (by AR = year, and month) of treatments per person (LopNr). Many treatments belong to several categories at once (multiple diagnosis codes are listed in the relevant column of the csv), so I split that field into a column of lists and flag a row for a category when any of its diagnosis codes falls in the relevant range of ICD-9 codes.
I am using IOPro to save on memory, but I am still running into a segfault (still investigating). The text files are several GBs each, but this machine has 256 GB of RAM, so either one of the packages is buggy or I need a more memory-efficient solution.
I am using versions pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0 under Linux.
So the original data would look something like this:
LopNr AR INDATUMA DIAGNOS …
1 2007 20070812 C32 F17
1 2007 20070816 C36
And I hope to see aggregates like this:
LopNr AR month tobacco …
1 2007 8 2
By the way, I need Stata dta files in the end, but I go through csv because pandas.DataFrame.to_stata seemed flaky in my experience; maybe I am missing something there too.
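(A minimal round-trip sanity check of to_stata, on toy data with an illustrative file name, would look something like this; column and value choices here are just for demonstration:)

```python
import os
import tempfile

import pandas as pd

# Toy frame with the kinds of columns described in the question
df = pd.DataFrame({'LopNr': [1, 1], 'AR': [2007, 2007], 'tobacco': [2, 0]})

# write_index=False keeps the index from coming back as an extra column
path = os.path.join(tempfile.mkdtemp(), 'treatments.dta')
df.to_stata(path, write_index=False)
back = pd.read_stata(path)

print(list(back.columns))
print(list(back['tobacco']))
```

Note that Stata stores integers in the smallest type that fits, so dtypes may not round-trip exactly even when the values do.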
# -*- coding: utf-8 -*-
import iopro
import numpy as np
import pandas as pd

all_treatments = pd.DataFrame()
filelist = ['oppenvard20012005', 'oppenvard20062010', 'oppenvard2011', 'oppenvard2012',
            'slutenvard1997', 'slutenvard2011', 'slutenvard2012', 'slutenvard19982004',
            'slutenvard20052010']

# Each predicate is True if any diagnosis code in the list falls in the
# relevant ICD-9 range(s) for that category
tobacco = lambda lst: any( (((x >= 'C30') and (x<'C40')) or ((x >= 'F17') and (x<'F18'))) for x in lst)
nutrition = lambda lst: any( (((x >= 'D50') and (x<'D54')) or ((x >= 'E10') and (x<'E15')) or ((x >= 'E40') and (x<'E47')) or ((x >= 'E50') and (x<'E69'))) for x in lst)
mental = lambda lst: any( (((x >= 'F') and (x<'G')) ) for x in lst)
alcohol = lambda lst: any( (((x >= 'F10') and (x<'F11')) or ((x >= 'K70') and (x<'K71'))) for x in lst)
circulatory = lambda lst: any( (((x >= 'I') and (x<'J')) ) for x in lst)
dental = lambda lst: any( (((x >= 'K02') and (x<'K04')) ) for x in lst)
accident = lambda lst: any( (((x >= 'V01') and (x<'X60')) ) for x in lst)
selfharm = lambda lst: any( (((x >= 'X60') and (x<'X85')) ) for x in lst)
cancer = lambda lst: any( (((x >= 'C') and (x<'D')) ) for x in lst)
endonutrimetab = lambda lst: any( (((x >= 'E') and (x<'F')) ) for x in lst)
pregnancy = lambda lst: any( (((x >= 'O') and (x<'P')) ) for x in lst)
other_stress = lambda lst: any( (((x >= 'J00') and (x<'J48')) or ((x >= 'L20') and (x<'L66')) or ((x >= 'K20') and (x<'K60')) or ((x >= 'R') and (x<'S')) or ((x >= 'X86') and (x<'Z77'))) for x in lst)

for file in filelist:
    filename = 'PATH' + file + '.txt'
    adapter = iopro.text_adapter(filename, parser='csv', field_names=True,
                                 output='dataframe', delimiter='\t')
    treatments = adapter[['LopNr', 'AR', 'DIAGNOS', 'INDATUMA']][:]
    # INDATUMA is an integer date in YYYYMMDD form; extract the month
    treatments['month'] = treatments['INDATUMA'] % 10000
    treatments['day'] = treatments['INDATUMA'] % 100
    treatments['month'] = (treatments['month'] - treatments['day']) / 100
    del treatments['day']
    # DIAGNOS holds space-separated diagnosis codes; split into lists of codes
    diagnoses = treatments['DIAGNOS'].str.split(' ')
    del treatments['DIAGNOS']
    treatments['tobacco'] = diagnoses.map(tobacco)
    treatments['nutrition'] = diagnoses.map(nutrition)
    treatments['mental'] = diagnoses.map(mental)
    treatments['alcohol'] = diagnoses.map(alcohol)
    treatments['circulatory'] = diagnoses.map(circulatory)
    treatments['dental'] = diagnoses.map(dental)
    treatments['accident'] = diagnoses.map(accident)
    treatments['selfharm'] = diagnoses.map(selfharm)
    treatments['cancer'] = diagnoses.map(cancer)
    treatments['endonutrimetab'] = diagnoses.map(endonutrimetab)
    treatments['pregnancy'] = diagnoses.map(pregnancy)
    treatments['other_stress'] = diagnoses.map(other_stress)
    all_treatments = all_treatments.append(treatments)

all_treatments = all_treatments.groupby(['LopNr', 'AR', 'month']).aggregate(np.count_nonzero)  # .sum()
all_treatments = all_treatments.astype(int, copy=False, raise_on_error=False)
all_treatments.to_csv('PATH.csv')
Answer
A few comments:
- As noted above, you should simplify your lambda expressions for readability, possibly using def.
For example:
def tobacco(codes):
    return any('C30' <= x < 'C40' or
               'F17' <= x < 'F18' for x in codes)
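Chained comparisons read naturally here, and lexicographic string comparison on the raw codes behaves as intended for these ranges; for instance:

```python
def tobacco(codes):
    # True if any ICD-9 code falls in the tobacco-related ranges
    return any('C30' <= x < 'C40' or
               'F17' <= x < 'F18' for x in codes)

print(tobacco(['C32', 'F17']))  # True: C32 is in [C30, C40)
print(tobacco(['C36']))         # True
print(tobacco(['K70']))         # False: K70 is in neither range
```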
You can also vectorize these functions as follows:
def tobacco(codes_column):
    return [any('C30' <= code < 'C40' or
                'F17' <= code < 'F18'
                for code in codes) if codes else False
            for codes in codes_column]

diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments['tobacco'] = tobacco(diagnoses)
- You initialize all_treatments to a DataFrame and then append to it. This is very inefficient. Try all_treatments = list(), and add all_treatments = pd.concat(all_treatments, ignore_index=True) outside the loop, just before your groupby. In addition, it should then be all_treatments.append(treatments) (vs. all_treatments = all_treatments.append(treatments)).
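A minimal sketch of that accumulate-then-concat pattern, with toy frames standing in for the per-file data:

```python
import pandas as pd

frames = []  # a plain Python list; list.append is cheap, DataFrame.append copies
for year in (2007, 2008):  # stand-in for the per-file loop
    treatments = pd.DataFrame({'LopNr': [1, 2], 'AR': [year, year]})
    frames.append(treatments)  # no reassignment: list.append mutates in place

# One concatenation at the end instead of repeated copying inside the loop
all_treatments = pd.concat(frames, ignore_index=True)
print(len(all_treatments))  # 4
```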
To calculate the month for the purpose of grouping, you can use:
all_treatments['month'] = all_treatments.INDATUMA % 10000 // 100
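On a sample YYYYMMDD date the arithmetic works out as:

```python
INDATUMA = 20070812              # integer date in YYYYMMDD form
month = INDATUMA % 10000 // 100  # % 10000 -> 812 (MMDD), // 100 -> 8
day = INDATUMA % 100             # -> 12
print(month, day)  # 8 12
```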
Lastly, instead of applying your lambda functions to each file once it's read, try applying them to the all_treatments DataFrame instead.
p.s. You may also want to try .sum() on your groupby statement instead of .aggregate(np.count_nonzero).
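Since the category columns are booleans, .sum() counts the True values per group, which matches count_nonzero here; a toy sketch:

```python
import pandas as pd

# Toy version of the per-treatment frame: boolean category flags per row
all_treatments = pd.DataFrame({
    'LopNr': [1, 1, 1],
    'AR': [2007, 2007, 2007],
    'month': [8, 8, 8],
    'tobacco': [True, True, False],
})

# Summing booleans per (person, year, month) counts the flagged treatments
monthly = all_treatments.groupby(['LopNr', 'AR', 'month']).sum()
print(monthly['tobacco'].tolist())  # [2]
```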