Improve upon mapped lambdas in Python (pandas)


Question

I am digesting several csv files (each with one or more years of data) to sort medical treatments into broad categories, while keeping only a subset of the original information and aggregating up to a monthly number of treatments per person (LopNr), grouped by year (AR) and month. Many treatments belong to several categories at once: multiple diagnosis codes are listed in the relevant csv column, so I split that field into a column of lists and categorize each row by whether any of its diagnosis codes falls in a relevant range of ICD-9 codes.

I am using IOPro to save on memory, but I am still running into a segfault (still investigating). The text files are several GB each, but this machine has 256 GB of RAM. Either one of the packages is buggy, or I need a more memory-efficient solution.

I am using pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0 under Linux.

The original data looks something like this:

LopNr   AR INDATUMA DIAGNOS …
1     2007 20070812 C32 F17
1     2007 20070816     C36

And I hope to see aggregates like this:

LopNr   AR month tobacco …
1     2007     8       2

By the way, I need Stata dta files in the end, but I go through csv because pandas.DataFrame.to_stata seemed flaky in my experience; maybe I am missing something there too.

# -*- coding: utf-8 -*-
import iopro
import numpy as np
from pandas import *

all_treatments  = DataFrame()
filelist = ['oppenvard20012005','oppenvard20062010','oppenvard2011','oppenvard2012','slutenvard1997','slutenvard2011','slutenvard2012','slutenvard19982004','slutenvard20052010']

tobacco = lambda lst: any( (((x >= 'C30') and (x<'C40')) or ((x >= 'F17') and (x<'F18')))  for x in lst)
nutrition = lambda lst: any( (((x >= 'D50') and (x<'D54')) or ((x >= 'E10') and (x<'E15')) or ((x >= 'E40') and (x<'E47')) or ((x >= 'E50') and (x<'E69')))  for x in lst)
mental = lambda lst: any( (((x >= 'F') and (x<'G')) )  for x in lst)
alcohol = lambda lst: any( (((x >= 'F10') and (x<'F11')) or ((x >= 'K70') and (x<'K71')))  for x in lst)
circulatory = lambda lst: any( (((x >= 'I') and (x<'J')) )  for x in lst)
dental = lambda lst: any( (((x >= 'K02') and (x<'K04')) )  for x in lst)
accident = lambda lst: any( (((x >= 'V01') and (x<'X60')) )  for x in lst)
selfharm = lambda lst: any( (((x >= 'X60') and (x<'X85')) )  for x in lst)
cancer = lambda lst: any( (((x >= 'C') and (x<'D')) )  for x in lst)
endonutrimetab = lambda lst: any( (((x >= 'E') and (x<'F')) )  for x in lst)
pregnancy = lambda lst: any( (((x >= 'O') and (x<'P')) )  for x in lst)
other_stress = lambda lst: any( (((x >= 'J00') and (x<'J48')) or ((x >= 'L20') and (x<'L66')) or ((x >= 'K20') and (x<'K60')) or ((x >= 'R') and (x<'S')) or ((x >= 'X86') and (x<'Z77')))  for x in lst)

for file in filelist:
    filename = 'PATH' + file +'.txt'
    adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
    treatments = adapter[['LopNr','AR','DIAGNOS','INDATUMA']][:]
    treatments['month'] = treatments['INDATUMA'] % 10000
    treatments['day'] = treatments['INDATUMA'] % 100
    treatments['month'] = (treatments['month']-treatments['day'])/100  
    del treatments['day']
    diagnoses = treatments['DIAGNOS'].str.split(' ')
    del treatments['DIAGNOS']
    treatments['tobacco'] = diagnoses.map(tobacco)
    treatments['nutrition'] = diagnoses.map(nutrition)
    treatments['mental'] = diagnoses.map(mental)
    treatments['alcohol'] = diagnoses.map(alcohol)
    treatments['circulatory'] = diagnoses.map(circulatory)
    treatments['dental'] = diagnoses.map(dental)
    treatments['accident'] = diagnoses.map(accident)
    treatments['selfharm'] = diagnoses.map(selfharm)
    treatments['cancer'] = diagnoses.map(cancer)
    treatments['endonutrimetab'] = diagnoses.map(endonutrimetab)
    treatments['pregnancy'] = diagnoses.map(pregnancy)
    treatments['other_stress'] = diagnoses.map(other_stress)
    all_treatments = all_treatments.append(treatments)
all_treatments = all_treatments.groupby(['LopNr','AR','month']).aggregate(np.count_nonzero) #.sum()
all_treatments = all_treatments.astype(int,copy=False,raise_on_error=False)
all_treatments.to_csv('PATH.csv')


Answer

A few comments:


  1. You should simplify your lambda expressions for readability, possibly using def.

For example:

def tobacco(codes):
    return any( 'C30' <= x < 'C40' or
                'F17' <= x < 'F18'  for x in codes)

You can also make these functions work on a whole column at once, using a list comprehension over the rows instead of a per-row map call:

def tobacco(codes_column):
    # Rows whose DIAGNOS was missing are not lists after str.split; treat them as False.
    return [any('C30' <= code < 'C40' or
                'F17' <= code < 'F18'
                for code in codes) if isinstance(codes, list) else False
            for codes in codes_column]

diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments['tobacco'] = tobacco(diagnoses)
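
Since all twelve categories share the same shape (any code inside one of several half-open ranges), the same idea can be written once and driven by a table. The following is only a sketch: the names CATEGORY_RANGES, in_category and classify are hypothetical, and only three of the question's categories are copied in.

```python
# Hypothetical table of half-open code ranges [low, high) per category;
# only three of the question's categories are reproduced here.
CATEGORY_RANGES = {
    'tobacco': [('C30', 'C40'), ('F17', 'F18')],
    'alcohol': [('F10', 'F11'), ('K70', 'K71')],
    'mental': [('F', 'G')],
}


def in_category(codes, ranges):
    """Return True if any code falls in any [low, high) range."""
    if not isinstance(codes, list):  # missing DIAGNOS values are not lists
        return False
    return any(low <= code < high for code in codes for low, high in ranges)


def classify(codes_column):
    """Map each category name to a list of booleans, one per row."""
    return {name: [in_category(codes, ranges) for codes in codes_column]
            for name, ranges in CATEGORY_RANGES.items()}


# Small made-up column of split diagnosis codes, including a missing row.
flags = classify([['C32', 'F17'], ['C36'], ['K70'], None])
```

Each list in flags can then be assigned to the column of the same name, which replaces the twelve separate map calls with one loop over the dictionary.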




  2. You initialize all_treatments to a DataFrame and then append to it inside the loop. This is very inefficient, because every DataFrame.append call copies all the data accumulated so far. Try all_treatments = list() instead, and add all_treatments = pd.concat(all_treatments, ignore_index=True) outside the loop, just before your groupby. Inside the loop, it should then be all_treatments.append(treatments) rather than all_treatments = all_treatments.append(treatments), since list.append works in place and returns None. (With from pandas import * as in the question, call concat directly or add import pandas as pd.)
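
A minimal sketch of that pattern, written for a current pandas rather than the 0.16.2 from the question. The chunks list is a made-up stand-in for the frames that the IOPro adapter returns per file, and frames is a hypothetical name for the accumulating list.

```python
import pandas as pd

# Hypothetical stand-ins for the per-file frames read inside the loop.
chunks = [
    pd.DataFrame({'LopNr': [1, 1], 'AR': [2007, 2007],
                  'INDATUMA': [20070812, 20070816]}),
    pd.DataFrame({'LopNr': [2], 'AR': [2008], 'INDATUMA': [20080103]}),
]

frames = []                      # plain Python list, cheap to append to
for treatments in chunks:
    frames.append(treatments)    # in place; list.append returns None

# One concatenation at the end instead of one copy per file.
all_treatments = pd.concat(frames, ignore_index=True)
```

With ignore_index=True the combined frame gets a fresh 0..n-1 index, so row labels from different files do not collide.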

To calculate the month for grouping, you can use:

all_treatments['month'] = all_treatments.INDATUMA % 10000 // 100
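
The arithmetic can be checked on a few values, assuming INDATUMA holds integers of the form YYYYMMDD as in the sample data (the dates below are made up for illustration):

```python
import pandas as pd

dates = pd.Series([20070812, 20070816, 20111203])  # YYYYMMDD integers
mmdd = dates % 10000        # drop the year: 812, 816, 1203
month = mmdd // 100         # drop the day: 8, 8, 12
```

Because % and // have the same precedence and associate left to right, the one-line form needs no parentheses, and the floor division keeps the result an integer.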

Lastly, instead of applying your lambda functions to each file once it is read, try applying them once to the combined all_treatments DataFrame. This means keeping the DIAGNOS column in each per-file frame until after the concatenation.

P.S. You may also want to try .sum() on your groupby instead of .aggregate(np.count_nonzero); summing a boolean column counts its True values.
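
As a sketch of what .sum() gives, here are the two sample rows from the question after classification. The frame is built by hand for illustration, and monthly is a hypothetical name.

```python
import pandas as pd

all_treatments = pd.DataFrame({
    'LopNr': [1, 1],
    'AR': [2007, 2007],
    'month': [8, 8],
    'tobacco': [True, True],   # both sample rows carry a C30-C39 code
})

monthly = (all_treatments
           .groupby(['LopNr', 'AR', 'month'])
           .sum()              # True counts as 1, False as 0
           .astype(int)
           .reset_index())
```

This yields a single row for LopNr 1, year 2007, month 8 with a tobacco count of 2, matching the aggregate the question asks for.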
