Understanding this Pandas script


Problem description

I received this code, which groups data into histogram-type data. I have been attempting to understand the code in this pandas script in order to edit, manipulate and duplicate it. I have added comments for the sections I understand.

Code

import numpy as np
import pandas as pd


column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 
              'col7', 'col8', 'col9', 'col10', 'col11'] #names to be used as column labels.  If no names are specified then columns can be referred to by number, e.g. df[0], df[1] etc.

df = pd.read_csv('data.csv', header=None, names=column_names) #header= None means there are no column headings in the  csv file

df.ix[df.col11 == 'x', 'col11']=-0.08 #trick so that 'x' rows will be grouped into a category >-0.1 and <= -0.05.  This will allow all of col11 to be treated as numbers

bins = np.arange(-0.1, 1.0, 0.05) #bins to put col11 values in.  >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values.
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]) #create labels for the bins
labels[0] = 'x' #change first bin label to 'x'
labels[1] = '0' #change second bin label to '0'

df['col11'] = df['col11'].astype(float) #convert col11 to numbers so we can do math on them


df['bin'] = pd.cut(df['col11'], bins=bins, labels=False) # make another column 'bin' and put in an integer representing which bin the number falls into. Later we'll map the integer to the bin label


df.set_index('bin', inplace=True, drop=False, append=False) #groupby is meant to run faster with an index

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x==1)

dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]

dfg.ix['x',('col11', 'mean')]='N/A'
print(dfg)
dfg.to_csv('new.csv')

The part I really struggle to understand is this section:

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x==1)

dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]

dfg.ix['x',('col11', 'mean')]='N/A'
print(dfg)
dfg.to_csv('new.csv')

If anyone is able to comment this script I would greatly appreciate it. Also feel free to correct or add to my comments (these are what I assume so far; they may not be correct). I'm hoping this isn't too off topic for SOF. I will gladly give a 50 point bounty to any user who can help me with this.

Solution

I'll try to explain my code, as it uses a few tricks.

  • I've called it df to give a shorthand name for a pandas DataFrame
  • I've called it dfg to mean "grouped df".
  • Let me build up the expression dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})

    • The code dfg = df[['bin','col7','col11']] says: take the columns named 'bin', 'col7' and 'col11' from my DataFrame df.
    • Now that I have the 3 columns I am interested in, I want to group by the values in the 'bin' column. This is done by dfg = df[['bin','col7','col11']].groupby('bin'). I now have groups of data, i.e. all records that are in bin #1, all records in bin #2, etc.
    • I now want to apply some aggregate functions to the records in each of my bin groups (an aggregate function is something like sum, mean or count).
    • Now I want to apply three aggregate functions to the records in each of my bins: the mean of 'col11', the number of records in each bin, and the number of records in each bin that have 'col7' equal to one. The mean is easy; numpy already has a function to calculate the mean. If I were just doing the mean of 'col11' I would write: dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean]}). The number of records is also easy; Python's built-in len function will give us the number of items in each group. So I now have dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [len]}). Now I can't think of an existing function that counts the number of ones in a numpy array (it has to work on a numpy array). I can define my own functions that work on a numpy array, hence my function count_ones.
    • Now I'll deconstruct the count_ones function. The variable x passed to the function is always going to be one-dimensional: pandas hands it the values of one group as a Series, which behaves like a 1-D numpy array here. In our specific case it will be all the 'col7' values that fall in bin #1, all the 'col7' values that fall in bin #2, etc. The code x==1 will create a boolean (True/False) array the same size as x. The entries in the boolean array will be True if the corresponding values in x are equal to 1 and False otherwise. Because Python treats True as 1, if I sum the values of my boolean array I'll get a count of the values that equal 1. Now that I have my count_ones function I apply it to 'col7' by: dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})

    • You can see that the syntax of .agg is .agg({'column_name_to_apply_to': [list_of_function_names_to_apply]})

    • With boolean arrays you can do all sorts of weird condition combinations: (x==6) | (x==3) would be 'x equal to 6 or x equal to 3'. The 'and' operator is &. Always put () around each condition.

  • Now to dfg.index = labels[dfg.index]. In dfg, because I grouped by 'bin', the index (or row label) of each row of grouped data (i.e. my dfg.index) will be my bin numbers: 0, 1, 2, and so on. labels[dfg.index] is using fancy indexing of a numpy array. labels[0] would give me the first label, labels[3] would give me the 4th label. With normal Python lists you can use slices to do labels[0:3], which would give me labels 0, 1 and 2. With numpy arrays we can go a step further and index with a list of values or another array, so labels[np.array([0,2,4])] would give me labels 0, 2 and 4. By using labels[dfg.index] I'm requesting the labels corresponding to the bin numbers. Basically I'm changing my bin number to a bin label. I could have done that to my original data but that would be thousands of rows; by doing it after the group by I'm doing it to 21 rows or so. Note that I cannot just do dfg.index = labels, as some of my bins might be empty and therefore not present in the grouped data.

  • Now the dfg.ix['x',('col11', 'mean')]='N/A' part. Remember way back when I did df.ix[df.col11 == 'x', 'col11']=-0.08; that was so all my invalid data was treated as a number and would be placed into the 1st bin. After applying the group by and aggregate functions, the mean of the 'col11' values in my first bin will be -0.08 (because all such values are -0.08). Now I know this is not correct: all values of -0.08 actually indicate that the original value was x. You can't take the mean of x, so I manually set it to N/A. That is, dfg.ix['x',('col11', 'mean')]='N/A' means: in dfg, where the index (or row) is 'x' and the column is ('col11', 'mean'), set the value to 'N/A'. The ('col11', 'mean') is, I believe, how pandas comes up with the aggregate column names, i.e. when I did .agg({'col11': [np.mean]}), to refer to the resulting aggregate column I need ('column_name', 'aggregate_function_name').
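The grouping, custom aggregate and relabelling steps described above can be tried on a few rows. This is only a sketch: the toy DataFrame, its values and the three-entry labels array are made up for illustration, and 'mean' is passed as a string (rather than np.mean) because that spelling is accepted across pandas versions.

```python
import numpy as np
import pandas as pd

def count_ones(x):
    """Aggregate function: count the values in x that equal 1."""
    return int(np.sum(x == 1))

# Hypothetical data: 'bin' holds integer bin numbers such as those
# produced by pd.cut(..., labels=False). Bin 1 is deliberately empty.
toy = pd.DataFrame({
    'bin':   [0, 0, 2, 2, 2],
    'col7':  [1, 0, 1, 1, 3],
    'col11': [-0.08, -0.08, 0.02, 0.04, 0.03],
})

# One row per bin; the columns become ('col11', 'mean'),
# ('col7', 'count_ones') and ('col7', 'len').
grouped = toy.groupby('bin').agg({'col11': ['mean'],
                                  'col7': [count_ones, len]})

# Fancy indexing: pick the label for each bin number that actually occurs.
# Assigning `labels` directly would fail here, because bin 1 has no rows.
labels = np.array(['x', '0', '0.0:0.05'])  # made-up labels for bins 0, 1, 2
grouped.index = labels[grouped.index.to_numpy()]
print(grouped)

# Boolean conditions combine with | and &, each wrapped in parentheses.
x = np.array([6, 3, 1, 6])
six_or_three = int(np.sum((x == 6) | (x == 3)))
```

The grouped result has only two rows, 'x' and '0.0:0.05', which is why the labels are looked up by bin number rather than assigned wholesale.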

The motivation for all this was: convert all the data to numbers so I can use the power of pandas, then, after processing, manually change any values that I know are garbage. Let me know if you need any more explanation.
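Note that the .ix indexer used throughout the script was deprecated in pandas 0.20 and removed in pandas 1.0, so on a current install the same pattern has to be written with .loc. The sketch below is one possible adaptation, not the original code: the inline CSV is a hypothetical two-column stand-in for data.csv, pd.to_numeric replaces the astype(float) step, and the garbage mean is stored as NaN and written out as 'N/A' through to_csv's na_rep argument instead of putting a string into a numeric column.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: col7, col11 (with 'x' marking bad rows)
raw = io.StringIO("1,0.02\n0,x\n1,0.32\n1,x\n3,0.04\n")
df = pd.read_csv(raw, header=None, names=['col7', 'col11'])

# Step 1: convert everything to numbers, sending 'x' rows to the sentinel -0.08
is_x = df['col11'] == 'x'
df['col11'] = pd.to_numeric(df['col11'], errors='coerce')  # 'x' becomes NaN
df.loc[is_x, 'col11'] = -0.08                              # .loc replaces .ix

# Step 2: bin and aggregate; bin 0 is (-0.1, -0.05], the sentinel bin
bins = np.arange(-0.1, 1.0, 0.05)
df['bin'] = pd.cut(df['col11'], bins=bins, labels=False)
summary = df.groupby('bin')['col11'].agg(['mean', 'size'])

# Step 3: after processing, blank out the value known to be garbage
summary.loc[0, 'mean'] = np.nan
csv_text = summary.to_csv(na_rep='N/A')  # NaN is written as N/A
print(csv_text)
```

Keeping the column numeric until the file is written means the rest of the table can still be used for arithmetic, while the exported CSV shows N/A for the 'x' bin.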
