Understanding this Pandas script

Question
I received this code to group data into histogram-type data. I have been attempting to understand the code in this pandas script in order to edit, manipulate and duplicate it. I have comments for the sections I understand.
Code
import numpy as np
import pandas as pd
column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6',
                'col7', 'col8', 'col9', 'col10', 'col11'] #names to be used as column labels. If no names are specified then columns can be referred to by number, e.g. df[0], df[1] etc.
df = pd.read_csv('data.csv', header=None, names=column_names) #header= None means there are no column headings in the csv file
df.loc[df.col11 == 'x', 'col11'] = -0.08 #trick so that 'x' rows will be grouped into a category >-0.1 and <=-0.05. This will allow all of col11 to be treated as numbers
bins = np.arange(-0.1, 1.0, 0.05) #bins to put col11 values in. >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values.
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]) #create labels for the bins
labels[0] = 'x' #change first bin label to 'x'
labels[1] = '0' #change second bin label to '0'
df['col11'] = df['col11'].astype(float) #convert col11 to numbers so we can do math on them
df['bin'] = pd.cut(df['col11'], bins=bins, labels=False) # make another column 'bin' and put in an integer representing which bin the number falls into. Later we'll map the integer to the bin label
df.set_index('bin', inplace=True, drop=False, append=False) #groupby is meant to run faster with an index
def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]
dfg.loc['x', ('col11', 'mean')] = 'N/A'
print(dfg)
dfg.to_csv('new.csv')
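As a sanity check on what pd.cut with labels=False does, here is a minimal sketch with made-up values standing in for col11 (the real data.csv is not shown):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for col11 values; data.csv itself is not shown
vals = pd.Series([-0.08, 0.02, 0.12])
bins = np.arange(-0.1, 1.0, 0.05)

# labels=False tells pd.cut to return the integer index of the bin each
# value falls into: (-0.1, -0.05] is bin 0, (-0.05, 0] is bin 1, and so on
idx = pd.cut(vals, bins=bins, labels=False)
print(idx.tolist())  # [0, 2, 4]
```

So -0.08 (the 'x' trick value) lands in bin 0, exactly as the comments describe.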
The part I really struggle to understand is this section:
def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]
dfg.loc['x', ('col11', 'mean')] = 'N/A'
print(dfg)
dfg.to_csv('new.csv')
If anyone is able to comment this script I would be greatly appreciative. Also feel free to correct or add to my comments (these are what I assume so far; they may not be correct). I'm hoping this isn't too off-topic for SOF. I will gladly give a 50 point bounty to any user who can help me with this.
I'll try to explain my code, as it uses a few tricks.
- I've called it df as a shorthand name for a pandas DataFrame.
- I've named the grouped result dfg, to mean 'group my df'.

Let me build up the expression
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
- The code dfg = df[['bin','col7','col11']] is saying: take the columns named 'bin', 'col7' and 'col11' from my DataFrame df.
- Now that I have the 3 columns I am interested in, I want to group by the values in the 'bin' column. This is done by dfg = df[['bin','col7','col11']].groupby('bin'). I now have groups of data, i.e. all records that are in bin #1, all records in bin #2, etc.
- I now want to apply some aggregate functions to the records in each of my bin groups (an aggregate function is something like sum, mean or count).
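A tiny frame with made-up values shows what the groupby step produces: the rows sharing one 'bin' value become one group.

```python
import pandas as pd

# Toy frame with made-up values, reusing the script's column names
df = pd.DataFrame({'bin':  [0, 0, 1, 1, 1],
                   'col7': [1, 0, 1, 1, 0]})

groups = df.groupby('bin')
# Each group holds the records sharing one 'bin' value
print(groups.size().tolist())  # [2, 3] -> two records in bin 0, three in bin 1
```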
- Now I want to apply three aggregate functions to the records in each of my bins: the mean of 'col11', the number of records in each bin, and the number of records in each bin that have 'col7' equal to one. The mean is easy; numpy already has a function to calculate the mean. If I was just doing the mean of 'col11' I would write:
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean]})
The number of records is also easy; python's len will give us the number of items in a list-like object. So I now have:
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [len]})
Now I can't think of an existing function that counts the number of ones in a numpy array (it has to work on a numpy array). I can define my own functions that work on a numpy array, hence my function count_ones.
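Under the same made-up data assumption, the mean-plus-len aggregation can be sketched like this (the string 'mean' is used here in place of np.mean; both name the same aggregate):

```python
import pandas as pd

# Toy frame with made-up values; bin 0 has two records, bin 1 has one
df = pd.DataFrame({'bin':   [0, 0, 1],
                   'col7':  [1, 0, 1],
                   'col11': [0.1, 0.3, 0.5]})

# apply 'mean' to col11 and the builtin len to col7, per bin
dfg = df[['bin', 'col7', 'col11']].groupby('bin').agg(
    {'col11': ['mean'], 'col7': [len]})
print(dfg[('col7', 'len')].tolist())  # [2, 1] -> record count per bin
```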
Now I'll deconstruct the count_ones function. The variable x passed to the function is always going to be a 1d numpy array. In our specific case it will be all the 'col7' values that fall in bin #1, all the 'col7' values that fall in bin #2, etc. The code x == 1 will create a boolean (True/False) array the same size as x; the entries in the boolean array will be True if the corresponding values in x are equal to 1, and False otherwise. Because python treats True as 1, if I sum the values of my boolean array I'll get a count of the values that equal 1. Now that I have my count_ones function I apply it to 'col7' by:
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
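The boolean-mask counting that count_ones relies on can be seen in isolation on a small, made-up array:

```python
import numpy as np

# Hypothetical col7 values that landed in one bin
x = np.array([1, 0, 1, 1, 2])

mask = (x == 1)    # boolean array: [True, False, True, True, False]
# True is treated as 1, so summing the mask counts the ones
print(int(mask.sum()))  # 3
```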
You can see that the syntax of .agg is .agg({'column_name_to_apply_to': [list_of_functions_to_apply]}). With boolean arrays you can do all sorts of weird condition combinations: (x == 6) | (x == 3) would be 'x equal to 6 or x equal to 3'. The 'and' operator is &. Always put () around each condition.
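A quick sketch of combining conditions (the values here are arbitrary):

```python
import numpy as np

x = np.array([3, 6, 7, 3, 6])

# Parentheses around each condition are required: | binds more tightly
# than ==, so x == 6 | x == 3 would not mean what you expect
either = (x == 6) | (x == 3)
print(int(either.sum()))  # 4 of the 5 values are 6-or-3
```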
Now to dfg.index = labels[dfg.index]. In dfg, because I grouped by 'bin', the index (or row label) of each row of grouped data (i.e. my dfg.index) will be my bin numbers: 1, 2, 3, etc. labels[dfg.index] is using fancy indexing of a numpy array: labels[0] would give me the first label, labels[3] would give me the 4th label. With normal python lists you can use slices, e.g. labels[0:3], which would give me labels 0, 1 and 2. With numpy arrays we can go a step further and index with a list of values or another array, so labels[np.array([0, 2, 4])] would give me labels 0, 2 and 4. By using labels[dfg.index] I'm requesting the labels corresponding to the bin numbers; basically I'm changing my bin numbers to bin labels. I could have done that to my original data, but that would be thousands of rows; by doing it after the group by I'm doing it to 21 rows or so. Note that I cannot just do dfg.index = labels, as some of my bins might be empty and therefore not present in the grouped data.
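Fancy indexing can be tried on its own; the labels below are made up to mimic the script's bin labels:

```python
import numpy as np

# Made-up labels mimicking the script's bin labels
labels = np.array(['x', '0', '0.0:0.05', '0.05:0.1', '0.1:0.15'])

# Fancy indexing: pass an array of positions and get back just those
# labels, e.g. only the bin numbers that actually appear after the group by
picked = labels[np.array([0, 2, 4])]
print(picked.tolist())  # ['x', '0.0:0.05', '0.1:0.15']
```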
dfg.loc['x', ('col11', 'mean')] = 'N/A' is the last trick. Remember way back when I did df.loc[df.col11 == 'x', 'col11'] = -0.08; that was so all my invalid data would be treated as a number and placed into the 1st bin. After applying the group by and aggregate functions, the mean of the 'col11' values in my first bin will be -0.08 (because all such values are -0.08). Now I know this is not correct: every -0.08 actually indicates that the original value was 'x', and you can't take the mean of 'x'. So I manually set it to 'N/A'. That is, dfg.loc['x', ('col11', 'mean')] = 'N/A' means: in dfg, where the index (or row) is 'x' and the column is ('col11', 'mean'), set the value to 'N/A'. The ('col11', 'mean') tuple, I believe, is how pandas comes up with the aggregate column names, i.e. when I did .agg({'col11': [np.mean]}), to refer to the resulting aggregate column I need ('column_name', 'aggregate_function_name').
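That column naming can be verified on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame with made-up values
df = pd.DataFrame({'bin':   [0, 0, 1],
                   'col11': [0.1, 0.3, 0.5]})

dfg = df.groupby('bin').agg({'col11': ['mean']})
# .agg builds a column MultiIndex of (column_name, aggregate_function_name)
print(dfg.columns.tolist())           # [('col11', 'mean')]
print(dfg.loc[0, ('col11', 'mean')])  # the mean of 0.1 and 0.3
```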
The motivation for all this was: convert all data to numbers so I can use the power of Pandas, then after processing, manually change any values that I know are garbage. Let me know if you need any more explanation.