Understanding this Pandas script

Question
I received this code to group data into histogram-type data. I have been attempting to understand the code in this pandas script in order to edit, manipulate and duplicate it. I have comments for the sections I understand.
Code
import numpy as np
import pandas as pd
column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6',
                'col7', 'col8', 'col9', 'col10', 'col11'] #names to be used as column labels. If no names are specified then columns can be referred to by number, e.g. df[0], df[1] etc.
df = pd.read_csv('data.csv', header=None, names=column_names) #header= None means there are no column headings in the csv file
df.loc[df.col11 == 'x', 'col11'] = -0.08 #trick so that 'x' rows will be grouped into a category >-0.1 and <=-0.05. This will allow all of col11 to be treated as numbers
bins = np.arange(-0.1, 1.0, 0.05) #bins to put col11 values in. >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values.
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]) #create labels for the bins
labels[0] = 'x' #change first bin label to 'x'
labels[1] = '0' #change second bin label to '0'
df['col11'] = df['col11'].astype(float) #convert col11 to numbers so we can do math on them
df['bin'] = pd.cut(df['col11'], bins=bins, labels=False) # make another column 'bin' and put in an integer representing which bin the number falls into. Later we'll map the integer to the bin label
df.set_index('bin', inplace=True, drop=False, append=False) #groupby is meant to run faster with an index
def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]
dfg.loc['x', ('col11', 'mean')] = 'N/A'
print(dfg)
dfg.to_csv('new.csv')
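As a sanity check on what pd.cut with labels=False does, here is a minimal sketch with made-up values standing in for col11 (the real data.csv is not shown):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for col11 values; data.csv itself is not shown
vals = pd.Series([-0.08, 0.02, 0.12])
bins = np.arange(-0.1, 1.0, 0.05)

# labels=False tells pd.cut to return the integer index of the bin each
# value falls into: (-0.1, -0.05] is bin 0, (-0.05, 0] is bin 1, and so on
idx = pd.cut(vals, bins=bins, labels=False)
print(idx.tolist())  # [0, 2, 4]
```

So -0.08 (the 'x' trick value) lands in bin 0, exactly as the comments describe.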
The part I really struggle to understand is this section:
def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]
dfg.loc['x', ('col11', 'mean')] = 'N/A'
print(dfg)
dfg.to_csv('new.csv')
If anyone is able to comment this script I would be greatly appreciative. Also feel free to correct or add to my comments (these are what I assume so far; they may not be correct). I'm hoping this isn't too off-topic for SOF. I will gladly give a 50 point bounty to any user who can help me with this.
I'll try to explain my code, as it uses a few tricks.
- I've called it df as a shorthand name for a pandas DataFrame.
- I've named the grouped result dfg, to mean 'group my df'.

Let me build up the expression
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
- The code dfg = df[['bin','col7','col11']] is saying: take the columns named 'bin', 'col7' and 'col11' from my DataFrame df.
- Now that I have the 3 columns I am interested in, I want to group by the values in the 'bin' column. This is done by dfg = df[['bin','col7','col11']].groupby('bin'). I now have groups of data, i.e. all records that are in bin #1, all records in bin #2, etc.
- I now want to apply some aggregate functions to the records in each of my bin groups (an aggregate function is something like sum, mean or count).
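A tiny frame with made-up values shows what the groupby step produces: the rows sharing one 'bin' value become one group.

```python
import pandas as pd

# Toy frame with made-up values, reusing the script's column names
df = pd.DataFrame({'bin':  [0, 0, 1, 1, 1],
                   'col7': [1, 0, 1, 1, 0]})

groups = df.groupby('bin')
# Each group holds the records sharing one 'bin' value
print(groups.size().tolist())  # [2, 3] -> two records in bin 0, three in bin 1
```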
- Now I want to apply three aggregate functions to the records in each of my bins: the mean of 'col11', the number of records in each bin, and the number of records in each bin that have 'col7' equal to one. The mean is easy; numpy already has a function to calculate the mean. If I was just doing the mean of 'col11' I would write:
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean]})
The number of records is also easy; python's len will give us the number of items in a list-like object. So I now have:
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [len]})
Now I can't think of an existing function that counts the number of ones in a numpy array (it has to work on a numpy array). I can define my own functions that work on a numpy array, hence my function count_ones.
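Under the same made-up data assumption, the mean-plus-len aggregation can be sketched like this (the string 'mean' is used here in place of np.mean; both name the same aggregate):

```python
import pandas as pd

# Toy frame with made-up values; bin 0 has two records, bin 1 has one
df = pd.DataFrame({'bin':   [0, 0, 1],
                   'col7':  [1, 0, 1],
                   'col11': [0.1, 0.3, 0.5]})

# apply 'mean' to col11 and the builtin len to col7, per bin
dfg = df[['bin', 'col7', 'col11']].groupby('bin').agg(
    {'col11': ['mean'], 'col7': [len]})
print(dfg[('col7', 'len')].tolist())  # [2, 1] -> record count per bin
```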
Now I'll deconstruct the count_ones function. The variable x passed to the function is always going to be a 1d numpy array. In our specific case it will be all the 'col7' values that fall in bin #1, all the 'col7' values that fall in bin #2, etc. The code x == 1 will create a boolean (True/False) array the same size as x; the entries in the boolean array will be True if the corresponding values in x are equal to 1, and False otherwise. Because python treats True as 1, if I sum the values of my boolean array I'll get a count of the values that equal 1. Now that I have my count_ones function I apply it to 'col7' by:
dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
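The boolean-mask counting that count_ones relies on can be seen in isolation on a small, made-up array:

```python
import numpy as np

# Hypothetical col7 values that landed in one bin
x = np.array([1, 0, 1, 1, 2])

mask = (x == 1)    # boolean array: [True, False, True, True, False]
# True is treated as 1, so summing the mask counts the ones
print(int(mask.sum()))  # 3
```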
You can see that the syntax of .agg is .agg({'column_name_to_apply_to': [list_of_functions_to_apply]}). With boolean arrays you can do all sorts of weird condition combinations: (x == 6) | (x == 3) would be 'x equal to 6 or x equal to 3'. The 'and' operator is &. Always put () around each condition.
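A quick sketch of combining conditions (the values here are arbitrary):

```python
import numpy as np

x = np.array([3, 6, 7, 3, 6])

# Parentheses around each condition are required: | binds more tightly
# than ==, so x == 6 | x == 3 would not mean what you expect
either = (x == 6) | (x == 3)
print(int(either.sum()))  # 4 of the 5 values are 6-or-3
```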
Now to dfg.index = labels[dfg.index]. In dfg, because I grouped by 'bin', the index (or row label) of each row of grouped data (i.e. my dfg.index) will be my bin numbers: 1, 2, 3, etc. labels[dfg.index] is using fancy indexing of a numpy array: labels[0] would give me the first label, labels[3] would give me the 4th label. With normal python lists you can use slices, e.g. labels[0:3], which would give me labels 0, 1 and 2. With numpy arrays we can go a step further and index with a list of values or another array, so labels[np.array([0, 2, 4])] would give me labels 0, 2 and 4. By using labels[dfg.index] I'm requesting the labels corresponding to the bin numbers; basically I'm changing my bin numbers to bin labels. I could have done that to my original data, but that would be thousands of rows; by doing it after the group by I'm doing it to 21 rows or so. Note that I cannot just do dfg.index = labels, as some of my bins might be empty and therefore not present in the grouped data.
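Fancy indexing can be tried on its own; the labels below are made up to mimic the script's bin labels:

```python
import numpy as np

# Made-up labels mimicking the script's bin labels
labels = np.array(['x', '0', '0.0:0.05', '0.05:0.1', '0.1:0.15'])

# Fancy indexing: pass an array of positions and get back just those
# labels, e.g. only the bin numbers that actually appear after the group by
picked = labels[np.array([0, 2, 4])]
print(picked.tolist())  # ['x', '0.0:0.05', '0.1:0.15']
```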
dfg.loc['x', ('col11', 'mean')] = 'N/A' is the last trick. Remember way back when I did df.loc[df.col11 == 'x', 'col11'] = -0.08; that was so all my invalid data would be treated as a number and placed into the 1st bin. After applying the group by and aggregate functions, the mean of the 'col11' values in my first bin will be -0.08 (because all such values are -0.08). Now I know this is not correct: every -0.08 actually indicates that the original value was 'x', and you can't take the mean of 'x'. So I manually set it to 'N/A'. That is, dfg.loc['x', ('col11', 'mean')] = 'N/A' means: in dfg, where the index (or row) is 'x' and the column is ('col11', 'mean'), set the value to 'N/A'. The ('col11', 'mean') tuple, I believe, is how pandas comes up with the aggregate column names, i.e. when I did .agg({'col11': [np.mean]}), to refer to the resulting aggregate column I need ('column_name', 'aggregate_function_name').
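That column naming can be verified on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame with made-up values
df = pd.DataFrame({'bin':   [0, 0, 1],
                   'col11': [0.1, 0.3, 0.5]})

dfg = df.groupby('bin').agg({'col11': ['mean']})
# .agg builds a column MultiIndex of (column_name, aggregate_function_name)
print(dfg.columns.tolist())           # [('col11', 'mean')]
print(dfg.loc[0, ('col11', 'mean')])  # the mean of 0.1 and 0.3
```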
The motivation for all this was: convert all data to numbers so I can use the power of Pandas, then after processing, manually change any values that I know are garbage. Let me know if you need any more explanation.