How to create a new column in a dataframe as a function of other columns and conditionals, without iterating over the rows with a for loop?
Question
I have a relatively large data frame (8737 rows and 16 columns of mixed types: strings, integers, booleans, etc.) and I want to create a new column based on an equation and some conditionals. Basically, I want to iterate over one particular column, take its values and, after some multiplications and sums, produce a new value, which I then check against some conditions (>= or < a set value). If the value satisfies the conditions I keep the output of the calculation; otherwise I assign a fixed value.
I am doing that by looping over the entire dataset with a for loop, which takes a huge amount of time. I am quite new to Python and couldn't find a solution to a similar problem online, other than ways of altering existing columns without a for loop.
Let's say, for the sake of simplicity, that I have this data frame called df_test:
           A         B         C          D    S
0   0.001568  0.321316 -0.269841   3.232037  5.0
1   1.926186 -1.111863 -0.387165   5.541699  NaN
2   2.110923 -0.403940 -0.029895  -9.688968  NaN
3   0.609391  1.697205 -1.827488  -1.273713  NaN
4  -0.577739  0.394475 -1.524400  16.505185  NaN
5   0.456884 -1.238733  0.453586  -4.868735  NaN
where S is the column I need to calculate, starting from a set value. Each subsequent value of S should be the previous value of S plus a calculation, like so:
df_test.S[1]=df_test.S[0]+df_test.D[1]*abs(df_test.C[1])*0.5
This value should then be checked against a conditional. If it is greater than or equal to, for example, 10, then assign 10 to it (instead of the calculated value), and if it is less than or equal to 5, then assign 5 to it.
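A single step of this rule can be written out as a small function. This is only a sketch of the update described above: the name `next_s` and its `lower`/`upper` parameters are made up for illustration and do not appear in the question.

```python
def next_s(prev_s, d, c, lower=5.0, upper=10.0):
    """One step of the update (hypothetical helper, not from the question).

    Adds 0.5 * d * |c| to the previous S, then clamps the result
    to the range [lower, upper].
    """
    candidate = prev_s + d * abs(c) * 0.5
    return min(max(candidate, lower), upper)


# Row 1 of df_test: S[0] = 5.0, D[1] = 5.541699, C[1] = -0.387165
s1 = next_s(5.0, 5.541699, -0.387165)
print(s1)
```

Because each call needs the previous result as `prev_s`, the value assigned to one row depends on the clamped value of the row before it.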
I use a for loop over the data set and run the equation I need for every element. Basically it works like this:
for i in range(1, df_test.shape[0]):
    df_test.S[i] = df_test.S[i-1] + df_test.D[i] * abs(df_test.C[i]) * 0.5
    if df_test.S[i] < 5:
        df_test.S[i] = 5
    elif df_test.S[i] > 10:
        df_test.S[i] = 10
For 8737 rows, this code takes around 20 minutes to complete.
If you need any clarifications, please ask me. Thank you in advance.
Recommended answer
You can do that really easily in two steps:
df.loc[1:, 'S'] = df.loc[1:, "D"] * 0.5 * df.loc[1:, "C"].abs() # Computes the numerical expression you want
df["S"] = df["S"].cumsum() # Add the previous to the current item of S
# Then compute your `if` condition
df.loc[df["S"] < 5, 'S'] = 5
df.loc[df["S"] > 10, 'S'] = 10
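==> No for loop.
Note that clipping after the cumulative sum matches the original loop only while the running total stays between 5 and 10: the loop feeds each clamped value into the next row, whereas cumsum keeps adding to the unclamped total.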
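If the clamped value has to carry forward into the next row, as it does in the question's loop, one possible approach is to compute the per-row increments in a vectorized way and then run the recurrence over a plain NumPy array instead of indexing into the DataFrame on every step. The sketch below still contains a loop, but per-element pandas indexing is likely where most of the original running time goes. The helper name `clamped_running_sum` is an assumption made for illustration, and the sample frame only reproduces the C and D columns from the question.

```python
import numpy as np
import pandas as pd


def clamped_running_sum(increments, start, lower=5.0, upper=10.0):
    """Running sum that is clamped to [lower, upper] after every step.

    Hypothetical helper: returns an array one element longer than
    `increments`, whose first element is `start`.
    """
    out = np.empty(len(increments) + 1)
    out[0] = start
    for i, step in enumerate(increments, start=1):
        # The clamped previous value, not the raw sum, feeds the next step.
        out[i] = min(max(out[i - 1] + step, lower), upper)
    return out


# Sample data: columns C and D from the question's df_test.
df_test = pd.DataFrame({
    "C": [-0.269841, -0.387165, -0.029895, -1.827488, -1.524400, 0.453586],
    "D": [3.232037, 5.541699, -9.688968, -1.273713, 16.505185, -4.868735],
})

# Vectorized part: the increment 0.5 * D * |C| for rows 1..n-1.
steps = (df_test["D"] * df_test["C"].abs() * 0.5).to_numpy()[1:]

# Sequential part: the recurrence itself, starting from S[0] = 5.0.
df_test["S"] = clamped_running_sum(steps, start=5.0)
print(df_test)
```

On these sample rows, row 3 is clamped up to 5 and row 4 is clamped down to 10, so the last row ends up below 10, whereas clipping after a plain cumsum would leave it at 10.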