How to use generators in Pandas
Problem description
I'm learning to use generators but don't quite understand how they work.
What I want to do is iterate over rows and multiply a cell by another cell in each row, then create a new column with the results.
rate = (df['Fee'][i] for df['Fee'] in df / df['Costs'][i] for df['Costs'] in df * 100)
df['rate']=df.iterrows(rate)
So above, I've tried to make a generator which calculates what percentage the fee is of the costs.
I realise this would be much easier with a for loop but I wanted to learn how a generator would be used in this instance.
Sample data frame below.
Industry            Expr1  Fee       Costs
Food & Drink        June   9970.320  116171.15
Music Industry      June   7255.534  131492.59
Manufacturing       June   5278.960  171315.01
Music Industry      June   6120.596  143688.78
Telecommunications  April  4123.986  78733.09
Recommended answer
The succinct answer is "You don't". Or as the Pandas documentation puts it:
When doing data analysis, as with raw NumPy arrays looping through Series value-by-value is usually not necessary. Series can also be passed into most NumPy methods expecting an ndarray.
This also applies to DataFrames and many other structures that leverage ndarray. For more insight I would really recommend learning more about how pandas/NumPy/SciPy work internally.
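As a small illustration of the quoted point, the sketch below passes a Series straight into NumPy functions. The values are the first three fees from the question; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# First three Fee values from the question's sample data.
fee = pd.Series([9970.320, 7255.534, 5278.960], name="Fee")

# A NumPy ufunc accepts the Series directly and returns a Series
# that keeps the original index.
log_fee = np.log(fee)

# Reductions accept a Series as well and return a scalar.
mean_fee = np.mean(fee)

print(log_fee)
print(mean_fee)
```

No explicit loop is written anywhere; the iteration happens inside NumPy's compiled code.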
Regarding this particular topic, I would point you to "Intro to Data Structures: Data Alignment and Arithmetic" in the Pandas documentation and "Broadcasting" in the NumPy documentation.
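Applied to the question, a minimal sketch of the vectorized approach might look like this. The frame is rebuilt by hand from the sample data, and the column name rate follows the question. The shuffled copy of Costs is an added illustration of index alignment and not part of the original problem.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand.
df = pd.DataFrame({
    "Industry": ["Food & Drink", "Music Industry", "Manufacturing",
                 "Music Industry", "Telecommunications"],
    "Expr1": ["June", "June", "June", "June", "April"],
    "Fee": [9970.320, 7255.534, 5278.960, 6120.596, 4123.986],
    "Costs": [116171.15, 131492.59, 171315.01, 143688.78, 78733.09],
})

# Element-wise arithmetic on whole columns; no explicit loop is needed.
df["rate"] = df["Fee"] / df["Costs"] * 100

# Data alignment: operands are matched by index label, not by position,
# so a reordered Costs column still pairs each fee with its own cost.
shuffled_costs = df["Costs"].sample(frac=1, random_state=0)
aligned = (df["Fee"] / shuffled_costs * 100).sort_index()

print(df[["Industry", "Fee", "Costs", "rate"]])
```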
Behind the scenes, these packages use a lot of C code to optimize operations. Generators and iterators are great, but for element-wise arithmetic like this they cannot match such optimized code. For example, here is a simple test based on your problem.
np.all((df.Fee / df.Costs).values == np.array([x / y for x, y in df[['Fee', 'Costs']].values]))
True
%timeit (df.Fee / df.Costs).values
78.5 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([x / y for x, y in df[['Fee', 'Costs']].values])
331 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As you can see, the built-in division that Pandas uses internally is roughly 4x faster (78.5 µs versus 331 µs), and that is on a terribly small sample size.
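For completeness, if the goal is only to see how a generator could fit here, one possible sketch follows. It assumes the Fee and Costs columns from the question and uses the built-in zip to pair the values; the variable names are illustrative. The vectorized form is computed alongside for comparison.

```python
import pandas as pd

# First three rows of the question's Fee and Costs columns.
df = pd.DataFrame({
    "Fee": [9970.320, 7255.534, 5278.960],
    "Costs": [116171.15, 131492.59, 171315.01],
})

# A generator expression: it computes one percentage per row, lazily,
# only when something asks for the next value.
rate = (fee / cost * 100 for fee, cost in zip(df["Fee"], df["Costs"]))

# Consume the generator into a list before assigning it as a column.
df["rate"] = list(rate)

# The vectorized equivalent, which is the form the answer recommends.
vectorized = df["Fee"] / df["Costs"] * 100

print(df)
```

A generator can be consumed only once, so after the list call above rate is exhausted and would yield nothing if iterated again.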