How to use generators in Pandas


Question

I'm learning to use generators but don't quite understand how they work.

What I want to do is iterate over the rows, multiply one cell by another cell in each row, and then create a new column with the results.

rate = (df['Fee'][i] for df['Fee'] in df / df['Costs'][i] for df['Costs'] in df * 100)

df['rate']=df.iterrows(rate)

So above, I've tried to make a generator that calculates the fee as a percentage of the costs.

I realise this would be much easier with a for loop, but I wanted to learn how a generator would be used in this instance.

Sample DataFrame below.

          Industry  Expr1        Fee        Costs
      Food & Drink   June   9970.320    116171.15
    Music Industry   June   7255.534    131492.59
     Manufacturing   June   5278.960    171315.01
    Music Industry   June   6120.596    143688.78
Telecommunications  April   4123.986     78733.09

Recommended answer

The succinct answer is "You don't". Or, as the Pandas documentation puts it:

When doing data analysis, as with raw NumPy arrays looping through Series value-by-value is usually not necessary. Series can also be passed into most NumPy methods expecting an ndarray.

This also applies to DataFrames and many other structures that build on ndarray. For more insight, I would really recommend learning more about how pandas, NumPy, and SciPy work internally.

Regarding this particular topic, I would point you to "Intro to Data Structures: Data Alignment and Arithmetic" in the Pandas documentation and "Broadcasting" in the NumPy documentation.
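As a minimal sketch of what those two topics mean for the question, the snippet below builds the rate column without any explicit loop. The DataFrame reuses three rows of the sample data; the column name `rate` is taken from the question, and everything else is illustrative.

```python
import pandas as pd

# Three rows copied from the sample data in the question
df = pd.DataFrame({
    "Industry": ["Food & Drink", "Music Industry", "Manufacturing"],
    "Fee": [9970.320, 7255.534, 5278.960],
    "Costs": [116171.15, 131492.59, 171315.01],
})

# Series / Series divides element-wise, matching rows by index label.
# Multiplying by the scalar 100 applies it to every element.
df["rate"] = df["Fee"] / df["Costs"] * 100

print(df[["Industry", "rate"]])
```

Both operands are whole columns, so the arithmetic runs inside NumPy rather than in a Python-level loop.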

Behind the scenes, these packages use a lot of C code to optimize operations. While generators and iterators are great, they will never be able to match such optimized code. For example, given the problem in your question, here is a simple test.

np.all((df.Fee / df.Costs).values == np.array([x / y for x, y in df[['Fee', 'Costs']].values]))
True

%timeit (df.Fee / df.Costs).values
78.5 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.array([x / y for x, y in df[['Fee', 'Costs']].values])
331 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As you can see, the built-in division used internally by Pandas is roughly 4x faster (78.5 µs versus 331 µs). And that is on a terribly small sample size.
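For completeness, here is a hedged sketch of how a generator could produce the column, since that is what the question set out to learn. It is shown for comparison only, not as the recommended approach. The column names `rate_gen` and `rate_vec` are made up for this illustration, and the data is a subset of the sample rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Fee": [9970.320, 7255.534, 5278.960],
    "Costs": [116171.15, 131492.59, 171315.01],
})

# A generator expression: nothing is computed until it is consumed.
# zip pairs up the two columns value by value.
rate_gen = (fee / cost * 100 for fee, cost in zip(df["Fee"], df["Costs"]))

# Consume the generator into a NumPy array, then assign it as a column.
df["rate_gen"] = np.fromiter(rate_gen, dtype=float, count=len(df))

# The vectorized equivalent, for comparison.
df["rate_vec"] = df["Fee"] / df["Costs"] * 100

print(np.allclose(df["rate_gen"], df["rate_vec"]))
```

A generator can only be consumed once, so `rate_gen` is exhausted after the `np.fromiter` call. The two columns hold the same values; the difference is that the generator version does its arithmetic one row at a time in Python.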
