Variable column name in dask assign() or apply()
Problem description
I have code that works in pandas, but I'm having trouble converting it to use dask. There is a partial solution here, but it does not allow me to use a variable as the name of the column I am creating/assigning to.
Here's the pandas code that works:
percent_cols = ['num_unique_words', 'num_words_over_6']

def find_fraction(row, col):
    return row[col] / row['num_words']

for c in percent_cols:
    df[c] = df.apply(find_fraction, col=c, axis=1)
Here's the dask code that doesn't do what I want:
data = dd.from_pandas(df, npartitions=8)
for c in percent_cols:
    data = data.assign(c = data[c] / data.num_words)
This assigns the result to a new column called c rather than modifying the value of data[c] (what I want). Creating a new column would be fine if I could have the column name be a variable. E.g., if this worked:
for c in percent_cols:
    name = c + "new"
    data = data.assign(name = data[c] / data.num_words)
For obvious reasons, Python doesn't allow an expression to the left of the = in a keyword argument, and it ignores the previous value of name.
How can I use a variable for the name of the column I am assigning to? The loop iterates far more times than I'm willing to copy/paste.
Recommended answer
This can be interpreted as a Python language question:
Question: How do I use a variable's value as the name in a keyword argument?
Answer: Use a dictionary and ** unpacking
c = 'name'
f(c=5) # 'c' is used as the keyword argument name, not what we want
f(**{c: 5}) # 'name' is used as the keyword argument name, this is great
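To make the difference concrete, here is a minimal, self-contained sketch. The function `describe` is a hypothetical stand-in for `f` that reports which keyword names it received; it is not part of dask or pandas.

```python
def describe(**kwargs):
    """Return the keyword argument names this call received, sorted."""
    return sorted(kwargs)

c = "name"

# The keyword is the literal identifier 'c'; the variable's value is ignored.
literal = describe(c=5)

# The dictionary key is evaluated first, so the keyword becomes 'name'.
dynamic = describe(**{c: 5})

# Any string expression can serve as the key, e.g. a derived column name.
suffixed = describe(**{c + "_new": 5})

print(literal, dynamic, suffixed)
```

The same unpacking works for any callable that accepts keyword arguments, which is why it carries over to .assign.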
Dask.dataframe solution
For your particular question I recommend the following:
d = {col: df[col] / df['num_words'] for col in percent_cols}
df = df.assign(**d)
Consider doing this with Pandas as well
The .assign method is available in Pandas as well and may be faster than using .apply.
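For comparison, a small pandas-only sketch shows that the column-wise .assign form gives the same values as the row-wise .apply loop from the question. The sample data is made up for illustration, and no timing is measured here.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "num_words": [10, 20, 40],
    "num_unique_words": [5, 10, 10],
    "num_words_over_6": [1, 5, 20],
})
percent_cols = ["num_unique_words", "num_words_over_6"]

def find_fraction(row, col):
    return row[col] / row["num_words"]

# Row-wise version from the question: the function runs once per row.
via_apply = df.copy()
for c in percent_cols:
    via_apply[c] = df.apply(find_fraction, col=c, axis=1)

# Column-wise version: one vectorized division per column.
via_assign = df.assign(
    **{c: df[c] / df["num_words"] for c in percent_cols}
)

print(via_assign)
```

Both frames hold the same fractions; the .assign version simply avoids calling a Python function for every row.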