pandas numpy: setting an array element with a sequence during a math operation
Problem description
I have a DataFrame named df4; you can build it with the following code:
import pandas as pd
from io import StringIO

df4s = """
contract RB BeginDate ValIssueDate EndDate Valindex0 48 46 47 49 50
2 A00118 46 19850100 19880901 99999999 50 1 2 3 7 7
3 A00118 47 19000100 19880901 19831231 47 1 2 3 7 7
5 A00118 47 19850100 19880901 99999999 50 1 2 3 7 7
6 A00253 48 19000100 19820101 19811231 47 1 2 3 7 7
7 A00253 48 19820100 19820101 19841299 47 1 2 3 7 7
8 A00253 48 19850100 19820101 99999999 50 1 2 3 7 7
9 A00253 50 19000100 19820101 19781231 47 1 2 3 7 7
10 A00253 50 19790100 19820101 19841299 47 1 2 3 7 7
11 A00253 50 19850100 19820101 99999999 50 1 2 3 7 7
"""
df4 = pd.read_csv(StringIO(df4s.strip()), sep=r'\s+',
                  dtype={"RB": int, "BeginDate": int, "EndDate": int, 'ValIssueDate': int, 'Valindex0': int})
The output is:
contract RB BeginDate ValIssueDate EndDate Valindex0 48 46 47 49 50
2 A00118 46 19850100 19880901 99999999 50 1 2 3 7 7
3 A00118 47 19000100 19880901 19831231 47 1 2 3 7 7
5 A00118 47 19850100 19880901 99999999 50 1 2 3 7 7
6 A00253 48 19000100 19820101 19811231 47 1 2 3 7 7
7 A00253 48 19820100 19820101 19841299 47 1 2 3 7 7
8 A00253 48 19850100 19820101 99999999 50 1 2 3 7 7
9 A00253 50 19000100 19820101 19781231 47 1 2 3 7 7
10 A00253 50 19790100 19820101 19841299 47 1 2 3 7 7
11 A00253 50 19850100 19820101 99999999 50 1 2 3 7 7
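I am trying to build a new column with the following logic; its values are based on the values of two existing columns:
import numpy as np

def test(RB):
    n = 1
    # Multiply together the columns labelled RB, RB+1, ..., 49
    for i in np.arange(RB, 50):
        n = n * df4[str(i)].values
    return n

vfunc = np.vectorize(test)
df4['n'] = vfunc(df4['RB'].values)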
Then I get this error:
res = array(outputs, copy=False, subok=True, dtype=otypes[0])
ValueError: setting an array element with a sequence.
Recommended answer
Recreating the dataframe (thanks for using the StringIO approach):
In [82]: df4['RB'].values
Out[82]: array([46, 47, 47, 48, 48, 48, 50, 50, 50])
In [83]: test(46)
Out[83]: array([42, 42, 42, 42, 42, 42, 42, 42, 42])
In [84]: test(50)
Out[84]: 1
In [85]: [test(i) for i in df4['RB'].values]
Out[85]:
[array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
1,
1,
1]
In [86]: vfunc=np.vectorize(test)
In [87]: vfunc(df4['RB'].values)
TypeError: only size-1 arrays can be converted to Python scalars
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-87-8db8cd5dc5ab>", line 1, in <module>
vfunc(df4['RB'].values)
File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
return self._vectorize_call(func=func, args=vargs)
File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2249, in _vectorize_call
res = asanyarray(outputs, dtype=otypes[0])
ValueError: setting an array element with a sequence.
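Note the full traceback. vectorize has trouble building the return array from this mix of array sizes: test returns a 9-element array when RB is below 50 and the scalar 1 when RB is 50. Based on a trial calculation, vectorize guesses that it should return an int dtype, and an int array cannot hold a whole array in a single element.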
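The conversion step that fails can be reproduced without vectorize. The sketch below is only an illustration: the `outputs` list is a shortened stand-in for the mixed results in Out[85], not the real data, and the exact error message depends on the NumPy version.

```python
import numpy as np

# Shortened stand-in for Out[85]: some results are arrays, one is a scalar.
outputs = [np.array([42, 42, 42]), np.array([7, 7, 7]), 1]

try:
    # Same kind of call as the last frame of the traceback above.
    np.asanyarray(outputs, dtype=int)
    error = None
except (ValueError, TypeError) as exc:
    error = exc

# Report which exception was raised; the message text varies by version.
print(type(error).__name__, error)
```

The mixed shapes cannot be packed into a regular integer array, which is why the choice of output dtype matters here.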
If we tell it to return an object dtype array, we get:
In [88]: vfunc=np.vectorize(test, otypes=['object'])
In [89]: vfunc(df4['RB'].values)
Out[89]:
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
We can assign this to a df column:
In [90]: df4['n']=_
In [91]: df4
Out[91]:
contract RB BeginDate ... 49 50 n
2 A00118 46 19850100 ... 7 7 [42, 42, 42, 42, 42, 42, 42, 42, 42]
3 A00118 47 19000100 ... 7 7 [21, 21, 21, 21, 21, 21, 21, 21, 21]
5 A00118 47 19850100 ... 7 7 [21, 21, 21, 21, 21, 21, 21, 21, 21]
6 A00253 48 19000100 ... 7 7 [7, 7, 7, 7, 7, 7, 7, 7, 7]
7 A00253 48 19820100 ... 7 7 [7, 7, 7, 7, 7, 7, 7, 7, 7]
8 A00253 48 19850100 ... 7 7 [7, 7, 7, 7, 7, 7, 7, 7, 7]
9 A00253 50 19000100 ... 7 7 1
10 A00253 50 19790100 ... 7 7 1
11 A00253 50 19850100 ... 7 7 1
We could also assign the Out[85] list directly:
df4['n']=Out[85]
The timings are about the same:
In [94]: timeit vfunc(df4['RB'].values)
211 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: timeit [test(i) for i in df4['RB'].values]
217 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
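Usually vectorize is slower, but test itself is probably slow enough that the iteration method makes little difference. Remember (reread the docs if necessary) that vectorize is not a performance tool. It does not 'compile' your function or make it run faster.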
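One way to see that vectorize is a Python-level loop is to count how often the wrapped function runs. This is a minimal sketch; `record` and `calls` are hypothetical names introduced only for this illustration.

```python
import numpy as np

calls = []

def record(x):
    # Hypothetical helper: remember every argument it is called with.
    calls.append(x)
    return x * 2

# otypes is given explicitly, so vectorize needs no trial call to guess a dtype.
doubled = np.vectorize(record, otypes=[int])(np.array([1, 2, 3]))

print(len(calls))   # one Python call per input element
print(doubled)
```

The function body runs once for each element, which is the same work the list comprehension does.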
An alternative that returns an object dtype array:
In [96]: vfunc=np.frompyfunc(test,1,1)
In [97]: vfunc(df4['RB'].values)
Out[97]:
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
In [98]: timeit vfunc(df4['RB'].values)
202 µs ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
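If the goal was a single number per row (the product of that row's own values in the columns from RB up to 49), a mask-based approach avoids the mixed-size results altogether. This reading of the question is an assumption, and the small frame below is made-up sample data that reuses the question's column labels; it is a sketch, not the answer's method.

```python
import numpy as np
import pandas as pd

# Made-up sample frame reusing the question's column labels.
df = pd.DataFrame({
    "RB": [46, 47, 48, 50],
    "46": [2, 2, 2, 2],
    "47": [3, 3, 3, 3],
    "48": [1, 1, 1, 1],
    "49": [7, 7, 7, 7],
})

cols = np.arange(46, 50)                          # column labels 46..49 as integers
block = df[[str(c) for c in cols]].to_numpy()     # shape (n_rows, 4)

# Keep a factor only where its column label is >= that row's RB; use 1 elsewhere.
mask = cols[None, :] >= df["RB"].to_numpy()[:, None]
df["n"] = np.where(mask, block, 1).prod(axis=1)

print(df["n"].tolist())
```

A row with RB equal to 50 has no columns selected, so its product is 1, matching what test(50) returns.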