python - stumped by pandas conditionals and/or boolean indexing
Problem description
I am having trouble with conditionals / boolean indexing. I am trying to populate a dataframe (dfp) with logic which is conditional on data from a similarly shaped dataframe (dfs) plus the previous row of itself (dfp). This is my latest fail...
import pandas as pd
dfs = pd.DataFrame({'a':[1,0,-1,0,1,0,0,-1,0,0],'b':[0,1,0,0,-1,0,1,0,-1,0]})
In [171]: dfs
Out[171]:
   a  b
0  1  0
1  0  1
2 -1  0
3  0  0
4  1 -1
5  0  0
6  0  1
7 -1  0
8  0 -1
9  0  0
dfp = pd.DataFrame(index=dfs.index,columns=dfs.columns)
dfp[(dfs==1)|((dfp.shift(1)==1)&(dfs!=-1))] = 1
In [166]: dfp.fillna(0)
Out[166]:
     a    b
0  1.0  0.0
1  0.0  1.0
2  0.0  0.0
3  0.0  0.0
4  1.0  0.0
5  0.0  0.0
6  0.0  1.0
7  0.0  0.0
8  0.0  0.0
9  0.0  0.0
So I would like dfp to have a 1 in row n if either of two conditions is met:
1) dfs in the same row equals 1, or
2) dfp in the previous row equals 1 and dfs in the same row is not -1.
I would like my final output to look like this:
   a  b
0  1  0
1  1  1
2  0  1
3  0  1
4  1  0
5  1  0
6  1  1
7  0  1
8  0  0
9  0  0
UPDATE / Sometimes the visual is more helpful - below is how it would map out in Excel.
Thanks in advance, very grateful for your time.
Recommended answer
Let's summarize the invariants:
- If the dfs value is 1 then the dfp value is 1.
- If the dfs value is -1 then the dfp value is 0.
- If the dfs value is 0 then the dfp value is 1 if the previous dfp value is 1, otherwise it's 0.
Or to formulate it another way:
- dfp starts with 1 if the first value is 1, otherwise 0.
- The dfp values are 0 until there is a 1 in dfs.
- The dfp values are 1 until there is a -1 in dfs.
This is very easy to formulate in Python:
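import numpy as np

def create_new_column(dfs_col):
    # Output array with the same length and dtype as the input column.
    newcol = np.zeros_like(dfs_col)
    # Initial state: 1 only if the first value is 1.
    if dfs_col[0] == 1:
        last = 1
    else:
        last = 0
    for idx, val in enumerate(dfs_col):
        # A -1 switches the state off, a 1 switches it on, a 0 keeps it.
        if last == 1 and val == -1:
            last = 0
        if last == 0 and val == 1:
            last = 1
        newcol[idx] = last
    return newcol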
Test:
>>> create_new_column(dfs.a)
array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)
>>> create_new_column(dfs.b)
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
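To get the full dfp frame the question asks for, the function can be applied column by column. The following is a minimal sketch, not part of the original answer: it restates the loop in a compact form (the initial-value check is folded into the loop, which gives the same result) so the snippet runs on its own, and it wraps each result in a Series so the row labels of dfs are kept.

```python
import numpy as np
import pandas as pd

dfs = pd.DataFrame({'a': [1, 0, -1, 0, 1, 0, 0, -1, 0, 0],
                    'b': [0, 1, 0, 0, -1, 0, 1, 0, -1, 0]})

def create_new_column(dfs_col):
    # Compact restatement of the loop: 1 switches on, -1 switches off, 0 keeps state.
    values = np.asarray(dfs_col)
    newcol = np.zeros_like(values)
    last = 0
    for idx, val in enumerate(values):
        if val == 1:
            last = 1
        elif val == -1:
            last = 0
        newcol[idx] = last
    return newcol

# apply() calls the function once per column; returning a Series with the
# original index keeps the rows of dfp aligned with dfs.
dfp = dfs.apply(lambda col: pd.Series(create_new_column(col), index=col.index))
print(dfp)
```

The resulting dfp has the same index and columns as dfs, so it can be compared cell by cell with the desired output in the question.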
However, this is very inefficient in Python because iterating over NumPy arrays (and pandas Series/DataFrames) is slow, and for-loops in Python are inefficient as well.
However, if you have numba or Cython you can compile this, and it will (probably) be faster than any NumPy solution could be, because NumPy would require several rolling and/or accumulate operations.
For example, using numba:
>>> import numba
>>> numba_version = numba.njit(create_new_column) # compilation step
>>> numba_version(np.asarray(dfs.a)) # need cast to np.array
array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)
>>> numba_version(np.asarray(dfs.b)) # need cast to np.array
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
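The njit call can also be written as a decorator. The sketch below is illustrative rather than the answer's own code: the name fill_state is made up here, and the ImportError fallback is an added assumption so the snippet still runs (uncompiled) where numba is not installed. numba compiles a function the first time it is called with a given input type, so a short warm-up call keeps compilation time out of any later timing.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for portability: without numba, run the plain Python loop.
    def njit(func):
        return func

@njit
def fill_state(values):
    # 1 switches the state on, -1 switches it off, 0 keeps the previous state.
    out = np.zeros_like(values)
    last = 0
    for idx in range(values.shape[0]):
        val = values[idx]
        if val == 1:
            last = 1
        elif val == -1:
            last = 0
        out[idx] = last
    return out

col_a = np.array([1, 0, -1, 0, 1, 0, 0, -1, 0, 0])
col_b = np.array([0, 1, 0, 0, -1, 0, 1, 0, -1, 0])

fill_state(col_a[:1])        # warm-up call: triggers compilation when numba is present
result_a = fill_state(col_a)
result_b = fill_state(col_b)
print(result_a, result_b)
```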
Even if dfs has millions of rows, the numba solution will take only milliseconds:
>>> dfs = pd.DataFrame({'a':np.random.randint(-1, 2, 1000000),'b':np.random.randint(-1, 2, 1000000)})
>>> %timeit numba_version(np.asarray(dfs.b))
100 loops, best of 3: 9.37 ms per loop
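For comparison, the second formulation of the invariants (the state only changes at a 1 or a -1) can also be sketched without an explicit loop by using a forward fill. This is an alternative technique, not the answer's method, and how it compares in speed with the compiled loop is not measured here.

```python
import pandas as pd

dfs = pd.DataFrame({'a': [1, 0, -1, 0, 1, 0, 0, -1, 0, 0],
                    'b': [0, 1, 0, 0, -1, 0, 1, 0, -1, 0]})

# Blank out the zeros (they carry no new information); they become NaN.
signals = dfs.where(dfs != 0)
# Carry the most recent 1 or -1 down each column.
state = signals.ffill()
# A cell is "on" when the last signal seen was 1; leading NaN compares as False.
dfp = state.eq(1).astype(int)
print(dfp)
```

Rows before the first nonzero value stay NaN after the forward fill, and NaN is never equal to 1, so those cells come out as 0 without a separate fill step.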