Is there an efficient way to select multiple rows in a large pandas data frame?


Problem description

I am working on a large pandas dataframe with about 100 million rows and 2 columns. I want to iterate over the dataframe and efficiently set a third column depending on the values of col1 and col2. This is what I am currently doing:

df['col3'] = 0
for idx, row in df.iterrows():
    val1 = row['col1']
    val2 = row['col2']
    # Rows that form the reverse edge (val2 -> val1) of the current row
    mask = (df.col1 == val2) & (df.col2 == val1)
    if mask.any():
        df.loc[mask, 'col3'] = 1

Example:
    df = pd.DataFrame({'col1':[0,1,2,3,4,11], 'col2':[10,11,12,4,3,0]})
    >> df
        col1 col2
     0  0    10
     1  1    11
     2  2    12
     3  3    4
     4  4    3
     5  11   0
    I want to add 'col3' such that rows 3 and 4 of the third column are
    1. Think of it as a reverse_edge column which is 1 when, for a row
    (val1, val2) in col1, col2, there is also a row (val2, val1) in col1, col2:
        col1    col2    col3
      0 0        10      0
      1 1        11      0
      2 2        12      0
      3 3        4       1
      4 4        3       1
      5 11       0       0

What is the most efficient way to do this computation? It currently takes me hours to traverse the entire dataframe.

Think of each value in col1 and the corresponding value in col2 as an edge in a graph (val1 -> val2). I want to know whether the reverse edge (val2 -> val1) exists.

Recommended answer

Use:

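import numpy as np
import pandas as pd

# Sort each (col1, col2) pair so that an edge and its reverse become identical rows
df1 = pd.DataFrame(np.sort(df[['col1', 'col2']], axis=1), index=df.index)
# Flag every row whose sorted pair occurs more than once
df['col3'] = df1.duplicated(keep=False).astype(int)
print (df)
   col1  col2  col3
0     0    10     0
1     1    11     0
2     2    12     0
3     3     4     1
4     4     3     1
5    11     0     0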
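
One thing to be aware of with the sorted-pair approach: a row that simply repeats the same edge, for example (1, 2) appearing twice with no (2, 1) row, is also flagged, because the two sorted pairs are identical. If repeated edges can occur in the data, a direct membership test may be safer. The following is only a sketch of that idea, not part of the original answer; `reverse_edge_flags` is a hypothetical helper name, and it assumes pandas' `MultiIndex.from_arrays` and `MultiIndex.isin`.

```python
import pandas as pd


def reverse_edge_flags(frame):
    """Hypothetical helper: 1 where the row's reverse edge (col2, col1)
    also occurs somewhere in the frame as a (col1, col2) row, else 0."""
    edges = pd.MultiIndex.from_arrays([frame['col1'], frame['col2']])
    reversed_edges = pd.MultiIndex.from_arrays([frame['col2'], frame['col1']])
    # isin returns a boolean NumPy array with one entry per row
    return reversed_edges.isin(edges).astype(int)


df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 11], 'col2': [10, 11, 12, 4, 3, 0]})
df['col3'] = reverse_edge_flags(df)
print(df)
```

Note that under this definition a self-loop such as (5, 5) counts as its own reverse edge and is flagged with 1; filter those rows separately if that is not wanted.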

Another solution uses merge: join the dataframe with itself, compare the two column subsets as 2D arrays, and finally use np.all to check whether all values are True in each row:

# Self-join: match each row's col2 against col1 of the other rows
df2 = df.merge(df, how='left', left_on='col2', right_on='col1')

df['col3'] = ((df2[['col1_x','col2_x']].values == 
               df2[['col2_y','col1_y']].values).all(axis=1).astype(int))
# With pandas 0.24+, .to_numpy() can be used instead of .values:
# https://stackoverflow.com/a/54508052
# df['col3'] = ((df2[['col1_x','col2_x']].to_numpy() ==
#                df2[['col2_y','col1_y']].to_numpy()).all(axis=1).astype(int))
print (df)
   col1  col2  col3
0     0    10     0
1     1    11     0
2     2    12     0
3     3     4     1
4     4     3     1
5    11     0     0

The element-wise comparison, before .all(axis=1) is applied, looks like this:

print ((df2[['col1_x','col2_x']].values == df2[['col2_y','col1_y']].values))

[[False False]
 [False  True]
 [False False]
 [ True  True]
 [ True  True]
 [False  True]]
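
The self-merge above works on the sample data because every col2 value matches at most one col1 value. When a col1 value repeats, the left merge can return more rows than df has, and the result no longer lines up with df. A possible variation, sketched here under the assumption that merging on both columns against a de-duplicated, column-swapped copy is acceptable, uses the `indicator=True` option of `merge`; the names `reverse` and `merged` are illustrative only.

```python
import pandas as pd

# Sample data with a repeated col1 value (3) to exercise the one-to-many case
df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 11, 3],
                   'col2': [10, 11, 12, 4, 3, 0, 7]})

# Swap the column names so each row of `reverse` is a reversed edge;
# drop duplicates so every row of df matches at most one row of `reverse`
reverse = (df[['col1', 'col2']]
           .drop_duplicates()
           .rename(columns={'col1': 'col2', 'col2': 'col1'}))

# The left merge keeps one output row per row of df, in the original order;
# the _merge column is 'both' where the reversed edge exists
merged = df.merge(reverse, on=['col1', 'col2'], how='left', indicator=True)
df['col3'] = (merged['_merge'] == 'both').astype(int).to_numpy()
print(df)
```

Because the merge key is the full (col1, col2) pair, no row-wise array comparison is needed afterwards.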
