在pandas数据框上执行复杂搜索的最快方法 [英] Fastest way to perform complex search on pandas dataframe

查看:73
本文介绍了在pandas数据框上执行复杂搜索的最快方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试找出对熊猫数据框执行搜索和排序的最快方法.以下是我要完成的数据框之前和之后的数据框.

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.

之前:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       DEF       123     456     8000    8000
   DEF       XYZ       456     893     9999    9999
   AAA       BBB       473     917     5555    5555
   BBB       CCC       917     341     5555    5555

搜索/排序后:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       XYZ       123     893     8000    9999
   AAA       CCC       473     341     5555    5555

在此示例中,我实质上是在尝试过滤掉最终目的地之间存在的航班".应该使用某种类型的dropplicate方法完成此操作,但让我感到困惑的是如何处理所有列.二进制搜索将是实现此目标的最佳方法吗?提示表示赞赏,并努力弄清这一点.

In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.

可能的边缘情况:

如果数据被交换并且我们的终端连接在同一列怎么办?

What if the data is switched up and our end connections are in the same column?

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       DEF       123     456     8000    8000
   XYZ       DEF       893     456     9999    9999

搜索/排序后:

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       XYZ       123     893     8000    9999

从逻辑上讲这种情况不应该发生.毕竟您怎么能去DEF-ABC和DEF-XYZ?您不能,但是端点"仍然是ABC-XYZ

This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ

推荐答案

这是网络问题,因此我们使用networkx,请注意,这里可以有两个以上的停靠点,这意味着可以有类似

This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d={k: v for d in L for k, v in d.items()}

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
         flightTo flightFrom  toNum  fromNum  toCode  fromCode
flightTo                                                      
0             ABC        XYZ    123      893    8000      9999
1             AAA        CCC    473      341    5555      5555

安装networkx

Installation networkx

  • :pip install networkx
  • Anaconda :conda install -c anaconda networkx
  • Pip: pip install networkx
  • Anaconda: conda install -c anaconda networkx

这篇关于在pandas数据框上执行复杂搜索的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆