在pandas数据框上执行复杂搜索的最快方法 [英] Fastest way to perform complex search on pandas dataframe
问题描述
我正在尝试找出对熊猫数据框执行搜索和排序的最快方法.以下是我要完成的数据框之前和之后的数据框.
I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.
之前:
flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555
搜索/排序后:
flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555
在此示例中,我实质上是在尝试过滤掉最终目的地之间存在的航班".应该使用某种类型的dropplicate方法完成此操作,但让我感到困惑的是如何处理所有列.二进制搜索将是实现此目标的最佳方法吗?提示表示赞赏,并努力弄清这一点.
In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.
可能的边缘情况:
如果数据被交换并且我们的终端连接在同一列怎么办?
What if the data is switched up and our end connections are in the same column?
flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999
搜索/排序后:
flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999
从逻辑上讲这种情况不应该发生.毕竟您怎么能去DEF-ABC和DEF-XYZ?您不能,但是端点"仍然是ABC-XYZ
This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ
推荐答案
这是网络问题,因此我们使用networkx
,请注意,这里可以有两个以上的停靠点,这意味着可以有类似
This is network problem , so we using networkx
, notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC
import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
# create the nx object from pandas dataframe
l=list(nx.connected_components(G))
# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]
# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others
d={k: v for d in L for k, v in d.items()}
# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output
Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555
安装networkx
Installation networkx
- 点:
pip install networkx
- Anaconda :
conda install -c anaconda networkx
- Pip:
pip install networkx
- Anaconda:
conda install -c anaconda networkx
这篇关于在pandas数据框上执行复杂搜索的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!