How to filter a dataframe based on IP ranges
Question
I have a dataframe with 2 columns. I want to filter this dataframe based on the IP ranges present in a JSON file.
ip_ranges.json
[
{"start": "45.43.144.0", "end": "45.43.161.255"},
{"start": "104.222.130.0", "end": "104.222.191.255"},
...
]
Dataframe:
ip,p_value
97.98.173.96,3.7
73.83.192.21,6.9
...
Note: ip_ranges.json contains 100k elements and my dataframe has 300k rows.
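Currently, I have implemented it like this:
- Created a Python list holding every IP in a range, e.g. ["45.43.144.0", "45.43.144.1", "45.43.144.2", ..., "45.43.161.255"], and did the same for all IP ranges.
- Removed duplicate elements from this list.
- Constructed a dataframe from this list.
- Merged the two dataframes on "ip".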
This process works fine for a small set of ip_ranges, but for a large set it takes a long time to complete.
Is there a better approach that does this more efficiently?
Recommended answer
Just an idea: put your ranges into a dataframe ip_range with columns From and To. Convert all IP addresses (the ones in df, too) to decimal numbers with fast conversion code (the original answer links to an example).
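The code the answer linked to is not reproduced here, so the following is only a minimal sketch of one way to do the conversion with pandas. The helper name ip_to_int and the column name ip_int are assumptions made for illustration; the sketch also assumes every value is a well-formed dotted-quad IPv4 string.

```python
import numpy as np
import pandas as pd


def ip_to_int(ips: pd.Series) -> pd.Series:
    """Convert dotted-quad IPv4 strings to integers (hypothetical helper)."""
    # Split "a.b.c.d" into four columns labelled 0..3 and parse them as integers.
    octets = ips.str.split(".", expand=True).astype(np.int64)
    # Weight each octet by its position: a*256^3 + b*256^2 + c*256 + d.
    return octets[0] * 256**3 + octets[1] * 256**2 + octets[2] * 256 + octets[3]


# Sample rows taken from the question.
df = pd.DataFrame({"ip": ["97.98.173.96", "73.83.192.21"], "p_value": [3.7, 6.9]})
df["ip_int"] = ip_to_int(df["ip"])
```

The same helper could be applied to the start and end columns loaded from ip_ranges.json to obtain From and To. For single addresses, the standard library offers int(ipaddress.IPv4Address(s)), which validates the input but is not vectorized.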
Now generating the ranges can be done fast:
ip_range['Rng'] = ip_range.apply(lambda x: np.arange(x.From, x.To+1), axis=1)
These ranges can be converted into a DataFrame:
ips = pd.DataFrame(itertools.chain(*ip_range['Rng']))
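As a self-contained variant of the two steps above, the sketch below builds the ranges with a list comprehension and joins them with np.concatenate instead of itertools.chain. The integer bounds are made-up demo values and the column name ip_int is an assumption; naming the column gives the later merge an explicit key.

```python
import numpy as np
import pandas as pd

# Integer bounds for two tiny illustrative ranges (demo values, not real IPs).
ip_range = pd.DataFrame({"From": [10, 20], "To": [12, 21]})

# One integer array per range, inclusive of both ends.
ranges = [
    np.arange(start, stop + 1)
    for start, stop in zip(ip_range["From"], ip_range["To"])
]

# Concatenate once, drop duplicates caused by overlapping ranges,
# and name the column so it can serve as a merge key.
ips = pd.DataFrame({"ip_int": np.unique(np.concatenate(ranges))})
```

Keep in mind that expansion stores one row per address: the first sample range in the question (45.43.144.0 to 45.43.161.255) alone covers 4,608 addresses, so 100k ranges may produce a very large frame.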
This DataFrame can easily be merged with df.
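The merge can be sketched as below on toy integer data. The second half shows an alternative that is not part of the answer: a binary search over the sorted range starts with np.searchsorted, which avoids expanding the ranges at all. All values and the ip_int column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with IPs already converted to integers (illustrative values only).
df = pd.DataFrame({"ip_int": [5, 20, 35, 41], "p_value": [3.7, 6.9, 1.2, 4.4]})
ip_range = pd.DataFrame({"From": [10, 12, 40], "To": [30, 13, 41]})

# Approach from the answer: expand every range, then inner-merge on the key.
expanded = np.unique(np.concatenate(
    [np.arange(a, b + 1) for a, b in zip(ip_range["From"], ip_range["To"])]
))
merged = df.merge(pd.DataFrame({"ip_int": expanded}), on="ip_int", how="inner")

# Alternative (not from the answer): binary search over sorted range starts.
order = np.argsort(ip_range["From"].to_numpy())
starts = ip_range["From"].to_numpy()[order]
# A running maximum of the ends keeps the check correct for overlapping ranges.
max_ends = np.maximum.accumulate(ip_range["To"].to_numpy()[order])

values = df["ip_int"].to_numpy()
# Index of the last range whose start is <= the IP, or -1 if there is none.
idx = np.searchsorted(starts, values, side="right") - 1
# An IP matches if some range starting at or before it also ends at or after it.
in_range = (idx >= 0) & (values <= max_ends[np.clip(idx, 0, None)])
filtered = df[in_range]
```

Both merged and filtered keep the same rows here. The search variant needs memory proportional to the number of ranges and rows rather than to the number of addresses covered, which may matter with 100k ranges.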