Finding similar entries in Python lists
Question
I have 2 lists of tuples, list1 = [(1.332, 3.23344, 3.22), (2.122, 2.11, 2.33), ... (1, 2, 3)] and list2 = [(4.23, 12.2, 3.333), (1.234, 3.21, 4.342), ... (1.1, 2.2, 3.3)]. Both lists are very long, somewhere in the millions of entries each. For context, each of these data points is some measure of position in two different datasets. Now I want to match each entry in list1 to an entry in list2 if it is "close enough". By close enough I mean the distance between the positions is less than some threshold value (say 0.1, for example). My initial thought was to use the min function on each entry in list1. That is, the following:
import numpy as np
import random

def dist(pt1, pt2):
    # Euclidean distance between two 3D points
    return np.sqrt(((pt2[0] - pt1[0]) ** 2) + ((pt2[1] - pt1[1]) ** 2) + ((pt2[2] - pt1[2]) ** 2))

list1 = [(random.random(), random.random(), random.random()) for _ in range(25)]
list2 = [(random.random(), random.random(), random.random()) for _ in range(20)]
threshold = .5
linker = []
for i, entry in enumerate(list1):
    # Linear scan of list2 for the point closest to this entry
    m = min(list2, key=lambda x: dist(entry, x))
    if dist(entry, m) < threshold:
        linker.append((i, list2.index(m)))
So this would link each index in list1 to an index in list2. But I feel like there must be some already developed algorithm for this specific task that is much faster. Is there?
Recommended answer
For each point in one dataset, you are looking for its nearest neighbor in a second dataset.
- The approach you posted has complexity O(N^2)
- With N ~ 1 million, this becomes untenable.
For large datasets, tree-based nearest neighbor approaches are much better, since building the tree and querying every point typically costs on the order of O(N*log(N)).
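To put the two complexity figures side by side, here is a rough back-of-the-envelope estimate. It assumes N = 1,000,000 (the order of magnitude given in the question) and ignores constant factors, so the numbers are only indicative: about 10^12 distance evaluations for the brute-force scan versus about 2 x 10^7 tree steps.

```python
import math

n = 1_000_000  # assumed size of each list (order of magnitude from the question)

# Brute force: one distance evaluation per (list1 entry, list2 entry) pair.
brute_force = n * n

# Tree-based: roughly log2(n) steps per query, for n queries.
# Constant factors are ignored, so this is only an order-of-magnitude figure.
tree_based = n * math.log2(n)

print(f"brute force: {brute_force:.1e} distance evaluations")
print(f"tree-based:  {tree_based:.1e} tree steps (rough estimate)")
print(f"ratio:       {brute_force / tree_based:,.0f}x")
```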
Two popular nearest neighbor structures in Python are KDTree and BallTree (both are available in scikit-learn, and SciPy also provides a KDTree).
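For the KDTree route, a minimal sketch using SciPy's scipy.spatial.KDTree might look like the following. The random arrays stand in for the question's list1 and list2, and the threshold of 0.1 is the example value from the question; all three are assumptions for illustration. Passing distance_upper_bound lets the query skip neighbors beyond the threshold, in which case SciPy reports an infinite distance and an index equal to the number of points in the tree.

```python
import numpy as np
from scipy.spatial import KDTree

# Stand-in data (assumption): random 3D positions instead of the real lists.
rng = np.random.RandomState(1)
list1 = rng.random_sample((25, 3))
list2 = rng.random_sample((20, 3))
threshold = 0.1  # example threshold from the question

# Build the tree on list2, because neighbors are looked up in list2.
tree = KDTree(list2)

# Nearest neighbor in list2 for every point in list1, limited to the threshold.
# Points with no neighbor inside the bound get dist == inf and ind == len(list2).
dist, ind = tree.query(list1, k=1, distance_upper_bound=threshold)

# Keep only the list1 points that actually found a neighbor.
found = np.isfinite(dist)
linker = [(int(i), int(ind[i])) for i in np.flatnonzero(found)]
print(linker)
```

Each pair in linker is (index in list1, index in list2), the same shape as the linker list built in the question.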
An example of solving this with BallTree:
import numpy as np
from sklearn.neighbors import BallTree

# Generate Dataset 1 (random positions in 3D)
rng = np.random.RandomState(0)
X = rng.random_sample((10, 3))  # 10 points in 3 dimensions

# Set up a nearest neighbor tree for Dataset 1
# to process nearest neighbor queries
tree = BallTree(X, leaf_size=2)

# Generate Dataset 2 (random positions in 3D)
Y = rng.random_sample((10, 3))

# For each point in Dataset 2, find the index of and distance to
# the closest point in Dataset 1 (using the tree built on Dataset 1)
dist, ind = tree.query(Y, k=1)  # nearest neighbor

# Results (j and d are length-1 arrays because k=1)
for i, (j, d) in enumerate(zip(ind, dist)):
    print(f'Y index {i}, closest index X is {j}, dist {d}')
Output
Y index 0, closest index X is [3], dist [0.14046915]
Y index 1, closest index X is [1], dist [0.40653272]
Y index 2, closest index X is [7], dist [0.29291477]
Y index 3, closest index X is [1], dist [0.25785655]
Y index 4, closest index X is [1], dist [0.39477652]
Y index 5, closest index X is [9], dist [0.50373484]
Y index 6, closest index X is [1], dist [0.24894356]
Y index 7, closest index X is [4], dist [0.14716665]
Y index 8, closest index X is [5], dist [0.25875381]
Y index 9, closest index X is [8], dist [0.24204497]
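The example above reports the nearest neighbor for every point but does not yet apply the "close enough" threshold from the question. One way to do that, sketched here under the assumption that the random arrays stand in for the real list1 and list2, is to build the tree on list2, query it with list1, and keep only the pairs whose distance is below the threshold.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Stand-in data (assumption): random 3D positions instead of the real lists.
rng = np.random.RandomState(0)
list1 = rng.random_sample((25, 3))
list2 = rng.random_sample((20, 3))
threshold = 0.5  # same value as in the question's snippet

# Build the tree on list2, because neighbors are looked up in list2.
tree = BallTree(list2)

# With k=1 both results have shape (len(list1), 1); flatten them to 1D.
dist, ind = tree.query(list1, k=1)
dist, ind = dist[:, 0], ind[:, 0]

# Keep only the pairs whose nearest neighbor is closer than the threshold.
close = dist < threshold
linker = list(zip(np.flatnonzero(close).tolist(), ind[close].tolist()))
print(linker[:5])
```

As in the question, each entry of linker is a pair (index in list1, index in list2). Several list1 entries may link to the same list2 entry; this sketch does not enforce a one-to-one matching.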