Finding similar entries in python lists

Problem Description

I have two lists of tuples, list1 = [(1.332, 3.23344, 3.22), (2.122, 2.11, 2.33), ... (1, 2, 3)] and list2 = [(4.23, 12.2, 3.333), (1.234, 3.21, 4.342), ... (1.1, 2.2, 3.3)]. Both lists are very long, somewhere in the millions of entries. For context, each of these data points is some measure of position in two different datasets. Now I want to match each entry in list1 to an entry in list2 if it is "close enough". By close enough I mean the distance between the positions is less than some threshold value (say 0.1, for example). My initial thought was to use the min function on each entry in list1. That is, the following:

import numpy as np
import random

def dist(pt1, pt2): 
    return np.sqrt( ((pt2[0] - pt1[0]) ** 2) + ((pt2[1] - pt1[1]) ** 2) + ((pt2[2] - pt1[2]) ** 2) ) 

list1 = [(random.random(), random.random(), random.random()) for _ in range(25)]                                                                                              

list2 = [(random.random(), random.random(), random.random()) for _ in range(20)]   

threshold = .5
linker = []
for i, entry in enumerate(list1): 
    m = min(list2, key=lambda x: dist(entry, x)) 
    if dist(entry, m) < threshold: 
        linker.append((i, list2.index(m)))

So this would link each index in list1 to an index in list2. But I feel like there must already be an algorithm developed specifically for this task that is much faster. Is there?

Recommended Answer

You're finding the nearest neighbor of each point in a dataset to a second dataset.

  1. The approach you posted has O(N^2) complexity.
  2. Since N ~ 1 million, that becomes untenable.

For large datasets nearest neighbor approaches are much better since they have complexity O(N*log(N))

Two popular ones in Python are KDTree and BallTree
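
As a minimal sketch under assumed data (not part of the original answer), the KDTree route can also be handled with scipy.spatial.cKDTree, whose query method accepts a distance_upper_bound so the threshold from the question is enforced in the same call; list1, list2 and threshold here are stand-ins for the real data:

import numpy as np
from scipy.spatial import cKDTree

# Stand-ins for the real position data from the question
list1 = np.random.random((1000, 3))
list2 = np.random.random((1000, 3))
threshold = 0.1

# Build a KD-tree over list2, then query it with every point of list1,
# rejecting anything farther away than the threshold
tree = cKDTree(list2)
d, idx = tree.query(list1, k=1, distance_upper_bound=threshold)

# When no neighbor lies within the bound, query() returns d = inf and
# idx = len(list2), so keep only the finite distances
linker = [(i, int(j)) for i, (di, j) in enumerate(zip(d, idx)) if np.isfinite(di)]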

An example of solving this with BallTree

See the sklearn BallTree documentation.

import numpy as np
from sklearn.neighbors import BallTree

# Generate Dataset 1 (random positions in 3D)
rng = np.random.RandomState(0)
X = rng.random_sample((10, 3))  # 10 points in 3 dimensions

# Setup nearest neighbor tree  for dataset 1
# to process nearest neighbor queries
tree = BallTree(X, leaf_size=2)

# Generate Dataset 2 (random positions in 3D)
Y = rng.random_sample((10, 3))

# For each point in Dataset 2
# find the index and distance to the closest 
# point in Dataset 1 (using the nearest neighbor tree
# for dataset 1)
dist, ind = tree.query(Y, k=1)  # nearest neighbor  

# Results
for i, (j, d) in enumerate(zip(ind, dist)):
    print(f'Y index {i}, closest index X is {j}, dist {d}')

Output

Y index 0, closest index X is [3], dist [0.14046915]
Y index 1, closest index X is [1], dist [0.40653272]
Y index 2, closest index X is [7], dist [0.29291477]
Y index 3, closest index X is [1], dist [0.25785655]
Y index 4, closest index X is [1], dist [0.39477652]
Y index 5, closest index X is [9], dist [0.50373484]
Y index 6, closest index X is [1], dist [0.24894356]
Y index 7, closest index X is [4], dist [0.14716665]
Y index 8, closest index X is [5], dist [0.25875381]
Y index 9, closest index X is [8], dist [0.24204497]
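
To get the (list1 index, list2 index) pairs the question actually asks for, one option (sketched here on top of the example above, not part of the original answer, with Y playing the role of list1 and X the role of list2) is to keep only matches whose nearest-neighbor distance falls under the threshold:

# Re-run the query and keep only links whose nearest-neighbor
# distance is below the question's threshold
threshold = 0.5
dist, ind = tree.query(Y, k=1)
linker = [(i, int(ind[i, 0])) for i in range(len(Y)) if dist[i, 0] < threshold]
print(linker)  # e.g. [(0, 3), (1, 1), (2, 7), ...]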
