Match strings from one numpy array to another


Problem description

Hi, I am working with Python 3 and I've been facing this issue for a while now, and I can't seem to figure it out.

I have 2 numpy arrays containing strings. The first one is:

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])

If you look closely, array_one is actually an array containing the 1-grams, 2-grams, 3-grams, 4-grams and 5-gram of the sentence alice in a wonder land.

I have purposely treated wonderland as two words, wonder and land.

Now I have another numpy array that contains some locations and names.

array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

What I want to do is get all the elements of array_one that exist in array_two.

If I take the intersection of the two arrays using np.intersect1d, I don't get any matches, since wonderland is two separate words in array_one while in array_two it's a single word.

Is there any way to do this? I've tried solutions from Stack Overflow (this) but they don't seem to work with Python 3.

array_one would have at most 60-100 items, while array_two would have at most roughly 1 million items, with an average of 250,000-500,000 items.


Edit

Since I haven't been able to find a solution so far, I've used a very naive approach: I removed the white space from both arrays and then used the resulting boolean array ([True, False, True]) to filter the original array. Below is the code:

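import numpy as np

# Remove spaces so that 'wonder land' and 'wonderland' compare equal
array_two_wr = np.char.replace(array_two, ' ', '')
array_one_wr = np.char.replace(array_one, ' ', '')
# Keep the array_two entries whose space-free form also occurs in array_one
intersections = array_two[np.isin(array_two_wr, array_one_wr)]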

But I am not sure this is the right approach considering the size of array_two.

Recommended answer

Sorry to post two answers, but after adding the locality-sensitive hashing technique above, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a Bloom filter.

A Bloom filter is a beautiful data structure that lets you add some objects, then query whether a given object has been added to it. The answer is probabilistic: a Bloom filter can report a false positive, but never a false negative. Here's an awesome visual demo of a Bloom filter.
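To make the add/query behaviour concrete, here is a minimal, standard-library-only sketch of a Bloom filter. It is an illustration, not the implementation used by the bloom-filter package; the class name TinyBloomFilter, the SHA-256 double-hashing scheme and the sample values are all assumptions made for this example.

```python
import hashlib
import math


class TinyBloomFilter:
    """Illustrative Bloom filter (hypothetical, not the bloom-filter package)."""

    def __init__(self, max_elements, error_rate):
        # Standard sizing formulas for the bit count (m) and hash count (k)
        self.num_bits = max(
            1, math.ceil(-max_elements * math.log(error_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / max_elements * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Set every bit the item hashes to
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Present only if all of the item's bits are set
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bloom_sketch = TinyBloomFilter(max_elements=1000, error_rate=0.01)
for name in ["newyork", "lasvegas", "wonderland", "florida"]:
    bloom_sketch.add(name)

print("wonderland" in bloom_sketch)  # an added item is always reported as present
print("alice" in bloom_sketch)       # very likely False, but a false positive is possible
```

The filter never stores the strings themselves, only bits, which is why memory use stays small even for a large array_two, and why a hit only means "probably present".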

In your case we can add each member of array_two to the Bloom filter, then query each member of array_one to see whether it's in the Bloom filter. Using pip install bloom-filter:

from bloom_filter import BloomFilter # pip install bloom-filter
import numpy as np
import re

def clean(s):
  '''Remove all whitespace from a string'''
  return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# initialize bloom filter; max_elements should be at least the size of array_two
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add the whitespace-free form of each member of array_two to the bloom filter
for i in array_two:
  bloom.add(clean(i))
# find the members of array_one whose whitespace-free form is (probably) in array_two
matches = [i for i in array_one if clean(i) in bloom]
print(matches)

Result: ['wonder land']

Depending on your requirements, this could be a very efficient (and highly scalable) solution.
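For comparison, if false positives are not acceptable and array_two fits in memory, the same whitespace normalization can be paired with an exact hash lookup. The following is a sketch under those assumptions, not part of the original answer; it uses plain Python lists instead of numpy arrays, and the names lookup and exact_matches are introduced here for illustration.

```python
import re
from collections import defaultdict


def clean(s):
    """Remove all whitespace so 'wonder land' and 'wonderland' compare equal."""
    return re.sub(r"\s+", "", s)


array_one = ['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a',
             'a wonder', 'wonder land', 'alice in a', 'in a wonder',
             'a wonder land', 'alice in a wonder', 'in a wonder land',
             'alice in a wonder land']
array_two = ['new york', 'las vegas', 'wonderland', 'florida']

# Map each whitespace-free form to the array_two entries that produce it
lookup = defaultdict(list)
for name in array_two:
    lookup[clean(name)].append(name)

# Keep the array_one entries whose whitespace-free form occurs in array_two
exact_matches = [s for s in array_one if clean(s) in lookup]
print(exact_matches)  # ['wonder land']
```

Building the lookup is a single pass over array_two, and each of the 60-100 queries from array_one is then a constant-time dictionary check. The trade-off against the Bloom filter is memory: the dictionary keeps every normalized string, whereas the filter keeps only bits.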
