Python Compare Tokenized Lists
Question
I need the fastest-possible solution to this problem as it will be applied to a huge data set:
Given this master list:
m=['abc','bcd','cde','def']
...and this reference list of lists:
r=[['abc','def'],['bcd','cde'],['abc','def','bcd']]
I'd like to compare each list within r to the master list (m) and generate a new list of lists. This new object will have a 1 for matches based on the order in m and 0 for non-matches. So the new object (list of lists) will always have the lists of the same length as m. Here's what I would expect based on m and r above:
[[1,0,0,1],[0,1,1,0],[1,1,0,1]]
Because the first element of r is ['abc','def'] and it matches the 1st and 4th elements of m, the result is [1,0,0,1].
Here's my approach so far (probably way too slow and is missing zeros):
output=[]
for i in r:
    output.append([1 for x in m if x in i])
Which results in:
[[1, 1], [1, 1], [1, 1, 1]]
Thanks in advance!
Recommended answer
You can use a nested list comprehension like this:
>>> m = ['abc','bcd','cde','def']
>>> r = [['abc','def'],['bcd','cde'],['abc','def','bcd']]
>>> [[1 if mx in rx else 0 for mx in m] for rx in r]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
Also, you could shorten the 1 if ... else 0 using int(...), and you can convert the sublists of r to set, so that the individual mx in rx lookups are faster.
>>> [[int(mx in rx) for mx in m] for rx in r]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
>>> [[int(mx in rx) for mx in m] for rx in map(set, r)]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
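The set-based variant can also be wrapped in a small helper, which may be easier to reuse on a large data set. This is only a sketch: the name encode_rows and its parameters are made up for illustration and do not come from the original answer.

```python
def encode_rows(master, rows):
    """Return one 0/1 list per row, aligned with the order of master."""
    encoded = []
    for row in rows:
        # Build the set once per row so each membership test is O(1) on average.
        lookup = set(row)
        encoded.append([1 if item in lookup else 0 for item in master])
    return encoded


m = ['abc', 'bcd', 'cde', 'def']
r = [['abc', 'def'], ['bcd', 'cde'], ['abc', 'def', 'bcd']]
result = encode_rows(m, r)
print(result)  # [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
```

Building the set inside the loop keeps only one set alive at a time, which may matter when r is very large.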
While int(...) is a bit shorter than 1 if ... else 0, it also seems to be slower, so you probably should not use it. Converting the sublists of r to set before the repeated lookups should speed things up for longer lists, but for your very short example lists it is in fact slower than the naive approach.
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in r]
100000 loops, best of 3: 4.74 µs per loop
>>> %timeit [[int(mx in rx) for mx in m] for rx in r]
100000 loops, best of 3: 8.07 µs per loop
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in map(set, r)]
100000 loops, best of 3: 5.82 µs per loop
For longer lists, using set becomes faster, as would be expected:
>>> import random
>>> m = [random.randint(1, 100) for _ in range(50)]
>>> r = [[random.randint(1,100) for _ in range(10)] for _ in range(20)]
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in r]
1000 loops, best of 3: 412 µs per loop
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in map(set, r)]
10000 loops, best of 3: 208 µs per loop
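%timeit is only available in IPython. To repeat the comparison from a plain Python script, something along these lines should work with the standard timeit module. The function names, the seed, and the repeat counts are arbitrary choices for this sketch, and the timings will differ from machine to machine.

```python
import random
import timeit

random.seed(0)  # arbitrary seed, only so the test data is repeatable
m = [random.randint(1, 100) for _ in range(50)]
r = [[random.randint(1, 100) for _ in range(10)] for _ in range(20)]


def with_lists():
    # Naive approach: each membership test scans the sublist.
    return [[1 if mx in rx else 0 for mx in m] for rx in r]


def with_sets():
    # Convert each sublist to a set before the repeated lookups.
    return [[1 if mx in rx else 0 for mx in m] for rx in map(set, r)]


# Both approaches must produce the same encoding.
assert with_lists() == with_sets()

calls = 200
for func in (with_lists, with_sets):
    # Take the best of 3 runs, each run making `calls` calls.
    best = min(timeit.repeat(func, number=calls, repeat=3))
    print(f"{func.__name__}: {best / calls * 1e6:.1f} us per call")
```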