最快,以确定在一个二维阵列的最长连续值范围的方式 [英] Quickest way to determine the longest consecutive value range in a 2D-Array
问题描述
问题
假设我们正在与大型数据集工作,为简单起见,我们用这个较小的一个在这个问题:
数据= [[植物,4,11],
[植物,4,12]
[植物,34.4]
[植物,6,5]
[植物,54,45]
[动物,5,76]
[动物,7.33]
[动物,11,1]
和我们要找出哪些列有连续值的最长范围,这将是最快的方式找出来,这是最好的列?
的幼稚的做法
我发现它很快可以通过对每个列进行排序
sortedDatasets = []
对于i在范围(1,LEN(数据集[0]):
sortedDatasets.append(排序(数据,键=拉姆达X:X [I]))
但这里来了laggy部分:我们可以从这里下去,做一个循环
每个分类数据集,并计算连续的元素,但是当涉及加工 for循环
蟒蛇是很慢的。
现在我的问题:有没有比这个天真的方法更快的方法,有可能连这些2D-容器的内置功能。</ P>?
更新:
pcisely范围的含义可以通过这个伪算法描述更多$ P $ - 这包括递增,如果当前值==下一个值
:
如果nextValue&GT;当前值+1:
{复位计数器}
其他:
{增加计数器}
您可以使用 GROUPBY
合理的效率做到这一点。我会做这种分阶段进行,所以你可以看到它是如何工作的。
从进口和itertools GROUPBY数据= [
[植物,4,11,
[植物,4,12,
[植物,34,4]
[植物,6,5],
[植物,54,45],
[动物,5,76,
[动物,7,33],
[动物,11,1],
]#获取数字列和放大器;就地对它们进行排序
sorted_columns = [排序(COL)为ZIP COL(*数据集)[1:]
打印sorted_columns
打印#检查`元组由t`连续数
keyfunc =拉姆达T:T [1] == T [0] + 1连续号码的每一列中运行#搜索
在sorted_columns西:
在此列中相邻的一对数字的元组#创建
对ZIP =(COL,COL [1:])
打印对
对于k,G中GROUPBY(对,键= keyfunc):
打印K,列表(G)
打印
输出
[[4,4,5,6,7,11,34,54],[1,4,5,11,12,33,45,76]][(4,4),(4,5),(5,6),(6,7),(7,11),(11,34),(34,54)]
假[(4,4)]
真[(4,5),(5,6),(6,7)]
假[(7,11),(11,34),(34,54)][(1,4),(4,5),(5,11),(11,12),(12,33),(33,45),(45,76)]
假[(1,4)]
真[(4,5)]
假[(5,11)]
真[(11,12)]
假[(12,33),(33,45),(45,76)]
现在,攻击你的实际的问题:
从进口和itertools GROUPBY数据= [
[植物,4,11,
[植物,4,12,
[植物,34,4]
[植物,6,5],
[植物,54,45],
[动物,5,76,
[动物,7,33],
[动物,11,1],
]#获取数字列和放大器;就地对它们进行排序
sorted_columns = [排序(COL)为ZIP COL(*数据集)[1:]#检查`元组由t`连续数
keyfunc =拉姆达T:T [1] == T [0] + 1#搜寻连续号码的每一列中的最长的
奔跑= []
对于我,列在枚举(sorted_columns,1):
对ZIP =(COL,COL [1:])
M = MAX(LEN(名单(G))对于k,G中GROUPBY(对,键= keyfunc)如果k)
runs.append((M,I))印数
#PRINT发现的最高运行长度,它是在中柱
打印MAX(运行)
输出
[(3,1),(1,2)]
(3,1)
FWIW,这可以浓缩成一条线。这是一个小更有效,因为它使用了几个发电机前pressions,而不是名单COM prehensions的,但它不是特别可读的:
打印MAX((MAX(LEN(名单(G))
对于k,G中GROUPBY(ZIP(COL,COL [1:]),键=拉姆达T:T [1] == T [0] + 1)如果k),I)
对于我,列在枚举((排序(COL)用于COL的ZIP(*数据集)[1:]),1))
更新
我们可以通过一些小的改动处理新的连续序列定义。
首先,我们需要一个返回键功能真
如果排序的列中相邻的两个数字之间的差额为&lt; = 1
高清keyfunc(T):
返回T [1] - T [0]&下; = 1
而不是采取哪些相匹配的关键作用,我们现在做一些简单的算术题,看看值的范围的大小序列中的序列的长度和。
高清runlen(SEQ):
返回1 + SEQ [-1] [1] - SEQ [0] [0]
全部放在一起:
高清keyfunc(T):
返回T [1] - T [0]&下; = 1高清runlen(SEQ):
返回1 + SEQ [-1] [1] - SEQ [0] [0]#获取数字列和放大器;就地对它们进行排序
sorted_columns = [排序(COL)为ZIP COL(*数据集)[1:]#搜寻连续号码的每一列中的最长的
奔跑= []
对于我,列在枚举(sorted_columns,1):
对ZIP =(COL,COL [1:])
M = MAX(runlen(名单(G))对于k,G中GROUPBY(对,键= keyfunc)如果k)
runs.append((M,I))印数
#PRINT发现的最高运行长度,它是在中柱
打印MAX(运行)
更新2
由于在评论中指出,最大
引发 ValueError错误
如果arg是空序列。一个简单的方法来处理是包装在 try..except
块中的最大
电话。这是相当有效的,如果该异常情况很少见, try..except
实际上快于相当于 IF ... ELSE
当没有引发异常的逻辑。因此,我们的可能的做这样的事情:
运行=(runlen(名单(G))对于k,G中GROUPBY(对,键= keyfunc)如果k)
尝试:
M = MAX(运行)
除了ValueError错误:
m = 0的
runs.append((M,I))
但如果发生异常,亦常最好还是使用其他方法。
下面是一个使用一个不折不扣的发电机功能的新版本, find_runs
,到位发电机前pression的。 find_runs
简单收益
SA零它开始处理列数据之前,所以最大
总会有至少一个值来处理。我内联 runlen
计算,以节省额外的函数调用的开销。这个重构也使得它更容易建立的运行
清单列表中的COM prehension。
从进口和itertools GROUPBY数据= [
[植物,4,11日,3]
[植物,4,12,5],
[植物,34,4,7],
[植物,6,5,9],
[植物,54,45,11]
[动物,5,76,13]
[动物,7,33,15]
[动物,11,1,17]
]高清keyfunc(T):
返回T [1] - T [0]&下; = 1高清find_runs(COL):
对ZIP =(COL,COL [1:])
#此窒息,如果我们没有发现任何运行停止`max`
产生出0
对于k,G中GROUPBY(对,键= keyfunc):
如果k:
#Determine运行长度
SEQ =列表(G)
收率1+ SEQ [-1] [1] - SEQ [0] [0]#获取数字列和放大器;就地对它们进行排序
sorted_columns = [排序(COL)为ZIP COL(*数据集)[1:]#搜寻连续号码的每一列中的最长的
运行= [(MAX(find_runs(COL)),I)对我,列在枚举(sorted_columns,1)]
印数#PRINT发现的最高运行长度,它是在中柱
打印MAX(运行)
输出
[(4,1),(2,2),(0,3)]
(4,1)
The problem
let's assume we're working with a large dataset and for the sake of simplicity we use this smaller one in this question:
dataset = [["PLANT", 4,11],
["PLANT", 4,12],
["PLANT", 34,4],
["PLANT", 6,5],
["PLANT", 54,45],
["ANIMAL", 5,76],
["ANIMAL", 7,33],
["Animal", 11,1]]
and we want to find out which column has the longest range of consecutive values, what would be the fastest way to find out, which is the best column?
The naive approach
I found out it quickly can be sorted by each column with
sortedDatasets = []
for i in range(1,len(dataset[0]):
sortedDatasets.append(sorted(dataset,key=lambda x: x[i]))
But here comes the laggy part: we could go on from here and do a for loop
for each of the sorted datasets, and count the consecutive elements but when it comes to processing for loops
python is very slow.
Now my the question: is there a faster way than this naive approach, is there maybe even an built-in function for these 2D-containers?
Update:
More precisely the meaning of a range can be described by this pseudo algorithm - this includes incrementing if current value == next value
:
if nextValue > current Value +1:
{reset counter}
else:
{increment counter}
You can do this with reasonable efficiency using groupby
. I'll do this in stages, so you can see how it works.
from itertools import groupby
dataset = [
["PLANT", 4, 11],
["PLANT", 4, 12],
["PLANT", 34, 4],
["PLANT", 6, 5],
["PLANT", 54, 45],
["ANIMAL", 5, 76],
["ANIMAL", 7, 33],
["ANIMAL", 11, 1],
]
# Get numeric columns & sort them in-place
sorted_columns = [sorted(col) for col in zip(*dataset)[1:]]
print sorted_columns
print
# Check if tuple `t` consists of consecutive numbers
keyfunc = lambda t: t[1] == t[0] + 1
# Search for runs of consecutive numbers in each column
for col in sorted_columns:
#Create tuples of adjacent pairs of numbers in this column
pairs = zip(col, col[1:])
print pairs
for k,g in groupby(pairs, key=keyfunc):
print k, list(g)
print
output
[[4, 4, 5, 6, 7, 11, 34, 54], [1, 4, 5, 11, 12, 33, 45, 76]]
[(4, 4), (4, 5), (5, 6), (6, 7), (7, 11), (11, 34), (34, 54)]
False [(4, 4)]
True [(4, 5), (5, 6), (6, 7)]
False [(7, 11), (11, 34), (34, 54)]
[(1, 4), (4, 5), (5, 11), (11, 12), (12, 33), (33, 45), (45, 76)]
False [(1, 4)]
True [(4, 5)]
False [(5, 11)]
True [(11, 12)]
False [(12, 33), (33, 45), (45, 76)]
Now, to attack your actual question:
from itertools import groupby
dataset = [
["PLANT", 4, 11],
["PLANT", 4, 12],
["PLANT", 34, 4],
["PLANT", 6, 5],
["PLANT", 54, 45],
["ANIMAL", 5, 76],
["ANIMAL", 7, 33],
["ANIMAL", 11, 1],
]
# Get numeric columns & sort them in-place
sorted_columns = [sorted(col) for col in zip(*dataset)[1:]]
# Check if tuple `t` consists of consecutive numbers
keyfunc = lambda t: t[1] == t[0] + 1
#Search for the longest run of consecutive numbers in each column
runs = []
for i, col in enumerate(sorted_columns, 1):
pairs = zip(col, col[1:])
m = max(len(list(g)) for k,g in groupby(pairs, key=keyfunc) if k)
runs.append((m, i))
print runs
#Print the highest run length found and the column it was found in
print max(runs)
output
[(3, 1), (1, 2)]
(3, 1)
FWIW, this can be condensed into a single line. It's a little more efficient since it uses a couple of generator expressions instead of list comprehensions, but it's not particularly readable:
print max((max(len(list(g))
for k,g in groupby(zip(col, col[1:]), key=lambda t: t[1] == t[0] + 1) if k), i)
for i, col in enumerate((sorted(col) for col in zip(*dataset)[1:]), 1))
Update
We can handle your new consecutive sequence definition by making a few minor changes.
Firstly, we need a key function that returns True
if the difference between an adjacent pair of numbers in a sorted column is <= 1.
def keyfunc(t):
return t[1] - t[0] <= 1
And instead of taking the length of the sequences which match that key function we now do some simple arithmetic to see the size of the range of values in the sequence.
def runlen(seq):
return 1 + seq[-1][1] - seq[0][0]
Putting it all together:
def keyfunc(t):
return t[1] - t[0] <= 1
def runlen(seq):
return 1 + seq[-1][1] - seq[0][0]
# Get numeric columns & sort them in-place
sorted_columns = [sorted(col) for col in zip(*dataset)[1:]]
#Search for the longest run of consecutive numbers in each column
runs = []
for i, col in enumerate(sorted_columns, 1):
pairs = zip(col, col[1:])
m = max(runlen(list(g)) for k,g in groupby(pairs, key=keyfunc) if k)
runs.append((m, i))
print runs
#Print the highest run length found and the column it was found in
print max(runs)
Update 2
As noted in the comments, max
raises ValueError
if its arg is an empty sequence. A simple way to handle that is to wrap the max
call in a try..except
block. This is quite efficient if the exception happens rarely, try..except
is actually faster than equivalent if...else
logic when the exception isn't raised. So we could do something like this:
run = (runlen(list(g)) for k,g in groupby(pairs, key=keyfunc) if k)
try:
m = max(run)
except ValueError:
m = 0
runs.append((m, i))
But if that exception happens fairly frequently it's better to use another approach.
Here's a new version that uses a fully-fledged generator function, find_runs
, in place of the generator expression. find_runs
simply yield
s a zero before it starts processing the column data so max
will always have at least one value to process. I've inlined the runlen
calculation to save on the overhead of an additional function call. This refactoring also makes it easier to build the runs
list in a list comprehension.
from itertools import groupby
dataset = [
["PLANT", 4, 11, 3],
["PLANT", 4, 12, 5],
["PLANT", 34, 4, 7],
["PLANT", 6, 5, 9],
["PLANT", 54, 45, 11],
["ANIMAL", 5, 76, 13],
["ANIMAL", 7, 33, 15],
["ANIMAL", 11, 1, 17],
]
def keyfunc(t):
return t[1] - t[0] <= 1
def find_runs(col):
pairs = zip(col, col[1:])
#This stops `max` from choking if we don't find any runs
yield 0
for k, g in groupby(pairs, key=keyfunc):
if k:
#Determine run length
seq = list(g)
yield 1 + seq[-1][1] - seq[0][0]
# Get numeric columns & sort them in-place
sorted_columns = [sorted(col) for col in zip(*dataset)[1:]]
#Search for the longest run of consecutive numbers in each column
runs = [(max(find_runs(col)), i) for i, col in enumerate(sorted_columns, 1)]
print runs
#Print the highest run length found and the column it was found in
print max(runs)
output
[(4, 1), (2, 2), (0, 3)]
(4, 1)
这篇关于最快,以确定在一个二维阵列的最长连续值范围的方式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!