最快,以确定在一个二维阵列的最长连续值范围的方式 [英] Quickest way to determine the longest consecutive value range in a 2D-Array

查看:215
本文介绍了最快,以确定在一个二维阵列的最长连续值范围的方式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

问题

假设我们正在与大型数据集工作,为简单起见,我们用这个较小的一个在这个问题:

 数据= [[植物,4,11],
           [植物,4,12]
           [植物,34.4]
           [植物,6,5]
           [植物,54,45]
           [动物,5,76]
           [动物,7.33]
           [动物,11,1]

和我们要找出哪些列有连续值的最长范围,这将是最快的方式找出来,这是最好的列?

的幼稚的做法

我发现它很快可以通过对每个列进行排序

  sortedDatasets = []
对于i在范围(1,LEN(数据集[0]):
    sortedDatasets.append(排序(数据,键=拉姆达X:X [I]))

但这里来了laggy部分:我们可以从这里下去,做一个循环每个分类数据集,并计算连续的元素,但是当涉及加工 for循环蟒蛇是很慢的。

现在我的问题:有没有比这个天真的方法更快的方法,有可能连这些2D-容器的内置功能。<​​/ P>?


更新:

pcisely范围的含义可以通过这个伪算法描述

更多$ P $ - 这包括递增,如果当前值==下一个值

 如果nextValue&GT;当前值+1:
     {复位计数器}
其他:
     {增加计数器}


解决方案

您可以使用 GROUPBY 合理的效率做到这一点。我会做这种分阶段进行,所以你可以看到它是如何工作的。

 从进口和itertools GROUPBY数据= [
    [植物,4,11,
    [植物,4,12,
    [植物,34,4]
    [植物,6,5],
    [植物,54,45],
    [动物,5,76,
    [动物,7,33],
    [动物,11,1],
]#获取数字列和放大器;就地对它们进行排序
sorted_columns = [排序(COL)为ZIP COL(*数据集)[1:]
打印sorted_columns
打印#检查`元组由t`连续数
keyfunc =拉姆达T:T [1] == T [0] + 1连续号码的每一列中运行#搜索
在sorted_columns西:
    在此列中相邻的一对数字的元组#创建
    对ZIP =(COL,COL [1:])
    打印对
    对于k,G中GROUPBY(对,键= keyfunc):
        打印K,列表(G)
    打印

输出

  [[4,4,5,6,7,11,34,54],[1,4,5,11,12,33,45,76]][(4,4),(4,5),(5,6),(6,7),(7,11),(11,34),(34,54)]
假[(4,4)]
真[(4,5),(5,6),(6,7)]
假[(7,11),(11,34),(34,54)][(1,4),(4,5),(5,11),(11,12),(12,33),(33,45),(45,76)]
假[(1,4)]
真[(4,5)]
假[(5,11)]
真[(11,12)]
假[(12,33),(33,45),(45,76)]

现在,攻击你的实际的问题:

 从进口和itertools GROUPBY数据= [
    [植物,4,11,
    [植物,4,12,
    [植物,34,4]
    [植物,6,5],
    [植物,54,45],
    [动物,5,76,
    [动物,7,33],
    [动物,11,1],
]#获取数字列和放大器;就地对它们进行排序
sorted_columns = [排序(COL)为ZIP COL(*数据集)[1:]#检查`元组由t`连续数
keyfunc =拉姆达T:T [1] == T [0] + 1#搜寻连续号码的每一列中的最长的
奔跑= []
对于我,列在枚举(sorted_columns,1):
    对ZIP =(COL,COL [1:])
    M = MAX(LEN(名单(G))对于k,G中GROUPBY(对,键= keyfunc)如果k)
    runs.append((M,I))印数
#PRINT发现的最高运行长度,它是在中柱
打印MAX(运行)

输出

  [(3,1),(1,2)]
(3,1)


FWIW,这可以浓缩成一条线。这是一个小更有效,因为它使用了几个发电机前pressions,而不是名单COM prehensions的,但它不是特别可读的:

 打印MAX((MAX(LEN(名单(G))
    对于k,G中GROUPBY(ZIP(COL,COL [1:]),键=拉姆达T:T [1] == T [0] + 1)如果k),I)
        对于我,列在枚举((排序(COL)用于COL的ZIP(*数据集)[1:]),1))


更新

我们可以通过一些小的改动处理新的连续序列定义。

首先,我们需要一个返回键功能如果排序的列中相邻的两个数字之间的差额为&lt; = 1

 高清keyfunc(T):
    返回T [1] - T [0]&下; = 1

而不是采取哪些相匹配的关键作用,我们现在做一些简单的算术题,看看值的范围的大小序列中的序列的长度和。

 高清runlen(SEQ):
    返回1 + SEQ [-1] [1] - SEQ [0] [0]

全部放在一起:

 高清keyfunc(T):
    返回T [1] - T [0]&下; = 1高清runlen(SEQ):
    返回1 + SEQ [-1] [1] - SEQ [0] [0]#获取数字列和放大器;就地对它们进行排序
sorted_columns = [排序(COL)为ZIP COL(*数据集)[1:]#搜寻连续号码的每一列中的最长的
奔跑= []
对于我,列在枚举(sorted_columns,1):
    对ZIP =(COL,COL [1:])
    M = MAX(runlen(名单(G))对于k,G中GROUPBY(对,键= keyfunc)如果k)
    runs.append((M,I))印数
#PRINT发现的最高运行长度,它是在中柱
打印MAX(运行)


更新2

由于在评论中指出,最大引发 ValueError错误如果arg是空序列。一个简单的方法来处理是包装在 try..except 块中的最大电话。这是相当有效的,如果该异常情况很少见, try..except 实际上快于相当于 IF ... ELSE 当没有引发异常的逻辑。因此,我们的可能的做这样的事情:

 运行=(runlen(名单(G))对于k,G中GROUPBY(对,键= keyfunc)如果k)
尝试:
    M = MAX(运行)
除了ValueError错误:
    m = 0的
runs.append((M,I))

但如果发生异常,亦常最好还是使用其他方法。

下面是一个使用一个不折不扣的发电机功能的新版本, find_runs ,到位发电机前pression的。 find_runs 简单收益 SA零它开始处理列数据之前,所以最大总会有至少一个值来处理。我内联 runlen 计算,以节省额外的函数调用的开销。这个重构也使得它更容易建立的运行清单列表中的COM prehension。

 从进口和itertools GROUPBY数据= [
    [植物,4,11日,3]
    [植物,4,12,5],
    [植物,34,4,7],
    [植物,6,5,9],
    [植物,54,45,11]
    [动物,5,76,13]
    [动物,7,33,15]
    [动物,11,1,17]
]高清keyfunc(T):
    返回T [1] - T [0]&下; = 1高清find_runs(COL):
    对ZIP =(COL,COL [1:])
    #此窒息,如果我们没有发现任何运行停止`max`
    产生出0
    对于k,G中GROUPBY(对,键= keyfunc):
        如果k:
            #Determine运行长度
            SEQ =列表(G)
            收率1+ SEQ [-1] [1] - SEQ [0] [0]#获取数字列和放大器;就地对它们进行排序
sorted_columns = [排序(COL)为ZIP COL(*数据集)[1:]#搜寻连续号码的每一列中的最长的
运行= [(MAX(find_runs(COL)),I)对我,列在枚举(sorted_columns,1)]
印数#PRINT发现的最高运行长度,它是在中柱
打印MAX(运行)

输出

  [(4,1),(2,2),(0,3)]
(4,1)

The problem

let's assume we're working with a large dataset and for the sake of simplicity we use this smaller one in this question:

dataset = [["PLANT", 4,11],
           ["PLANT", 4,12],
           ["PLANT", 34,4],
           ["PLANT", 6,5],
           ["PLANT", 54,45],
           ["ANIMAL", 5,76],
           ["ANIMAL", 7,33],
           ["Animal", 11,1]]

and we want to find out which column has the longest range of consecutive values, what would be the fastest way to find out, which is the best column?

The naive approach

I found out it quickly can be sorted by each column with

sortedDatasets = []
for i in range(1,len(dataset[0]):
    sortedDatasets.append(sorted(dataset,key=lambda x: x[i]))

But here comes the laggy part: we could go on from here and do a for loop for each of the sorted datasets, and count the consecutive elements but when it comes to processing for loops python is very slow.

Now my the question: is there a faster way than this naive approach, is there maybe even an built-in function for these 2D-containers?


Update:

More precisely the meaning of a range can be described by this pseudo algorithm - this includes incrementing if current value == next value:

if nextValue > current Value +1: 
     {reset counter} 
else: 
     {increment counter}

解决方案

You can do this with reasonable efficiency using groupby. I'll do this in stages, so you can see how it works.

from itertools import groupby

dataset = [
    ["PLANT", 4, 11],
    ["PLANT", 4, 12],
    ["PLANT", 34, 4],
    ["PLANT", 6, 5],
    ["PLANT", 54, 45],
    ["ANIMAL", 5, 76],
    ["ANIMAL", 7, 33],
    ["ANIMAL", 11, 1],
]

# Get numeric columns & sort them in-place
sorted_columns = [sorted(col) for col in zip(*dataset)[1:]]
print sorted_columns
print

# Check if tuple `t` consists of consecutive numbers
keyfunc = lambda t: t[1] == t[0] + 1

# Search for runs of consecutive numbers in each column
for col in sorted_columns:
    #Create tuples of adjacent pairs of numbers in this column
    pairs = zip(col, col[1:])
    print pairs
    for k,g in groupby(pairs, key=keyfunc):
        print k, list(g)
    print

output

[[4, 4, 5, 6, 7, 11, 34, 54], [1, 4, 5, 11, 12, 33, 45, 76]]

[(4, 4), (4, 5), (5, 6), (6, 7), (7, 11), (11, 34), (34, 54)]
False [(4, 4)]
True [(4, 5), (5, 6), (6, 7)]
False [(7, 11), (11, 34), (34, 54)]

[(1, 4), (4, 5), (5, 11), (11, 12), (12, 33), (33, 45), (45, 76)]
False [(1, 4)]
True [(4, 5)]
False [(5, 11)]
True [(11, 12)]
False [(12, 33), (33, 45), (45, 76)]

Now, to attack your actual question:

from itertools import groupby

dataset = [
    ["PLANT", 4, 11],
    ["PLANT", 4, 12],
    ["PLANT", 34, 4],
    ["PLANT", 6, 5],
    ["PLANT", 54, 45],
    ["ANIMAL", 5, 76],
    ["ANIMAL", 7, 33],
    ["ANIMAL", 11, 1],
]

# Get numeric columns & sort them in-place
sorted_columns = [sorted(col) for col in zip(*dataset)[1:]]

# Check if tuple `t` consists of consecutive numbers
keyfunc = lambda t: t[1] == t[0] + 1

#Search for the longest run of consecutive numbers in each column
runs = []
for i, col in enumerate(sorted_columns, 1):
    pairs = zip(col, col[1:])
    m = max(len(list(g)) for k,g in groupby(pairs, key=keyfunc) if k)
    runs.append((m, i))

print runs
#Print the highest run length found and the column it was found in
print max(runs)

output

[(3, 1), (1, 2)]
(3, 1)


FWIW, this can be condensed into a single line. It's a little more efficient since it uses a couple of generator expressions instead of list comprehensions, but it's not particularly readable:

print max((max(len(list(g)) 
    for k,g in groupby(zip(col, col[1:]), key=lambda t: t[1] == t[0] + 1) if k), i)
        for i, col in enumerate((sorted(col) for col in zip(*dataset)[1:]), 1))


Update

We can handle your new consecutive sequence definition by making a few minor changes.

Firstly, we need a key function that returns True if the difference between an adjacent pair of numbers in a sorted column is <= 1.

def keyfunc(t):
    return t[1] - t[0] <= 1

And instead of taking the length of the sequences which match that key function we now do some simple arithmetic to see the size of the range of values in the sequence.

def runlen(seq):
    return 1 + seq[-1][1] - seq[0][0]

Putting it all together:

def keyfunc(t):
    return t[1] - t[0] <= 1

def runlen(seq):
    return 1 + seq[-1][1] - seq[0][0]

# Get numeric columns & sort them in-place
sorted_columns = [sorted(col) for col in zip(*dataset)[1:]]

#Search for the longest run of consecutive numbers in each column
runs = []
for i, col in enumerate(sorted_columns, 1):
    pairs = zip(col, col[1:])
    m = max(runlen(list(g)) for k,g in groupby(pairs, key=keyfunc) if k)
    runs.append((m, i))

print runs
#Print the highest run length found and the column it was found in
print max(runs)


Update 2

As noted in the comments, max raises ValueError if its arg is an empty sequence. A simple way to handle that is to wrap the max call in a try..except block. This is quite efficient if the exception happens rarely, try..except is actually faster than equivalent if...else logic when the exception isn't raised. So we could do something like this:

run = (runlen(list(g)) for k,g in groupby(pairs, key=keyfunc) if k)
try:
    m = max(run)
except ValueError:
    m = 0
runs.append((m, i))

But if that exception happens fairly frequently it's better to use another approach.

Here's a new version that uses a fully-fledged generator function, find_runs, in place of the generator expression. find_runs simply yields a zero before it starts processing the column data so max will always have at least one value to process. I've inlined the runlen calculation to save on the overhead of an additional function call. This refactoring also makes it easier to build the runs list in a list comprehension.

from itertools import groupby

dataset = [
    ["PLANT", 4, 11, 3],
    ["PLANT", 4, 12, 5],
    ["PLANT", 34, 4, 7],
    ["PLANT", 6, 5, 9],
    ["PLANT", 54, 45, 11],
    ["ANIMAL", 5, 76, 13],
    ["ANIMAL", 7, 33, 15],
    ["ANIMAL", 11, 1, 17],
]

def keyfunc(t):
    return t[1] - t[0] <= 1

def find_runs(col):
    pairs = zip(col, col[1:])
    #This stops `max` from choking if we don't find any runs
    yield 0
    for k, g in groupby(pairs, key=keyfunc):
        if k:
            #Determine run length
            seq = list(g)
            yield 1 + seq[-1][1] - seq[0][0]

# Get numeric columns & sort them in-place
sorted_columns = [sorted(col) for col in zip(*dataset)[1:]]

#Search for the longest run of consecutive numbers in each column
runs = [(max(find_runs(col)), i) for i, col in enumerate(sorted_columns, 1)]
print runs

#Print the highest run length found and the column it was found in
print max(runs)

output

[(4, 1), (2, 2), (0, 3)]
(4, 1)

这篇关于最快,以确定在一个二维阵列的最长连续值范围的方式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆