How can I remove indices of non-max values that correspond to duplicate values of separate list from both lists?


Problem Description


I have two lists, the first of which represents times of observation and the second of which represents the observed values at those times. I am trying to find the maximum observed value and the corresponding time given a rolling window of varying length. For example's sake, here are the two lists.

# observed values
linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]

# times that correspond to observed values
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]

# actual dataset is of size ~ 11,000

The missing times (ex: 3.0) correspond to an observed value of zero, whereas duplicate times correspond to multiple observations at the floored time. Since my window will be rolling over the time_count (ex: max value in first 2 hours, next 2 hours, 2 hours after that; max value in first 4 hours, next 4 hours, ...), I plan to use an array-reshaping routine. However, it's important to set everything up properly beforehand, which entails finding the maximum value given duplicate times. To solve this problem, I tried the code just below.

def list_duplicates(data_list):
    seen = set()
    seen_add = seen.add
    seen_twice = set(x for x in data_list if x in seen or seen_add(x))
    return list(seen_twice)

# check for duplicate values
dups = list_duplicates(time_count)
print(dups)
>> [8.0, 10.0]

# get index of duplicates
for dup in dups:
    print(time_count.index(dup))
>> 2
>> 4

When checking for the index of the duplicates, it appears that this code will only return the index of the first occurrence of the duplicate value. For reasons concerning code efficiency/speed, I also tried using OrderedDict from the collections module, but dictionaries have a similar problem: given duplicate keys for non-duplicate observation values, the first instance of the duplicate key and its corresponding observation value is kept while all others are dropped from the dict. Per this SO post, my second attempt is just below.

for dup in dups:
    indexes = [i for i,x in enumerate(time_count) if x == dup]
print(indexes)
>> [4, 5, 6] # indices correspond to duplicate time 10s but not duplicate time 8s

I should be getting [2,3] for time in time_count = 8.0 and [4,5,6] for time in time_count = 10.0. From the duplicate time_counts, 475.2 is the max linspeed that corresponds to duplicate time_count 8.0 and 400.9 is the max linspeed that corresponds to duplicate time_count 10.0, meaning that the other linspeeds at leftover indices of duplicate time_counts would be removed.
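
(The expected index groups can be written down directly with a small pure-Python sketch like the one below, which simply collects every index of each duplicate time from the sample lists above; the dictionary name is illustrative only, and the open question remains how to do this efficiently on the full dataset.)

from collections import defaultdict

# map each duplicate time to all of the indices at which it occurs
dup_indexes = defaultdict(list)
for i, t in enumerate(time_count):
    if t in dups:
        dup_indexes[t].append(i)

print(dict(dup_indexes))
>> {8.0: [2, 3], 10.0: [4, 5, 6]}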

I'm not sure what else I can try. How can I adapt this (or find a new approach) to find all of the indices that correspond to duplicate values in an efficient manner? Any advice would be appreciated. (PS - I made numpy a tag because I think there is a way to do this via numpy that I haven't figured out yet.)

Solution

Without going into the details of how to implement an efficient rolling-window-maximum filter: reducing the duplicate values can be seen as a grouping problem, for which the numpy_indexed package (disclaimer: I am its author) provides an efficient and simple solution:

import numpy_indexed as npi
unique_time, unique_speed = npi.group_by(time_count).max(linspeed)
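
On the sample lists above, this should reduce the duplicate times 8.0 and 10.0 to their maximum linspeeds (a quick check of the expected result, not output copied from the original answer):

print(unique_time)   # expected: 4.0, 6.0, 8.0, 10.0, 14.0, 16.0
print(unique_speed)  # expected: 280.0, 275.0, 475.2, 400.9, 323.8, 289.7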

For large input datasets (i.e., where it matters), this should be a lot faster than any non-vectorized solution. Memory consumption is linear, and performance is in general O(N log N); but since time_count appears to be sorted already, performance should be linear as well.
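
For comparison only, the same grouping can also be sketched with plain NumPy (np.unique to build per-element group indices, followed by an unbuffered np.maximum.at scatter); this alternative is not part of the numpy_indexed answer above:

import numpy as np

time_arr = np.asarray(time_count)
speed_arr = np.asarray(linspeed)

# unique times in sorted order, plus the group index of every original element
unique_time, inverse = np.unique(time_arr, return_inverse=True)

# scatter-max each speed into its group; seeding with -inf so any observation wins
unique_speed = np.full(unique_time.shape, -np.inf)
np.maximum.at(unique_speed, inverse, speed_arr)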
