How to create multiple lists from one list according to matching substrings?

Question

I have a list of strings in python consisting of various filenames, like this (but much longer):

all_templates = ['fitting_file_expdisk_cutout-IMG-HSC-I-18115-6,3-OBJ-NEP175857.9+655841.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-3,3-OBJ-NEP180508.6+655617.3.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-1,8-OBJ-NEP180840.8+665226.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,7-OBJ-NEP175927.6+664230.2.feedme', 'fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme', 'fitting_file_devauc_cutout-IMG-HSC-I-18114-0,3-OBJ-NEP175616.1+660601.5.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme']

I'd like to create multiple smaller lists for elements that have the same object name (the substring starting with OBJ- and ending right before .feedme). So I'd have a list like this:

obj1 = ['fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme']

and so on for other matching 'objects'. In reality I have over 900 unique 'objects', and the original list all_templates has over 4000 elements because each object has 3 or more separate template files (which are all appearing in a random order to start). So in the end I'll want to have over 900 lists (one per object). How can I do this?

Here is what I tried, but it is giving me a list of ALL the original template filenames inside each sublist (which are each supposed to be unique for one object name).

import re
# Break up list into multiple lists according to substring (object name)
obj_list = [re.search(r'.*(OBJ.+)\.feedme', filename)[1] for filename in all_template_files]
obj_list = list(set(obj_list)) # create list of unique objects (remove duplicates)

templates_objs_sorted = [[]]*len(obj_list)
for i in range(len(obj_list)):
    for template in all_template_files:
        if obj_list[i] in template:
            templates_objs_sorted[i].append(template)

Recommended answer

from collections import defaultdict
from pprint import pprint

all_templates = ['fitting_file_expdisk_cutout-IMG-HSC-I-18115-6,3-OBJ-NEP175857.9+655841.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-3,3-OBJ-NEP180508.6+655617.3.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-1,8-OBJ-NEP180840.8+665226.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,7-OBJ-NEP175927.6+664230.2.feedme', 'fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme', 'fitting_file_devauc_cutout-IMG-HSC-I-18114-0,3-OBJ-NEP175616.1+660601.5.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme']

# simple helper function to extract the common object name
# you could probably use Regex... but then you'd have 2 problems
def objectName(path):
    start = path.index('-OBJ-')
    stop = path.index('.feedme')
    return path[(start + 5):stop]

# I really wanted to use a one line reduce here, but... 
grouped = defaultdict(list)
for each in all_templates:
    grouped[objectName(each)].append(each)
pprint(grouped)
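
As for why the attempt in the question ends up with every filename in every sublist: `[[]] * n` does not build n separate lists, it repeats a reference to one list n times, so each append lands in the same object. A minimal sketch of the difference (the names `shared` and `independent` are just for illustration):

```python
n = 3

# Multiplying a list of lists copies the reference, not the inner list
shared = [[]] * n
shared[0].append("a")

# A comprehension evaluates [] once per iteration, giving distinct lists
independent = [[] for _ in range(n)]
independent[0].append("a")

print(shared)
print(independent)
```

Swapping the comprehension into the question's code should fix the symptom, though the nested loop still scans the full list once per object, which the defaultdict version above avoids.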

ASIDE/TANGENT

OK, it really bugged me that I couldn't do a simple one-liner using reduce there. Ultimately, I wish Python had a good groupby function. It has a function by that name (itertools.groupby), but it only groups consecutive items that share a key. Smalltalk, Objective-C, and Swift all have groupby mechanisms which basically allow you to bucketize an iterable by an arbitrary transfer function.
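
To illustrate the "consecutive keys" limitation: itertools.groupby can still do the job if the input is sorted by the same key first. This is only a sketch; `object_name` is a snake_case copy of the helper above, and `sample` is a three-element subset of the question's list.

```python
from itertools import groupby

def object_name(path):
    # Text between '-OBJ-' and '.feedme'
    start = path.index('-OBJ-') + len('-OBJ-')
    stop = path.index('.feedme')
    return path[start:stop]

sample = [
    'fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme',
    'fitting_file_devauc_cutout-IMG-HSC-I-18114-0,3-OBJ-NEP175616.1+660601.5.feedme',
    'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme',
]

# Sorting puts equal keys next to each other, which groupby requires
by_sorted = {
    name: list(members)
    for name, members in groupby(sorted(sample, key=object_name), key=object_name)
}
```

The sort costs O(n log n) and loses the original ordering across groups, so the plain defaultdict loop is probably still the simpler choice here.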

My initial attempt was:

grouped = reduce(
    lambda accum, each: accum[objectName(each)].append(each),
    all_templates,
    defaultdict(list))

The problem is the lambda. A lambda is limited to a single expression, and for it to work in reduce, it must return a modified version of the accumulated argument. But Python doesn't like to return things from functions/methods unless it has to: list.append returns None, so after the first step the accumulator is None rather than the dictionary. Even if we replaced the append with <accessTheCurrentList> + [each], we'd need a dictionary modifying method that updated the value at a key and returned the modified dictionary. I could not find such a thing.

However, what we can do is load more information into our accumulator, for example, a tuple. We can use one slot of the tuple to keep passing the defaultdict reference along, and the other to catch the unhelpful None return of the modifying operation. It ends up pretty ugly, but it is a one-liner:

from functools import reduce
grouped = reduce(
    lambda accum, each: (accum[0], accum[0][objectName(each)].append(each)),
    all_templates,
    (defaultdict(list), None))[0]
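
A somewhat shorter variant, offered only as an alternative sketch and not as what the answer above does: since append returns None, which is falsy, `... or accum` evaluates to the accumulator itself. Here `object_name` and `sample` are illustrative stand-ins for the helper and list above.

```python
from collections import defaultdict
from functools import reduce

def object_name(path):
    # Text between '-OBJ-' and '.feedme'
    start = path.index('-OBJ-') + len('-OBJ-')
    stop = path.index('.feedme')
    return path[start:stop]

sample = [
    'fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme',
    'fitting_file_devauc_cutout-IMG-HSC-I-18114-0,3-OBJ-NEP175616.1+660601.5.feedme',
    'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme',
]

# append() returns None, so "None or accum" hands the dict back to reduce
grouped_or = reduce(
    lambda accum, each: accum[object_name(each)].append(each) or accum,
    sample,
    defaultdict(list))
```

This leans on a side effect inside an expression, so it is arguably no more readable than the explicit for loop.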
