Splitting a list of file names in a predefined ratio


Problem description


I am trying to find an efficient way to split a list of file names (examples below) in an x:y ratio based on the file names. The file list was obtained with os.scandir (better performance than os.listdir; source: Python docs for scandir).

Example:

Files (extension disregarded):

A_1, A_2, ... A_10 (here A is the file name and 1 is the sample number of the file)

B_1, B_2, ... B_10

and so on.

Let's say the x:y ratio is 7:3. I would then like 70% of the file names (A_1..A_7, B_1..B_7) in one list and 30% (A_8..A_10, B_8..B_10) in another. The order within the first list does not matter: it could contain A_1, A_9, A_5 and so on, as long as each file name is split 7 files in list 1 to 3 files in list 2.

Note that this directory is huge (~150k files) and the number of samples per file name varies: file name A may have 1000 files, or it may have only 5. There are about 400 unique file names.
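To make the naming scheme above concrete, here is a minimal sketch of how the base name and sample number might be pulled out of each directory entry. The helpers split_name and scan_names are hypothetical names introduced for illustration, and the sketch assumes every file follows the base_number pattern, with the last underscore separating the two parts.

```python
import os

def split_name(file_name):
    """Return (base, sample) for a name such as 'A_1.txt' -> ('A', 1).

    Assumes the part after the last underscore is an integer sample number.
    """
    stem, _ext = os.path.splitext(file_name)   # drop the extension
    base, _, sample = stem.rpartition("_")     # split on the last underscore
    return base, int(sample)

def scan_names(directory):
    """Yield (base, sample, entry) for every regular file in directory."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                base, sample = split_name(entry.name)
                yield base, sample, entry
```

A name without an underscore would make int() raise ValueError here, so real data may need extra validation.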

My current solution should not really be called a solution, because it defeats the purpose of an accurate ratio for each file name. It splits the list of fileObjects (basically: a name like A, a number like 1, the data within file A_1, and so on) as a whole in the x:y ratio, relying on the fact that os.scandir yields entries in arbitrary order.

ratio_number = int(len(list_of_fileObjects) * 0.7)
list_70 = list_of_fileObjects[:ratio_number]
list_30 = list_of_fileObjects[ratio_number:]

My second approach, which would at least be a valid solution, is to create a separate list for each file name (this involves sorting the whole list of files), split it in the ratio, and repeat for each file name. I am looking for a more Pythonic/elegant solution to this problem. Any suggestions or help would be appreciated, especially considering the amount of data involved.
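The per-file-name approach described above does not strictly need a sort: grouping in a single pass with a dictionary gives the same groups. The following is only a sketch under stated assumptions. The function split_by_group and its parameters key, x, y and seed are hypothetical names, and the ratio is expressed with integers to avoid floating-point rounding.

```python
import random
from collections import defaultdict

def split_by_group(items, key, x=7, y=3, seed=None):
    """Split items into two lists so that each group is divided roughly x:y.

    key(item) returns the group (the file name such as 'A') of an item.
    """
    # Single pass: bucket every item under its group, no sorting required.
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    rng = random.Random(seed)
    first, second = [], []
    for members in groups.values():
        rng.shuffle(members)                  # order inside a group is irrelevant
        cut = len(members) * x // (x + y)     # integer share for the first list
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second
```

With this rounding, a group of 10 is split 7 to 3 and a group of 5 is split 3 to 2, while a group containing a single file lands entirely in the second list; whether that is acceptable depends on the use case.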

Recommended answer

I figured out a good solution to this problem.

from collections import Counter

# ObjList is a list of objects, but we only need
# file_name from each object for this solution.

# First pass: count how many files share each file_name.
all_file_names = Counter(x.file_name for x in ObjList)

trainingData = []
testData = []
temp_dict = Counter()  # files seen so far per file_name

# Second pass: send roughly the first 70% of each file_name
# to trainingData and the rest to testData.
for x in ObjList:
    ratio = int(0.7 * all_file_names[x.file_name]) + 1
    temp_dict[x.file_name] += 1
    # The first file of each name always goes to trainingData.
    if temp_dict[x.file_name] == 1 or temp_dict[x.file_name] < ratio:
        trainingData.append(x)
    else:
        testData.append(x)
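To see how the threshold in the loop above behaves for groups of different sizes, the rule can be replayed on its own. This is an illustrative sketch; train_count is a hypothetical helper that mirrors the answer's condition and is not part of the original answer.

```python
def train_count(n, fraction=0.7):
    """Return how many of n files with the same name end up in trainingData."""
    ratio = int(fraction * n) + 1
    count = 0
    for seen in range(1, n + 1):
        # Same rule as the answer: the first file always goes to training,
        # later files only while the running count is below the threshold.
        if seen == 1 or seen < ratio:
            count += 1
    return count

for n in (1, 2, 5, 10):
    print(n, train_count(n), n - train_count(n))
```

A group of 10 files is split 7 to 3 and a group of 5 is split 3 to 2, while a group with a single file goes entirely to trainingData, so very small groups deviate from the exact 7:3 ratio.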
