Select x files at random in subdirectories


Problem description

I need to take exactly 10 files (images) at random from a dataset, but the dataset is hierarchically structured.

So, for each subdirectory that contains images, I need to keep just 10 of them, chosen at random. Is there an easy way to do that, or should I do it manually?

def getListOfFiles(dirName):
    ### create a list of file and sub directories 
    ### names in the given directory 
    listOfFile = os.listdir(dirName)
    allFiles = list()
    ### Iterate over all the entries
    for entry in listOfFile:

        ### Create full path
        fullPath = os.path.join(dirName, entry)
        ### If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(random.sample(fullPath, 10))
    return allFiles

dirName = 'C:/Users/bla/bla'

### Get the list of all files in directory tree at given path
listOfFiles = getListOfFiles(dirName)

with open("elements.txt", mode='x') as f:
    for elem in listOfFiles:
        f.write(elem + '\n')


Recommended answer

A good way to sample from a directory listing of unknown size is Reservoir Sampling. With this approach, you don't have to list all the files in the directory up front: you read the entries one by one and sample as you go. It even works when you have to sample a fixed number of files across multiple directories.

It would be good to use generator-based directory scanning code, which yields one file at a time, so you don't use gobs of memory up front to hold all the file names.
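As a rough illustration of that idea, the sketch below pairs a lazy directory walker with a generic reservoir sampler. The names `iter_files` and `reservoir_sample` are hypothetical helpers introduced here, not part of the original answer, and the code assumes Python 3.6 or later (for `os.scandir` used as a context manager).

```python
import os
import random


def iter_files(root):
    """Lazily yield the path of every regular file under root (hypothetical helper)."""
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Recurse into subdirectories without building a full list
                yield from iter_files(entry.path)
            elif entry.is_file():
                yield entry.path


def reservoir_sample(items, n):
    """Return up to n items chosen uniformly from an iterable of unknown length."""
    reservoir = []
    for k, item in enumerate(items):
        if k < n:
            # The first n items fill the reservoir
            reservoir.append(item)
        else:
            # Item k (0-based) replaces a random slot with probability n / (k + 1)
            idx = random.randrange(k + 1)
            if idx < n:
                reservoir[idx] = item
    return reservoir
```

Calling `reservoir_sample(iter_files(some_root), 10)` would then draw 10 files from the whole tree while holding only 10 paths in memory at a time. Note that this samples across all directories together, not 10 per subdirectory.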

Something along these lines (NB: untested code!):

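import os

import numpy as np

def ResSampleFiles(dirname, N):
    """Pick up to N files from a directory using reservoir sampling."""

    sampled_files = list()
    k = 0  # number of files seen so far
    for item in os.scandir(dirname):
        if item.is_dir():
            continue  # skip subdirectories; only files are sampled
        full_path = os.path.join(dirname, item.name)
        if k < N:
            # Fill the reservoir with the first N files
            sampled_files.append(full_path)
        else:
            # Keep this file with probability N / (k + 1)
            idx = np.random.randint(0, k+1)  # uniform integer in [0, k]
            if idx < N:
                sampled_files[idx] = full_path
        k += 1

    return sampled_files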
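The question asks for 10 images in each subdirectory, so here is a minimal sketch of how that could look. It swaps in `random.sample` instead of reservoir sampling, because `os.walk` already hands back the complete list of file names for each directory. The function name `sample_per_directory` and the `IMAGE_EXTENSIONS` set are assumptions made for illustration; adjust the extensions to match the actual dataset.

```python
import os
import random

# Assumed set of image extensions; not specified in the question
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def sample_per_directory(root, n=10, extensions=IMAGE_EXTENSIONS):
    """Map each directory under root that contains images to at most n of them."""
    picked = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        # Keep only files whose extension looks like an image
        images = [name for name in filenames
                  if os.path.splitext(name)[1].lower() in extensions]
        if images:
            # min() avoids a ValueError when a directory has fewer than n images
            chosen = random.sample(images, min(n, len(images)))
            picked[dirpath] = [os.path.join(dirpath, name) for name in chosen]
    return picked
```

The result is a dictionary keyed by directory path, so the selected paths can be written to a file such as `elements.txt` by iterating over its values. For directories too large to list comfortably, the reservoir approach is the better fit.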
