Python random lines from subfolders


Question

I have many tasks stored in .txt files spread across multiple subfolders. I am trying to pick a total of 10 tasks at random: first from these folders, then from the files they contain, and finally a single line of text within a file. Each selected line should be deleted or marked so that it is not picked in the next execution. This may be too broad a question, but I'd appreciate any input or direction.

Here's the code I have so far:

#!/usr/bin/python  
import random   
with open('C:\\Tasks\\file.txt') as f:  
    lines = random.sample(f.readlines(),10)    
print(lines)

Recommended answer

To get a proper random distribution across all these files, you need to treat them as one big set of lines and pick 10 at random. In other words, you have to read all these files at least once, if only to find out how many lines you have.

You do not need to hold all the lines in memory, however. You can do this in two phases: first index your files to count the number of lines in each, then pick 10 random line numbers and read just those lines from the files.

First, the indexing:

import os

root_path = r'C:\Tasks'
total_lines = 0
file_indices = dict()

# Based on https://stackoverflow.com/q/845058, bufcount function.
# Counts newline characters, so a final line without a trailing
# newline is not counted.
def linecount(filename, buf_size=1024*1024):
    with open(filename) as f:
        return sum(buf.count('\n') for buf in iter(lambda: f.read(buf_size), ''))

# Map the global index of each file's first line to that file's path.
for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
        if not filename.endswith('.txt'):
            continue
        path = os.path.join(dirpath, filename)
        file_indices[total_lines] = path
        total_lines += linecount(path)

# Sorted starting offsets, used for bisecting later.
offsets = sorted(file_indices)

Now we have a mapping of offsets to filenames, plus a total line count. Next we pick ten random indices and read the matching lines from your files:

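import random
import bisect

tasks = list(range(total_lines))
# Never ask for more tasks than remain; random.sample() raises ValueError otherwise.
task_indices = random.sample(tasks, min(10, len(tasks)))

for index in task_indices:
    # find the file whose starting offset is the largest one <= index
    file_index = offsets[bisect.bisect(offsets, index) - 1]
    path = file_indices[file_index]
    curr_line = file_index
    with open(path) as f:
        # read forward until we reach the wanted line
        while curr_line <= index:
            task = f.readline()
            curr_line += 1
    print(task)
    # drop the index so it cannot be sampled again
    tasks.remove(index)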
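
To make the bisect lookup concrete, here is a small worked example. The three file names and their line counts are made up for illustration, and locate() is a hypothetical helper, not part of the code above:

```python
import bisect

# Hypothetical index: three files holding 3, 4 and 2 lines respectively.
file_indices = {0: 'a.txt', 3: 'b.txt', 7: 'c.txt'}
offsets = sorted(file_indices)

def locate(index):
    """Return (path, line number within that file) for a global line index."""
    # bisect.bisect returns the insertion point to the right of any equal
    # entry, so subtracting 1 gives the last offset that is <= index.
    start = offsets[bisect.bisect(offsets, index) - 1]
    return file_indices[start], index - start

print(locate(0))  # ('a.txt', 0)
print(locate(5))  # ('b.txt', 2)
print(locate(7))  # ('c.txt', 0)
```

Global index 5 falls between offsets 3 and 7, so it belongs to the file starting at offset 3 and is that file's third line (position 2, counting from zero).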

Note that you only need to build the index once; you can store the result somewhere and rebuild it only when your files change.
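
One possible way to store the index is a small JSON file. This is only a sketch: the function names save_index and load_index and the index_path argument are assumptions made for illustration, and any other storage format would work as well.

```python
import json

def save_index(file_indices, total_lines, index_path):
    """Write the offset -> path mapping and the line total to a JSON file."""
    data = {
        'total_lines': total_lines,
        # JSON object keys must be strings, so store [offset, path] pairs instead.
        'files': sorted(file_indices.items()),
    }
    with open(index_path, 'w', encoding='utf-8') as f:
        json.dump(data, f)

def load_index(index_path):
    """Read back (file_indices, total_lines) as written by save_index."""
    with open(index_path, encoding='utf-8') as f:
        data = json.load(f)
    file_indices = {offset: path for offset, path in data['files']}
    return file_indices, data['total_lines']
```

After loading, the sorted offsets can be rebuilt with sorted(file_indices), so the files only need to be rescanned when they change.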

Also note that your tasks are now 'stored' in the tasks list. Its entries are indices of lines in your files, and I remove an index from that list when printing the selected task. The next time you call random.sample() on the list, the tasks picked previously are no longer available. This structure needs updating if your files ever change, because the indices have to be recalculated. The file_indices mapping will help you with that task, but that is outside the scope of this answer. :-)
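
The tasks list only lives as long as the script does. To carry the "already picked" state over to the next execution, as the question asks, the used indices could be written to disk. The sketch below assumes a JSON state file; pick_tasks and state_path are hypothetical names, and it assumes the files have not changed between runs.

```python
import json
import os
import random

def pick_tasks(total_lines, state_path, count=10):
    """Pick up to `count` unused global line indices and record them as used."""
    used = set()
    if os.path.exists(state_path):
        with open(state_path, encoding='utf-8') as f:
            used = set(json.load(f))
    # Only indices that were never picked before remain eligible.
    remaining = [i for i in range(total_lines) if i not in used]
    picked = random.sample(remaining, min(count, len(remaining)))
    used.update(picked)
    with open(state_path, 'w', encoding='utf-8') as f:
        json.dump(sorted(used), f)
    return picked
```

Each call returns fresh indices until every line has been used, after which it returns an empty list.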

If you need only one 10-item sample, use Blckknght's solution instead, as it goes through the files only once, while mine requires 10 extra file openings. If you need multiple samples, this solution requires only 10 extra file openings each time you need a sample; it does not scan through all the files again. If you have fewer than 10 files, still use Blckknght's answer. :-)
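
Blckknght's answer is not reproduced here. For reference, one common way to draw a uniform sample in a single pass over the files is reservoir sampling; the sketch below illustrates that technique and is not necessarily the code from that answer. The name sample_lines is hypothetical.

```python
import os
import random

def sample_lines(root_path, count=10):
    """Return up to `count` lines chosen uniformly from all .txt files under root_path."""
    sample = []
    seen = 0
    for dirpath, dirnames, filenames in os.walk(root_path):
        for filename in filenames:
            if not filename.endswith('.txt'):
                continue
            with open(os.path.join(dirpath, filename)) as f:
                for line in f:
                    seen += 1
                    if len(sample) < count:
                        # Fill the reservoir with the first `count` lines.
                        sample.append(line)
                    else:
                        # Keep the new line with probability count / seen.
                        slot = random.randrange(seen)
                        if slot < count:
                            sample[slot] = line
    return sample
```

This holds only `count` lines in memory at any time, but it has to read every file again for each new sample.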
