Python random N lines from large file (no duplicate lines)


Problem description

I need to use Python to take N lines from a large txt file. These files are basically tab-delimited tables. My task has the following constraints:

  • The files may contain header lines (some files have multi-line headers).
  • The headers need to appear in the output in the same order.
  • Each line may only be used once.
  • The largest file at the moment is about 150GB (roughly 600 million lines).
  • The lines within a file are roughly the same length, but this may differ between files.
  • I will usually pick about 5000 random lines (I may need up to 1,000,000 lines).

Currently I have written the following code:

import os
import random

# `options` and `args` are assumed to come from the script's
# command-line option parsing (not shown in the question).

inputSize=os.path.getsize(options.input)
usedPositions=[] #Start positions of the lines already in output

with open(options.input) as input:
    with open(options.output, 'w') as output:

        #Handling of header lines
        for i in range(int(options.header)):
            output.write(input.readline())
            usedPositions.append(input.tell())

        # Find and write all random lines
        for j in range(int(args[0])):
            input.seek(random.randrange(inputSize)) # Seek to random position in file (probably middle of line)
            input.readline() # Read the line (probably incomplete). Next input.readline() results in a complete line.
            while input.tell() in usedPositions: # Take a new line if current one is taken
                input.seek(random.randrange(inputSize))
                input.readline()
            usedPositions.append(input.tell()) # Add line start position to usedPositions
            randomLine=input.readline() # Complete line
            if len(randomLine) == 0: # Take first line if end of the file is reached
                input.seek(0)
                for i in range(int(options.header)): # Exclude headers
                    input.readline()
                randomLine=input.readline()
            output.write(randomLine)

This code seems to work as intended.

I am aware that this code prefers lines that follow the longest lines in the input, because seek() is most likely to land on a long line, and the next line is then written to the output. This does not matter here, as the lines in the input file are roughly the same length. I am also aware that this code results in an infinite loop if N is larger than the number of lines in the input file. I will not implement a check for this, as getting the line count takes a lot of time.
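This bias is easy to see on a small in-memory example (hypothetical toy data, purely for illustration): when one line is much longer than the others, the line that follows it is picked far more often by the seek-and-discard approach.

import io
import random
from collections import Counter

# Toy "file": one 900-character line followed by ten short lines.
data = ("x" * 900 + "\n") + "".join("line%d\n" % i for i in range(10))
f = io.StringIO(data)
counts = Counter()

for _ in range(10000):
    f.seek(random.randrange(len(data)))  # land on a random character
    f.readline()                         # discard the (probably partial) line
    counts[f.readline().strip()] += 1    # count the next complete line

# "line0" (the line after the long line) dominates; an empty string shows
# up for seeks that land inside the last line (end of file reached).
print(counts.most_common(3))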

RAM and HDD limitations are irrelevant. I am only concerned about the speed of the program. Is there a way to further optimize this code? Or perhaps there is a better approach?

EDIT: To clarify, the lines within one file have roughly the same length. However, I have multiple files that this script needs to run on, and the average line length will differ between these files. For example, file A may have ~100 characters per line and file B ~50,000 characters per line. I do not know the average line length of any file beforehand.
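Since the seek-based answer below sizes its read chunk at a few times the typical line length, and that length is not known in advance, it can be estimated cheaply from a handful of random probes instead of a full scan. A minimal sketch; the helper name and the probe count are illustrative assumptions:

import os
import random

def estimate_line_length(path, probes=10, chunk_size=65536):
    # Read a few chunks at random offsets and derive the average line
    # length from the number of newlines seen in them.
    size = os.path.getsize(path)
    total_bytes = 0
    total_newlines = 0
    with open(path, "rb") as f:
        for _ in range(probes):
            f.seek(random.randrange(max(size - chunk_size, 1)))
            chunk = f.read(chunk_size)
            total_bytes += len(chunk)
            total_newlines += chunk.count(b"\n")
    return total_bytes // max(total_newlines, 1)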

Recommended answer

There is only one way to avoid a sequential read of the whole file up to the last line you are sampling - I am surprised that none of the answers so far has mentioned it:

You have to seek to an arbitrary location inside the file and read some bytes; if lines have a typical length, as you said, 3 or 4 times that value should do it. Then split the chunk you read on the newline character ("\n") and pick the second element - that is a complete line at a random position.

Also, in order to seek into the file consistently, it should be opened in "binary read" mode, so the conversion of end-of-line markers has to be taken care of manually.

This technique can't give you the line number that was read, so you keep the offset of each selected line in the file to avoid repetition:

#! /usr/bin/python
# coding: utf-8

import random, os


CHUNK_SIZE = 1000  # should be a few times the typical line length
PATH = "/var/log/cron"

def pick_next_random_line(file, offset):
    # Read a chunk starting at a random offset and return the first
    # complete line that starts after that offset, plus its start offset.
    file.seek(offset)
    chunk = file.read(CHUNK_SIZE)
    lines = chunk.split(b"\n")
    # Make some provision in case you had not read at least one full line here
    line_offset = offset + chunk.find(b"\n") + 1
    return line_offset, lines[1]

def get_n_random_lines(path, n=5):
    length = os.stat(path).st_size
    results = []
    result_offsets = set()
    # Open in binary mode so that seek offsets are consistent and no
    # end-of-line translation takes place.
    with open(path, "rb") as input:
        for x in range(n):
            while True:
                offset, line = pick_next_random_line(input, random.randint(0, length - CHUNK_SIZE))
                if offset not in result_offsets:
                    result_offsets.add(offset)
                    results.append(line)
                    break
    return results

if __name__ == "__main__":
    print(get_n_random_lines(PATH))
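The question also asks for the header lines to be copied to the output in their original order, which the snippet above does not handle. A minimal sketch of how the two could be combined (header_count and the paths are assumptions; note that a random seek can still land just before a header line, which this sketch does not guard against):

def copy_headers_and_sample(in_path, out_path, header_count, n):
    # Copy the header lines first, then append n randomly picked lines.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for _ in range(header_count):
            dst.write(src.readline())
        for line in get_n_random_lines(in_path, n):
            dst.write(line + b"\n")

# Hypothetical usage:
# copy_headers_and_sample("input.tsv", "sample.tsv", header_count=1, n=5000)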

