Randomly sampling lines from a file


Problem description

I have a CSV file that is ~40 GB and 1,800,000 lines long.

I want to randomly sample 10,000 lines and print them to a new file.

Right now, my approach is to use sed as:

(sed -n '$vars' < input.txt) > output.txt

Where $vars is a randomly generated list of line addresses (e.g. 1p;14p;1700p;...;10203p).

While this works, it takes about 5 minutes per execution. That is not a huge amount of time, but I was wondering if anybody had ideas on how to make it quicker.

Recommended answer

The biggest advantage of having lines of the same length is that you don't need to find newlines to know where each line starts. With a file size of ~40 GB containing ~1.8M lines, each line is ~20 KB long. If you want to sample 10K lines, there are ~4 MB between sampled lines on average. That is almost certainly around three orders of magnitude larger than the size of a block on your disk. Therefore, seeking to the next read location is much more efficient than reading every byte in the file.
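The arithmetic behind these estimates is easy to check. The sketch below uses the figures from the question; the 4096-byte block size is an assumption (a common filesystem default), not something stated in the question. Dividing ~40 GB by 10,000 samples gives roughly 4 MB between sampled lines.

```python
# Figures from the question; block_size is an assumed, typical value.
file_size = 40 * 10**9      # ~40 GB, in bytes
line_count = 1_800_000      # lines in the file
sample_count = 10_000       # lines to sample
block_size = 4096           # assumed disk block size, in bytes

line_size = file_size / line_count   # average bytes per line
gap = file_size / sample_count       # average bytes between sampled lines

print(f"average bytes per line:        {line_size:,.0f}")
print(f"average bytes between samples: {gap:,.0f}")
print(f"gap measured in disk blocks:   {gap / block_size:,.0f}")
```

With these numbers, each sampled line is separated from the next by on the order of a thousand disk blocks, which is why seeking pays off.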

Seeking will also work with files that have unequal line lengths (e.g., non-ASCII characters in UTF-8 encoding), but it requires minor modifications to the method. If you have unequal lines, you can seek to an estimated location, then scan to the start of the next line. This is still quite efficient because you will be skipping ~4 MB for every ~20 KB you need to read. Your sampling uniformity will be compromised slightly, since you select byte locations instead of line locations, and you won't know for sure which line number you are reading.

You can implement your solution directly with the Python code that generates your line numbers. Here is a sample of how to deal with lines that all have the same number of bytes (usually ASCII encoding):

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)
with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use the built-in `round(file_size / line_size)`
    line_count = file_size // line_size
    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, seek() measures its offset from the start of the
        # file by default, not from the current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        # Small optimization to avoid seeking consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index
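Since the question asks for the sample to end up in a new file, the same idea can be wrapped in a function that writes its output instead of printing. This is a minimal sketch rather than the answer's original code: the name `sample_fixed_length_lines` and its parameters are hypothetical, and it assumes every line (including the last) has exactly the same byte length. It opens both files in binary mode so that byte offsets are unambiguous.

```python
import os
import random


def sample_fixed_length_lines(in_path, out_path, count, seed=None):
    """Copy `count` randomly chosen lines from in_path to out_path.

    Assumes every line, including the last one, has the same length
    in bytes (newline included). Returns the sorted line indices copied.
    """
    rng = random.Random(seed)
    file_size = os.path.getsize(in_path)
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        # The first line gives the fixed line length in bytes
        line_size = len(src.readline())
        if line_size == 0:
            raise ValueError('input file is empty')
        line_count = file_size // line_size
        # Raises ValueError if count exceeds the number of lines
        indices = sorted(rng.sample(range(line_count), count))
        for index in indices:
            # Jump straight to the start of the line and copy it
            src.seek(index * line_size)
            dst.write(src.read(line_size))
    return indices
```

A call such as `sample_fixed_length_lines('file.csv', 'sample.csv', 10000)` would then replace the sed pipeline, provided the equal-length assumption holds for your data.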

If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths. You just generate random byte offsets and skip to the next full line after each offset. The following implementation assumes you know for a fact that no line is longer than 40 KB. You would have to do something like this if your CSV had non-ASCII Unicode characters encoded in UTF-8, because even if the lines all contained the same number of characters, they would contain different numbers of bytes. In this case, you have to open the file in binary mode; otherwise you might run into decoding errors when you skip to a random byte that happens to be mid-character:

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000

file_size = getsize(file_name)
# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)
with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Readout to the next full line
        file.readline()
        # Print the next line. You don't know the number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')

All code here is Python 3.
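The answer leaves `make_offsets` unimplemented. One possible way to write it, offered as a sketch and not as the answerer's intended implementation, is to draw sorted samples from a shrunken range and then push the i-th sample forward by `i * max_line_bytes`. That guarantees the minimum spacing while keeping every offset below `file_size - max_line_bytes`. The optional `rng` parameter is an addition for reproducibility and is not part of the call shown above.

```python
import random


def make_offsets(count, file_size, max_line_bytes, rng=random):
    """Return `count` sorted byte offsets in [0, file_size - max_line_bytes),
    each at least `max_line_bytes` after the previous one."""
    # Leave a margin of one maximum-length line at the end of the file
    span = file_size - max_line_bytes
    # Room left once the mandatory gaps between offsets are reserved
    free = span - (count - 1) * max_line_bytes
    if count < 1 or free < count:
        raise ValueError('file too small for the requested sample')
    base = sorted(rng.sample(range(free), count))
    # Shift the i-th sample by i gaps to enforce the minimum spacing
    return [b + i * max_line_bytes for i, b in enumerate(base)]
```

Because `random.sample` accepts a `range` object, this does not materialize a list as large as the file. Note that the line containing offset 0 can never be selected by the reading loop, since the loop always discards the line an offset lands in and reads the one after it.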
