Text file reduction with randomization in Python


Problem description

I solved the following problem in bash, but given the size of the files I need to reduce, I feel it's quite inefficient and very slow. I was hoping somebody has an idea of how to do the same in Python and could hopefully speed things up.

The original problem was to reduce very large text files (50-60 million lines, tab-delimited columns). One of the columns is treated as a key, i.e. we determine how many lines with a unique key are in the file and then randomly select a percentage of them (for example, a quarter of the total number if reducing by 75%) to append to a new file that will hold our results. We continue through the rest of the keys, randomizing and then reducing all lines containing each unique key by the same percentage. If the reduction can't be done, we simply carry all of that key's lines over to the resulting file.

As I said, my bash script works quite well, but it is slow and strings together various awk and grep constructs. By all accounts, Python should deal with this in a much more elegant way without compromising memory too much (again, we are dealing with 50+ million-line files in this case). Any suggestions/tricks would be helpful! Thanks!

Recommended answer

The simple solution would be to sort the file by the key column, e.g., sort tab-separated input by the second column:

#!/bin/bash
printf "a\tz\nb\ty\nc\tx" | sort -k 2 -t $'\t'

And then solve a simpler problem: retrieve 25% of lines at random for each unique key, where all lines with equal keys are adjacent, with the constraint that at least one line per unique key should be preserved:

#!/usr/bin/env python
import random
import sys
from itertools import chain, groupby

def choose_random(iterator, fraction, random=random.random):
    """Lazy analog of:

        L = list(iterator)
        k = int(len(L) * fraction + .5) or 1  # keep at least one
        result = random.sample(L, k)

    Note: this function doesn't randomize the order of elements;
          that would require keeping the selected elements in memory.
          The number of output elements is close to, but not exactly, k.
    """
    # remember the first item so we can fall back on it:
    # at least one item is yielded if the input is not empty
    item = next(iterator)
    it = (x for x in chain([item], iterator) if random() < fraction)
    for x in chain([next(it, item)], it):
        yield x

def getkey(line):
    return line.split("\t")[1]  # 2nd column

# input must be sorted by key so that equal keys are adjacent
for key, group in groupby(sys.stdin, key=getkey):
    sys.stdout.writelines(choose_random(group, fraction=0.25))
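Run against a small, pre-sorted in-memory sample (the functions are repeated here so the snippet stands alone), the guarantee that every key survives can be seen directly:

```python
import random
from itertools import chain, groupby

def choose_random(iterator, fraction, random=random.random):
    # yield each element with probability `fraction`,
    # but always yield at least one element from a non-empty input
    item = next(iterator)
    it = (x for x in chain([item], iterator) if random() < fraction)
    for x in chain([next(it, item)], it):
        yield x

# toy input: three keys in the 2nd column, already sorted by that column
lines = sorted(["a\t1\n"] * 8 + ["b\t2\n"] * 4 + ["c\t3\n"],
               key=lambda line: line.split("\t")[1])
out = []
for key, group in groupby(lines, key=lambda line: line.split("\t")[1]):
    out.extend(choose_random(group, fraction=0.25))

# every key survives, even "3", which occurs on a single line
print(sorted({line.split("\t")[1].strip() for line in out}))  # → ['1', '2', '3']
```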

Note: the last line in the input file should end with a newline; otherwise the output is corrupted if that line is chosen.
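If the input can't be guaranteed to end with a newline, one defensive option (my addition, not part of the original answer) is to normalize lines before writing them:

```python
def ensure_newline(lines):
    # append a newline to any line that lacks one (only the last line can)
    for line in lines:
        yield line if line.endswith("\n") else line + "\n"

# usage inside the loop would become, e.g.:
#   sys.stdout.writelines(ensure_newline(choose_random(group, fraction=0.25)))
print(list(ensure_newline(["a\t1\n", "b\t2"])))  # → ['a\t1\n', 'b\t2\n']
```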

The script accepts input sorted by the key column on stdin and prints the reduced output to stdout. It needs to hold only one line in memory at a time and makes a single pass over the input (O(n)).
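If an exact 25% per key (and shuffled order within each key) matters more than strict laziness, the eager `random.sample` analog from the docstring can be applied per group; this buffers one key's group at a time rather than the whole file. A sketch, with `reduce_exact` being my name for it:

```python
import random
from itertools import groupby

def getkey(line):
    return line.split("\t")[1]  # 2nd column

def reduce_exact(lines, fraction=0.25):
    # buffer one key-group at a time; exact k = round(n * fraction), min 1
    for key, group in groupby(lines, key=getkey):
        L = list(group)
        k = int(len(L) * fraction + .5) or 1
        for x in random.sample(L, k):
            yield x

# 12 lines, 4 per key: 25% of 4 is exactly 1 line kept per key
lines = sorted(["x\tk%d\n" % (i % 3) for i in range(12)], key=getkey)
print(len(list(reduce_exact(lines))))  # → 3
```

The trade-off is memory proportional to the largest group instead of a single line, which is usually acceptable unless one key dominates the file.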
