How to process and save data in chunks using awk?

Problem description

Let's say I'm opening a large (several GB) file that I cannot read into memory all at once.

If it's a CSV file, we would use:

import pandas as pd

for chunk in pd.read_csv('path/filename', chunksize=10**7):
    pass  # save chunk to disk

Or we could read the file line by line and accumulate the rows with pandas:

import pandas as pd

with open(fn) as file:  # fn is the path to the input file
    for line in file:
        # save line to disk, e.g. df = pd.concat([df, line_data]), then save the df
        pass

How does one "chunk" data with an awk script? Awk will parse/process text into a format you desire, but I don't know how to "chunk" with awk. One can write a script script1.awk and then process your data, but this processes the entire file at once.

Related question, with a more concrete example: How to preprocess and load a "big data" tsv file into a python dataframe?

Solution

awk reads a single record (chunk) at a time by design. By default a record is a line of data, but you can change what counts as a record by setting the RS (record separator) variable. Each code block is conditionally executed on the current record before the next one is read:

$ awk '/pattern/{print "MATCHED", $0 > "output"}' file

The above script reads one line at a time from the input file; if that line matches pattern, it writes the line, prefixed with MATCHED, to the file output before reading the next line.
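
To illustrate the RS variable mentioned above, here is a minimal sketch in which a record is a blank-line-separated block rather than a single line. The file names sample.txt and records.txt and the sample data are assumptions made up for this illustration; they do not come from the original question.

```shell
# Hypothetical sample data: two blocks separated by a blank line.
printf 'a 1\nb 2\n\nc 3\nd 4\n' > sample.txt

# RS = "" selects paragraph mode: each blank-line-separated block is one record,
# and newlines inside the block also act as field separators.
awk 'BEGIN { RS = "" } { print "record " NR " has " NF " fields" }' sample.txt > records.txt

cat records.txt
```

With this input, awk sees two records of four fields each, even though the file has five lines. Only one record is held in memory at a time, so the same approach should carry over to much larger files.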

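If "chunk" is taken to mean a fixed number of lines saved to its own file, the same record-at-a-time model can be used to split the input as it is read. The following is a sketch under that assumption, not the answer author's method; the variable names size, n and out, the file name input.txt, and the chunk_N.txt naming scheme are all hypothetical.

```shell
# Hypothetical sample input with 5 lines.
printf '1\n2\n3\n4\n5\n' > input.txt

# Write every `size` input lines to a new file: chunk_1.txt, chunk_2.txt, ...
awk -v size=2 '
    (NR - 1) % size == 0 {          # first line of a new chunk
        if (out != "") close(out)   # close the previous chunk file
        out = "chunk_" (++n) ".txt"
    }
    { print > out }                 # write the current line to the open chunk
' input.txt
```

Here chunk_1.txt receives lines 1 and 2, chunk_2.txt receives lines 3 and 4, and chunk_3.txt receives line 5. Calling close() on each finished chunk keeps the number of open files at one, which matters when a multi-GB input produces many chunks.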