Reading big data with fixed width


Problem description


How can I read big data formatted with fixed width? I read this question and tried some tips, but all the answers are for delimited data (such as .csv), which is not my case. The file is 558 MB, and I don't know how many lines it has.

I'm using:

dados <- read.fwf('TS_MATRICULA_RS.txt', width=c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1), stringsAsFactors=FALSE, comment.char='', 
    colClasses=c('integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'character', 'character', 'character',
    'integer', 'integer', 'character', 'integer', 'integer', 'character', 'integer', 'character', 'character', 'character', 'character', 'character', 'character',
    'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character',
    'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'integer',
    'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'integer', 'character', 'integer', 'integer', 'character', 'character', 'character',
    'character', 'integer', 'character', 'character', 'character', 'character', 'character', 'character', 'character', 'character'), buffersize=180000)


But it takes 30 minutes (and counting...) to read the data. Any new suggestions?

Recommended answer


Without enough details about your data, it's hard to give a concrete answer, but here are some ideas to get you started:


First, if you're on a Unix system, you can get some information about your file by using the wc command. For example, wc -l TS_MATRICULA_RS.txt will tell you how many lines there are in your file, and wc -L TS_MATRICULA_RS.txt will report the length of the longest line in your file. This might be useful to know. Similarly, head and tail let you inspect the first and last 10 lines of your text file.
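A quick illustration of those commands on a throwaway fixed-width file (the sample file and its contents are made up for demonstration; note that wc -L is a GNU coreutils extension and is not available in every wc build):

```shell
# Create a small fixed-width sample file to demonstrate the commands
printf '123456\n987654\n' > sample_fixed.txt

# Count the lines in the file
wc -l < sample_fixed.txt        # 2

# Length of the longest line (GNU coreutils only)
wc -L sample_fixed.txt

# Peek at the first and last lines
head -n 1 sample_fixed.txt      # 123456
tail -n 1 sample_fixed.txt      # 987654

rm sample_fixed.txt
```

On a real 558 MB file these run in seconds, so they are a cheap sanity check before committing to a long read in R.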


Second, some suggestions: Since it appears that you know the widths of each field, I would recommend one of two approaches.


csvkit is a set of Python tools for working with CSV files. One of the tools is in2csv, which takes a fixed-width-format file combined with a "schema" file to create a proper CSV that can be used with other programs.


The schema file is, itself, a CSV file with three columns: (1) variable name, (2) start position, and (3) width. An example (from the in2csv man page) is:

    column,start,length
    name,0,30 
    birthday,30,10 
    age,40,3


Once you have created that file, you should be able to use something like:

in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv


From there, I would suggest looking into reading the data with fread from "data.table" or using sqldf.


Using sqldf on a large-ish data file like yours should actually be pretty quick, and you get the benefit of being able to specify exactly what you want to read in using substr.


Again, this will expect that you have a schema file available, like the one described above. Once you have your schema file, you can do the following:

temp <- read.csv("mySchemaFile.csv")

## Construct your "substr" command
GetMe <- paste("select", 
               paste("substr(V1, ", temp$start, ", ",
                     temp$length, ") `", temp$column, "`", 
                     sep = "", collapse = ", "), 
               "from fixed", sep = " ")

## Load "sqldf"
library(sqldf)

## Connect to your file. The sep = "_" setting makes sqldf treat each
## line as a single column (V1), assuming "_" never occurs in the data.
fixed <- file("TS_MATRICULA_RS.txt")
myDF <- sqldf(GetMe, file.format = list(sep = "_"))

Since you know the widths, you might be able to skip the generation of the schema file. From the widths, it's just a little bit of work with cumsum. Here's a basic example, building on the first example from read.fwf:

## Create a two-line sample file (the first example from ?read.fwf)
ff <- tempfile()
cat(file = ff, "123456", "987654", sep = "\n")
read.fwf(ff, widths = c(1, 2, 3))

widths <- c(1, 2, 3)
length <- cumsum(widths)                # end position of each field
start <- length - widths + 1            # start position of each field
column <- paste("V", seq_along(length), sep = "")  # default column names

GetMe <- paste("select", 
               paste("substr(V1, ", start, ", ",
                     widths, ") `", column, "`", 
                     sep = "", collapse = ", "), 
               "from fixed", sep = " ")

library(sqldf)

## Connect to your file
fixed <- file(ff)
myDF <- sqldf(GetMe, file.format = list(sep = "_"))
myDF
unlink(ff)

