Reading aligned column data with fread


Question

I came across a file like this:

COL1        COL2          COL3
weqw        asrg          qerhqetjw
weweg       ethweth       rqerhwrtjw
rhqerhqerhq qergqer       qerhqew5h
qerh        qergqer       wetjwryerj

I could not load it directly with fread, so I used sed to replace every run of whitespace (\s+) with a comma, passed the result to fread, and that solved it. But is there a built-in way of reading this kind of data with data.table?

Recommended answer

fread does not (yet) have a way to read fixed-width files (FWF).

I, too, often come across files annoyingly stored like this. Feel free to add a feature request on the data.table GitHub page.

It may not be so in your case, but your sed solution would not work on a lot of the FWF I come across, because there is no space between columns. For example, you'll see strings like 00010 that actually comprise 3 fields.
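To make the 00010 case concrete, here is a small Python sketch. The widths (2, 1, 2) are an assumption chosen only for illustration; the answer does not say how that string is actually divided.

```python
import re

line = "00010"

# Whitespace-based splitting (what the sed approach relies on)
# finds no separator, so the whole string stays one token.
by_whitespace = re.split(r"\s+", line.strip())

# Hypothetical field widths, assumed only for this example.
widths = [2, 1, 2]

# Slice the line at cumulative offsets to recover the three fields.
fields = []
start = 0
for width in widths:
    fields.append(line[start:start + width])
    start += width

print(by_whitespace)  # ['00010']
print(fields)         # ['00', '0', '10']
```

Without knowing the widths in advance, nothing in the line itself reveals where one field ends and the next begins.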

If that's the case, you'll need a field width dictionary, at which point you have several options:


  1. Use read.fwf within R.
  2. Write a fwf->csv program (I use one I wrote in Python and it's pretty fast; I could share the code if you'd like). This is basically a beefed-up version of your initial approach, so that you never have to deal with the FWF again.
  3. Open it in Excel / LibreOffice / etc. There's a native FWF reader that tries (usually poorly) to guess the widths of the columns, which at least does half the work of specifying the column widths for you. Then you can save it as .csv or whatever from there.
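The answerer's own converter is not shown, so the following is only a minimal sketch of what option 2 might look like. The function name `fwf_to_csv` and the `field_widths` dictionary are hypothetical, and the widths assume the sample file's columns start at offsets 0, 12, and 26.

```python
import csv
import io

def fwf_to_csv(lines, widths, out):
    """Slice each fixed-width line by `widths` and write the fields as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\r\n")
        row, start = [], 0
        for width in widths:
            # Trim the padding that aligned the column in the source file.
            row.append(line[start:start + width].strip())
            start += width
        writer.writerow(row)

# Hypothetical width dictionary for the sample file in the question
# (columns assumed to start at offsets 0, 12, and 26).
field_widths = {"COL1": 12, "COL2": 14, "COL3": 10}

sample = io.StringIO(
    "COL1        COL2          COL3\n"
    "weqw        asrg          qerhqetjw\n"
    "rhqerhqerhq qergqer       qerhqew5h\n"
)

result = io.StringIO()
fwf_to_csv(sample, field_widths.values(), result)
print(result.getvalue())
```

For real files, the two `io.StringIO` objects would be replaced by opened input and output files, and the resulting CSV can then be read with fread as usual. Because the fields are cut by position rather than by whitespace, the same function also handles files with no space between columns.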

I personally stick with the second option most often. read.fwf is not optimized the way fread is, so it will probably be slow. And if you've got a lot of FWF to read (say 20+), the third option is pretty tedious.

But I agree it would be nice to have something like this built in to fread.
