Reading aligned column data with fread

Question

I came across a file like this:

COL1        COL2          COL3
weqw        asrg          qerhqetjw
weweg       ethweth       rqerhwrtjw
rhqerhqerhq qergqer       qerhqew5h
qerh        qergqer       wetjwryerj

I could not load it directly with fread, so I used sed to replace \s+ with a comma and then passed the result to fread, which solved it. But is there a built-in way of reading this kind of data with data.table?
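
For reference, the whitespace-to-comma step described above can be sketched in Python. This is only an illustration of the idea, not the sed command the question used. The function name whitespace_to_csv is hypothetical, and the sketch assumes that no field contains spaces and that no field is empty.

```python
import csv
import io


def whitespace_to_csv(text: str) -> str:
    """Convert whitespace-aligned columns to CSV text.

    Assumption: fields never contain spaces and are never empty,
    so splitting on runs of whitespace recovers the columns.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()  # split on any run of whitespace
        if fields:  # skip blank lines
            writer.writerow(fields)
    return out.getvalue()


sample = (
    "COL1        COL2          COL3\n"
    "weqw        asrg          qerhqetjw\n"
    "weweg       ethweth       rqerhwrtjw\n"
)
print(whitespace_to_csv(sample))
```

The resulting comma-separated text can then be handed to fread as usual.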

Recommended answer

fread does not (yet) have any capabilities for reading fixed-width files.

I, too, often come across files annoyingly stored like this. Feel free to add a feature request on the GitHub page.

It may not be so in your case, but your solution with sed would not work on a lot of the fixed-width files (FWF) I come across, because there is no space between columns; e.g. you'll see strings like 00010 that actually comprise 3 fields.

If that's the case, you'll need a field width dictionary, at which point you have several options:

  1. read.fwf within R
  2. Write a fwf->csv program (I use one I wrote in Python and it's pretty fast, could share the code if you'd like)--basically the beefed up version of your initial approach, so that you never have to deal with the FWF again
  3. Open it in Excel / LibreOffice / etc; there's a native FWF reader that tries (usually poorly) to guess the widths of the columns, which at least does half the work of specifying the column widths for you. Then you can save it as .csv or whatever from there.
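
As a rough idea of what the second option can look like, here is a minimal Python sketch of a fwf->csv converter driven by a list of field widths. It is not the answerer's actual program. The function name fwf_to_csv, the widths [1, 3, 1], and the column names a, b, c are hypothetical and chosen only to show how a string like 00010 can be cut into 3 fields.

```python
import csv
import io


def fwf_to_csv(lines, widths, names=None):
    """Slice each line by fixed field widths and return CSV text.

    `widths` plays the role of the field width dictionary:
    one integer per column, in file order.
    """
    # Turn widths into (start, end) offsets,
    # e.g. [1, 3, 1] -> [(0, 1), (1, 4), (4, 5)]
    bounds = []
    start = 0
    for width in widths:
        bounds.append((start, start + width))
        start += width

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if names:
        writer.writerow(names)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Strip padding spaces from each slice
        writer.writerow([line[a:b].strip() for a, b in bounds])
    return out.getvalue()


# Hypothetical layout: three fields of widths 1, 3 and 1
print(fwf_to_csv(["00010\n", "12345\n"], widths=[1, 3, 1], names=["a", "b", "c"]))
```

Writing the output to a .csv file once means fread can read it directly from then on, which is the point of this option.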

I personally stick with the second option most often. read.fwf is not optimized like fread so it will probably be slow. And if you've got a lot (say 20+) of FWF to read, the 3rd option is pretty tedious.

But I agree it would be nice to have something like this built into fread.
