Read table by delimiter then by fixed width in R


Question

I have a tab-delimited file like this:

RS1->2001 HAPLO1 AAACAAGGAGGAGAAGGAAA ...
RS1->2001 HAPLO2 CAACAAAGAGGAGAAGGAAA ...
RS1->2002 HAPLO1 AAAAAAGGAGGAAAAGGAAA ...
RS1->20020 HAPLO2 CAACAAGGAGGAAGCAGAGC ...
RS1->20021 HAPLO2 CAACAAGGAGGAAGCAGAGC ...

In R we can easily read in these three columns. My problem is that I need to separate the third column character by character. The end result should look something like this:

RS1->2001 HAPLO1 A A A C  ...
RS1->2001 HAPLO2 C A A C  ...
RS1->2002 HAPLO1 A A A A  ...
RS1->20020 HAPLO2 C A A C  ...
RS1->20021 HAPLO2 C A A C  ...

I can first read the three columns in and then split each entry of the third column into characters, but this is annoying. I would very much prefer to get it right from the start.

If the first two columns did not exist, I could achieve the goal with

read.fwf('test.csv', widths=rep(1, 300))

I am wondering whether I can read in the first two columns using the tab delimiter and then read the third column by fixed width.

Recommended answer

The two main options that come to mind are strsplit (as mentioned in the comments and in @Ricardo's answer) and read.fwf. read.fwf won't work directly on your file, but it can work on a column of data that has already been read in if you wrap that column in textConnection().

Here is a basic example:

## Create a tab-separated file named "test.txt" in your working directory
cat("2001\tHAPLO1\tAAACAAGGAGGAGAAGGAAA\n",
    "2001\tHAPLO2\tCAACAAAGAGGAGAAGGAAA\n",
    "2002\tHAPLO1\tAAAAAAGGAGGAAAAGGAAA\n",
    "20020\tHAPLO2\tCAACAAGGAGGAAGCAGAGC\n",
    "20021\tHAPLO2\tCAACAAGGAGGAAGCAGAGC\n", 
    file = "test.txt")

## Read it in with `read.delim`
mydata <- read.delim("test.txt", header = FALSE, stringsAsFactors = FALSE)

## Use `read.fwf` on the third column
## Replace "widths" with whatever the maximum width is for that column
## If max width is not known, you can use something like
##    `widths = rep(1, max(nchar(mydata$V3)))`
cbind(mydata[-3], 
      read.fwf(file = textConnection(mydata$V3), widths = rep(1, 20)))
#      V1     V2 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1  2001 HAPLO1  A  A  A  C  A  A  G  G  A   G   G   A   G   A   A   G   G   A   A   A
# 2  2001 HAPLO2  C  A  A  C  A  A  A  G  A   G   G   A   G   A   A   G   G   A   A   A
# 3  2002 HAPLO1  A  A  A  A  A  A  G  G  A   G   G   A   A   A   A   G   G   A   A   A
# 4 20020 HAPLO2  C  A  A  C  A  A  G  G  A   G   G   A   A   G   C   A   G   A   G   C
# 5 20021 HAPLO2  C  A  A  C  A  A  G  G  A   G   G   A   A   G   C   A   G   A   G   C

Note: If you did not use stringsAsFactors = FALSE and the third column was read in as a factor (the default in R versions before 4.0.0), you would have to change your file argument to:

file = textConnection(as.character(mydata$V3))
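
For comparison, the strsplit option mentioned at the start of this answer could look roughly like the sketch below. It is not taken from the original answers: it rebuilds the sample data in memory rather than reading "test.txt", the column names C1, C2, ... are made up for illustration, and it assumes every string in the third column has the same length (otherwise the rbind step would recycle values instead of padding).

```r
## Sample data matching the example above, built in memory for illustration
mydata <- data.frame(
  V1 = c(2001, 2001, 2002, 20020, 20021),
  V2 = c("HAPLO1", "HAPLO2", "HAPLO1", "HAPLO2", "HAPLO2"),
  V3 = c("AAACAAGGAGGAGAAGGAAA",
         "CAACAAAGAGGAGAAGGAAA",
         "AAAAAAGGAGGAAAAGGAAA",
         "CAACAAGGAGGAAGCAGAGC",
         "CAACAAGGAGGAAGCAGAGC"),
  stringsAsFactors = FALSE
)

## Split each string into single characters (a list of character vectors)
chars <- strsplit(mydata$V3, split = "")

## Stack the vectors into a matrix, one row per input row
## Assumption: all strings have the same number of characters
charmat <- do.call(rbind, chars)
colnames(charmat) <- paste0("C", seq_len(ncol(charmat)))  # hypothetical names

## Combine the first two columns with the per-character columns
result <- cbind(mydata[-3], as.data.frame(charmat, stringsAsFactors = FALSE))
```

If the strings can differ in length, the read.fwf approach above with widths = rep(1, max(nchar(mydata$V3))) is probably the safer choice.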
