Split character columns and get the field names from the string
Question
I need to split a column that contains information into several columns. I would use tstrsplit, but the same kind of information does not appear in the same order across rows, and the names of the new columns have to be extracted from the values themselves. Important: there can be many pieces of information (fields that become new variables) and I don't know all of them in advance, so I don't want a "field by field" solution.
Below is an example of what I have:
library(data.table)
myDT <- structure(list(chr = c("chr1", "chr2", "chr4"), pos = c(123L,
435L, 120L), info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2"
)), class = c("data.table", "data.frame"), row.names = c(NA,-3L))
# chr pos info
#1: chr1 123 type=3;end=4
#2: chr2 435 end=6
#3: chr4 120 end=5;pos=TRUE;type=2
I would like to get:
# chr pos end pos type
#1: chr1 123 4 <NA> 3
#2: chr2 435 6 <NA> <NA>
#3: chr4 120 5 TRUE 2
The most straightforward way to get there would be much appreciated! (Note: I would rather not use a dplyr/tidyr approach.)
Recommended answer
Using a regex and the stringi package:
setDT(myDT) # After creating data.table from structure()
library(stringi)
fields <- unique(unlist(stri_extract_all(regex = "[a-z]+(?==)", myDT$info)))
patterns <- sprintf("(?<=%s=)[^;]+", fields)
myDT[, (fields) := lapply(patterns, function(x) stri_extract(regex = x, info))]
myDT[, !"info"]
chr pos type end
1: chr1 <NA> 3 4
2: chr2 <NA> <NA> 6
3: chr4 TRUE 2 5
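Note that in the output above the original numeric pos column has been overwritten, because info also contains a field called pos. The following is a minimal sketch of one way to keep both columns. It assumes that prefixing clashing names with info_ is acceptable; the prefix and the new_names variable are illustrative choices, not part of the original answer.

```r
library(data.table)
library(stringi)

# Same example data as in the question, built directly as a data.table
myDT <- data.table(
  chr  = c("chr1", "chr2", "chr4"),
  pos  = c(123L, 435L, 120L),
  info = c("type=3;end=4", "end=6", "end=5;pos=TRUE;type=2")
)

# Field names found in `info`, in order of first appearance
fields   <- unique(unlist(stri_extract_all(regex = "[a-z]+(?==)", myDT$info)))
# One look-behind pattern per field, capturing everything up to the next ";"
patterns <- sprintf("(?<=%s=)[^;]+", fields)

# Hypothetical naming rule: prefix any field name that already exists in myDT
new_names <- ifelse(fields %in% names(myDT), paste0("info_", fields), fields)

myDT[, (new_names) := lapply(patterns, function(x) stri_extract(regex = x, info))]
myDT[, !"info"]
```

With this naming rule the value extracted from info ends up in info_pos, and the original pos column is left untouched.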
To get the correct column types, it seems that type.convert() can be used:
myDT[, (fields) := lapply(patterns, function(x) type.convert(stri_extract(regex = x, info), as.is = TRUE))]
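To see what type.convert() does with the extracted strings, here is a small base R sketch. The three character vectors are typed by hand to mimic what the extraction returns for the example data; the variable names are illustrative only.

```r
# Character vectors as they would come out of the extraction step
end_chr  <- c("4", "6", "5")
pos_chr  <- c(NA, NA, "TRUE")
type_chr <- c("3", NA, "2")

# as.is = TRUE keeps non-numeric strings as character instead of factor
end_val  <- type.convert(end_chr,  as.is = TRUE)
pos_val  <- type.convert(pos_chr,  as.is = TRUE)
type_val <- type.convert(type_chr, as.is = TRUE)

class(end_val)   # "integer"
class(pos_val)   # "logical"
class(type_val)  # "integer"
```

Each column is converted independently, so a field whose values mix numbers and text would stay character.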