Separating a column using separate (tidyr) via dplyr on the first encountered digit
Problem description
I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
Desired results
The desired results should look like this:
indicator period values
1 someindicator 2001 0.2655087
2 someindicator 2011 0.3721239
3 some text 20022008 0.5728534
4 another indicator 2003 0.9082078
Characteristics
- Indicator descriptions are in one column.
- Numeric values, starting from the first digit, are in the second column.
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
separate(col = indicator, into = c("indicator", "period"),
sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally, this doesn't work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
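One way to see why the result looks like this: `separate()` treats whatever `sep` matches as the delimiter and drops it. The sketch below uses base R rather than tidyr, and passes `perl = TRUE` (an assumption made here so that `\\d` is reliably read as a digit class; the original call does not specify it) to show the text the pattern consumes.

```r
x <- c("someindicator2001", "someindicator2011",
       "some text 20022008", "another indicator 2003")
pattern <- "^[^\\d]*(2+)"

# Text matched by the pattern, i.e. what would be treated as the separator
consumed <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# What is left once the matched text is removed
remainder <- sub(pattern, "", x, perl = TRUE)

print(data.frame(x, consumed, remainder))
```

The pattern swallows the whole description together with the leading 2, which leaves an empty indicator and a period missing its first digit, as in the output above.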
Other attempts
- I have also tried the default separation method sep = "[^[:alnum:]]", but it breaks the column down into too many columns, as it appears to be matching all of the available digits.
- sep = "2*" also doesn't work, as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
- Identifying the first digit in the string
- Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
Recommended answer
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
- (?<=[a-z]) is a positive lookbehind: it asserts that [a-z] (a single character in the range a to z, case sensitive) can be matched immediately before the current position.
- " ?" matches the space character literally, between zero and one times, as many times as possible, giving back as needed.
- (?=[0-9]) is a positive lookahead: it asserts that [0-9] (a single character in the range 0 to 9) can be matched immediately after the current position.
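To see what the combined pattern matches, here is a minimal base R sketch. It uses `regexpr()` with `perl = TRUE` purely as an illustration and makes no claim about how `separate()` is implemented internally. It reports where the split falls in each string and how many characters the separator consumes.

```r
x <- c("someindicator2001", "someindicator2011",
       "some text 20022008", "another indicator 2003")

m <- regexpr("(?<=[a-z]) ?(?=[0-9])", x, perl = TRUE)
pos <- as.integer(m)            # start of the match (the split point)
len <- attr(m, "match.length")  # 0 when no space is present, 1 when a space is consumed

# Cut each string around the matched (possibly empty) separator
indicator <- substr(x, 1, pos - 1)
period <- substr(x, pos + len, nchar(x))

print(data.frame(indicator, period, separator_width = len))
```

Because both lookarounds are zero-width, only the optional space is ever removed, so the first digit stays in the period column.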