Separating column using separate (tidyr) via dplyr on a first encountered digit

Views: 186. Published: 2017/7/13 20:38:07. Tags: regex, r, string, dplyr, tidyr


Problem description

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:

set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
                              "some text 20022008", "another indicator 2003"),
                  values = runif(n = 4))

Desired results

The desired results should look like this:

          indicator   period    values
1     someindicator     2001 0.2655087
2     someindicator     2011 0.3721239
3         some text 20022008 0.5728534
4 another indicator     2003 0.9082078

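Characteristics

Indicator descriptions are in one column.
Numeric values (counting from the first digit, with the first digit included) are in the second column.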

Code

require(dplyr); require(tidyr); require(magrittr)
dta %<>%
  separate(col = indicator, into = c("indicator", "period"),
           sep = "^[^\\d]*(2+)", remove = TRUE)

This, of course, doesn't work:

> head(dta, 2)
  indicator period    values
1              001 0.2655087
2              011 0.3721239
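To see why this attempt loses the description, it can help to look at what the sep pattern actually matches, because separate() throws the matched separator away. The following is a minimal base R sketch, not what separate() runs internally: the vector x is a hand-made copy of the indicator column, and perl = TRUE is an assumption made here so that \\d is understood inside the character class.

```r
# Copy of the indicator column from the question
x <- c("someindicator2001", "someindicator2011",
       "some text 20022008", "another indicator 2003")

# The pattern used as `sep` in the failed attempt
pattern <- "^[^\\d]*(2+)"

# What the pattern matches: the whole description plus the leading 2(s)
consumed <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# What is left once the match is dropped, i.e. what ends up in `period`
remaining <- sub(pattern, "", x, perl = TRUE)

print(data.frame(consumed, remaining))
```

The separator swallows the description together with the first 2, so the indicator column comes out empty and the period loses its first digit.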

Other attempts

I have also tried the default separation method sep = "[^[:alnum:]]", but it breaks the column down into too many columns, as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work, as there are too many 2s at times (example: 20032006).

The task boils down to:

Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.
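The two steps listed above can be sketched directly in base R, without separate(). This is only an illustration under stated assumptions: x is a hand-made copy of the indicator column, the names first_digit, description and period are made up for this example, and every string is assumed to contain at least one digit (regexpr() returns -1 otherwise, which this sketch does not handle).

```r
# Copy of the indicator column from the question
x <- c("someindicator2001", "someindicator2011",
       "some text 20022008", "another indicator 2003")

# Step 1: position of the first digit in each string
first_digit <- as.integer(regexpr("[0-9]", x))

# Step 2: split at that position, keeping the digit on the period side
description <- trimws(substr(x, 1, first_digit - 1))
period      <- substr(x, first_digit, nchar(x))

print(data.frame(indicator = description, period = period))
```

trimws() removes the space that is left at the end of descriptions such as "some text ".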

Recommended answer

I think this might do it.

library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
#           indicator   period    values
# 1     someindicator     2001 0.2655087
# 2     someindicator     2011 0.3721239
# 3         some text 20022008 0.5728534
# 4 another indicator     2003 0.9082078

Here is the explanation of the regex, brought to you by regex101.

(?<=[a-z]) is a positive lookbehind: it asserts that [a-z] (a single character in the range between a and z, case sensitive) matches immediately before the current position
" ?" matches a literal space character between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead: it asserts that [0-9] (a single character in the range between 0 and 9) matches immediately after the current position
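To make the split point visible, the same pattern can be tried in base R by replacing the match with a marker. This is a hedged sketch rather than a reproduction of what separate() does internally: x is a hand-made copy of the indicator column, the "|" marker is arbitrary, and perl = TRUE is assumed here because lookbehind and lookahead need a Perl-compatible engine in base R.

```r
# Copy of the indicator column from the question
x <- c("someindicator2001", "someindicator2011",
       "some text 20022008", "another indicator 2003")

# Pattern from the answer: a lowercase letter behind, an optional space, a digit ahead
pattern <- "(?<=[a-z]) ?(?=[0-9])"

# Replace the first match with "|" to show where the column would be cut
marked <- sub(pattern, "|", x, perl = TRUE)
print(marked)
```

Because the lookarounds consume no characters, only the optional space is removed; the letter before the cut and the digit after it are both preserved.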
