Separating a column in R using Regex & separate (tidyr)

Problem description

This is what I am looking to be able to do.
https://regex101.com/r/KchccA/1

I want to match any characters between = and ), while also allowing for an empty captured group, since I want every field to be populated in each row.

Example of a row: in this example Address4, County, and Contact name are empty. You can also see that some fields have wrong or incorrect values. There is also some leading and trailing text that I need to remove.

x <- "Please enter an UT location before booking the order.. ADDRESS_VALIDATION_FAILED (SITE_TYPE=uct) (SITE_USE_ID=1000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=2000) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=) The task is raised for line_number: 7"

However, in R, when I try to use tidyr's separate function I end up with undesirable results. Am I not escaping it correctly?

Here is my code:

df.sub <- separate(data = main.data, col = Order.Task.Text.CCW, into = c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME","CUSTOMER_NUMBER", "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE", "ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME"), sep = "=([^\\)]+|())\\)")

Example of the result:

   SITE_TYPE    SITE_USE_ID   CUSTOMER_NAME      CUSTOMER_NUMBER       
1  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
2  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
3  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
4  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
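
For reference, tidyr's separate interprets a character sep as a regular expression describing the separators between columns, so whatever the pattern matches is discarded and only the text between matches is kept. That is consistent with the output above. The sketch below reproduces the effect with base R's strsplit as a stand-in for separate, using a simplified pattern and a made-up input (both are assumptions, not the original data), and assumes no value contains a closing parenthesis.

```r
# Made-up input in the same shape as the question's rows (assumption).
x_demo <- "(A=1) (B=) (C=3)"

# Simplified pattern: "=" followed by anything up to the next ")".
pat <- "=[^)]*\\)"

# Splitting ON the pattern throws the values away and keeps the keys.
pieces <- strsplit(x_demo, pat)[[1]]
print(pieces)

# Extracting the matches instead keeps the values, including empty ones.
m <- regmatches(x_demo, gregexpr(pat, x_demo))[[1]]
values <- substr(m, 2, nchar(m) - 1)  # drop the leading "=" and trailing ")"
print(values)
```

In other words, the pattern itself may be fine for matching; the issue is that it is being used to split rather than to extract.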

Final Solution

Here's my final solution for anyone curious, based on the correct answer and formatted for ease of viewing.

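library(gsubfn)    # gsubfn(); proto() comes from its proto dependency
library(magrittr)  # %>%

# Counter-based replacement: drop each outer "(" and turn each outer ")" into a newline.
p <- proto(
  pre = function(.) .$k <- 0,
  fun = function(., x) {
    if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
    if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
  })

df.sub.final <- df.sub$text %>%
  sub("^[^\\(]*\\(", "(", .) %>%   # strip text before the first "("
  sub("\\)[^\\)]*$", ")", .) %>%   # strip text after the last ")"
  gsub("\n", "", .) %>%            # remove embedded newlines
  gsub("=", ": ", .) %>%           # KEY=value -> KEY: value
  gsubfn("([\\(\\)]) *", p, .) %>%
  textConnection %>%
  read.dcf %>%
  as.data.frame(.)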

Recommended answer

There seems to be some uncertainty as to what the valid inputs are. Below are several different answers based on different assumptions. All convert the input to dcf form (i.e. name: value) and then use read.dcf.
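
For readers unfamiliar with the format, dcf is the "name: value" layout that R uses for package DESCRIPTION files, and read.dcf returns a character matrix with one column per field. A minimal sketch with made-up field text (an assumption for illustration, not the original data):

```r
# Three made-up fields in dcf form; ADDRESS4 is deliberately left empty.
dcf_text <- "SITE_TYPE: uct\nADDRESS4: \nZIP: 0000\n"

con <- textConnection(dcf_text)
m <- read.dcf(con)   # character matrix: one row per record, one column per field
close(con)

print(colnames(m))
print(m[1, "ZIP"])
```

Because every value comes back as a character string, values such as 0000 keep their leading zeros.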

Transform to dcf form (i.e. name: value).

We can handle balanced parentheses with gsubfn. First create a proto object whose pre function initializes a counter k to zero. Then, for each match of ( or ), the function fun receives the match, increments or decrements k, and returns the appropriate replacement character. See the gsubfn package vignette for more info.

Now, starting from x, strip the junk at the beginning, replace = with a colon and a space, and then run gsubfn with the proto object we defined, matching ( or ) followed by optional spaces. Finally, read the transformed text using read.dcf.

library(gsubfn)
library(magrittr)

p <- proto(
 pre = function(.) .$k <- 0,
 fun = function(., x) {
  if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
  if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
})

x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsubfn("([\\(\\)]) *", p, .) %>%
  textConnection %>%
  read.dcf
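
To make the counter logic concrete, here is a minimal base R sketch of the same idea without gsubfn or proto. The helper name split_outer and the demo string are hypothetical. Unlike the gsubfn pattern above, it does not consume spaces after a parenthesis, so the fields are trimmed afterwards.

```r
# Hypothetical helper (not part of gsubfn): drop each outer "(" and turn each
# outer ")" into a newline, leaving nested parentheses untouched.
split_outer <- function(s) {
  chars <- strsplit(s, "")[[1]]
  depth <- 0                        # plays the role of the counter k
  out <- character(length(chars))
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "(") {
      depth <- depth + 1
      out[i] <- if (depth == 1) "" else ch
    } else if (ch == ")") {
      depth <- depth - 1
      out[i] <- if (depth == 0) "\n" else ch
    } else {
      out[i] <- ch
    }
  }
  paste(out, collapse = "")
}

demo <- split_outer("(A: x (y) z) (B: )")
fields <- trimws(strsplit(demo, "\n")[[1]])
print(fields)
```

The nested (y) survives because the counter is at 2 when its opening parenthesis is seen and at 1 after its closing one, so neither is treated as a field boundary.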

2) Nested parentheses have no adjacent spaces

x <- "(SITE_TYPE=Site1) (SITE_USE_ID=2000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=11111) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=)"


library(magrittr)

x %>%
  paste0(" ") %>%
  sub("^.*?\\(", "", .) %>%
  gsub(" +\\(", " ", .) %>%
  gsub("=", ": ", .) %>%
  gsub("\\) ", "\n", .) %>%
  textConnection %>%
  read.dcf

giving:

     SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER ADDRESS1 ADDRESS2
[1,] "Site1"   "2000"      "cname"       "11111"         "addy1"  "addy2" 
     ADDRESS3 ADDRESS4 CITY STATE ZIP    COUNTY COUNTRY CONTACT_NAME
[1,] "addy3"  ""       "."  ""    "0000" ""     "NO"    ""     

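3) A fixed keyword follows each outer left parenthesis.

In this case the inner parentheses may be unbalanced, but each outer left parenthesis is always followed by one of the keywords in cn.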

x <- "ADDRESS_VALIDATION_FAILED (SITE_TYPE=site1) (SITE_USE_ID=200) (CUSTOMER_NAME=abc) (CUSTOMER_NUMBER=1000) (ADDRESS1=issue here (some more text) (ADDRESS2=) (ADDRESS3=) (ADDRESS4=) (CITY=city, ) (STATE=na) (ZIP=250) (COUNTY=) (COUNTRY=NA) (CONTACT_NAME=)"

cn <- c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME", "CUSTOMER_NUMBER", 
"ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE", 
"ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME")
rx <- sprintf(".(%s)", paste(cn, collapse = "|"))

x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsub(rx, "\n\\1", .) %>%
  gsub("\\) *\\n", "\n", .) %>%
  sub("\\)$", "", .) %>%
  textConnection %>%
  read.dcf

giving:

     SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER
[1,] "site1"   "200"       "abc"         "1000"         
     ADDRESS1                     ADDRESS2 ADDRESS3 ADDRESS4 CITY    STATE
[1,] "issue here (some more text" ""       ""       ""       "city," "na" 
     ZIP   COUNTY COUNTRY CONTACT_NAME
[1,] "250" ""     "NA"    ""          
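
To see what the keyword pattern does step by step, the following sketch builds the pattern for two keywords only and applies the three substitutions to a short made-up string. The names cn_demo, rx_demo, and s are assumptions for illustration.

```r
cn_demo <- c("CITY", "STATE")
rx_demo <- sprintf(".(%s)", paste(cn_demo, collapse = "|"))
print(rx_demo)

# Made-up input with an unbalanced inner "(" in the first field.
s <- "(CITY: a (b) (STATE: na)"

step1 <- gsub(rx_demo, "\n\\1", s)        # start a new line at each keyword
step2 <- gsub("\\) *\\n", "\n", step1)     # drop the ")" that closes each field
step3 <- sub("\\)$", "", step2)            # drop the final ")"
cat(step3, "\n")
```

Note that the leading . in the pattern matches any single character, so a keyword that happens to appear inside a value would also start a new field. This approach relies on the keywords occurring only right after an outer left parenthesis.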

Note

The input in reproducible form is:
