Regex named groups in R

Question

For all intents and purposes, I am a Python user and use the Pandas library on a daily basis. Named capture groups in regex are extremely useful. For example, it is relatively trivial to extract occurrences of specific words or phrases and to produce concatenated strings of the results in new columns of a dataframe. An example of how this might be achieved is given below:

import numpy as np
import pandas as pd
import re

myDF = pd.DataFrame(['Here is some text',
                     'We all love TEXT',
                     'Where is the TXT or txt textfile',
                     'Words and words',
                     'Just a few works',
                     'See the text',
                     'both words and text'],columns=['origText'])

print("Original dataframe\n------------------")
print(myDF)

# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)

# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)

# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)

myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)

This produces the following output:

Original dataframe
------------------
                           origText
0                 Here is some text
1                  We all love TEXT
2  Where is the TXT or txt textfile
3                   Words and words
4                  Just a few works
5                      See the text
6               both words and text

Dataframe of matches (with multi-index)
--------------------
        textOcc wordOcc
  match                
0 0        text     NaN
1 0        TEXT     NaN
2 0         TXT     NaN
  1         txt     NaN
  2        text     NaN
3 0         NaN    Word
  1         NaN    word
5 0        text     NaN
6 0         NaN    word
  1        text     NaN

Collapsed and concatenated matches
----------------------------------
            textOcc      wordOcc
0              text             
1              TEXT             
2  TXT///txt///text             
3                    Word///word
5              text             
6              text         word

Final joined dataframe
----------------------
                           origText           textOcc      wordOcc
0                 Here is some text              text             
1                  We all love TEXT              TEXT             
2  Where is the TXT or txt textfile  TXT///txt///text             
3                   Words and words                    Word///word
4                  Just a few works               NaN          NaN
5                      See the text              text             
6               both words and text              text         word

I've printed each stage to try to make it easy to follow.
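
For readers less familiar with Pandas, the same per-row result can be sketched with the standard `re` module alone. This is only an illustration of what the named groups contribute, not the code from the question: the helper name `collect_named_groups` is hypothetical, and unlike the Pandas version it returns empty strings (rather than NaN) for a row with no matches.

```python
import re

texts = [
    "Here is some text",
    "We all love TEXT",
    "Where is the TXT or txt textfile",
    "Words and words",
    "Just a few works",
    "See the text",
    "both words and text",
]

# Same pattern as in the question: two alternative named groups.
pattern = re.compile(r"(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)", flags=re.I)


def collect_named_groups(text, compiled, sep="///"):
    """Hypothetical helper: join every occurrence of each named group with `sep`."""
    # groupindex maps each group name to its number; use it to seed the result.
    found = {name: [] for name in compiled.groupindex}
    for match in compiled.finditer(text):
        # groupdict() gives None for the alternative that did not take part.
        for name, value in match.groupdict().items():
            if value is not None:
                found[name].append(value)
    return {name: sep.join(values) for name, values in found.items()}


results = [collect_named_groups(t, pattern) for t in texts]
for text, row in zip(texts, results):
    print(f"{text!r}: {row}")
```

Each dictionary key comes straight from a group name in the regex, which is the behaviour the question hopes to reproduce in R.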

The question is, can I do something similar in R? I've searched the web but can't find anything that describes the use of named groups (although I'm an R newcomer and so might be searching for the wrong libraries or descriptive terms).

I've been able to identify those items that contain one or more matches but I cannot see how to extract specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as in the Python example above) is:

origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text"                "We all love TEXT"                 "Where is the TXT or txt textfile" "Words and words"                 
[5] "See the text"                     "both words and text"             

myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7

The regex seems to be working and the correct rows are identified as containing a match (i.e. all except row 5 in the above example). However, my question is, can I produce an output that is similar to that produced by Python where the specific matches are extracted and listed in new columns in the dataframe that are named using the group names contained in the regex?

Recommended answer

Base R does capture the information about the group names, but it doesn't have a good helper to extract matches by name. I wrote a wrapper called regcapturedmatches to help with this. You can use it with

# origText is the character vector defined in the question
myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"
m <- regexpr(myRegex, origText, perl=TRUE, ignore.case=TRUE)
regcapturedmatches(origText, m)

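which returns the following (regexpr reports only the first match in each string, so row 3 shows just "TXT"):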

     textOcc wordOcc
[1,] "text"  ""     
[2,] "TEXT"  ""     
[3,] "TXT"   ""     
[4,] ""      "Word" 
[5,] ""      ""     
[6,] "text"  ""     
[7,] ""      "word" 
