Extracting Numbers Based on the Following Term in a String


Question

I have a batch of data that includes a text variable full of free-form text. I am trying to extract certain information, based on context within the string, into new variables that I can then analyze.

I have been digging into qdap and tm. I have normalized the format with tolower and replace_abbreviation, but I cannot figure out how to actually extract the information I need.

For example:

library(data.table)
data<-data.table(text=c("Person 1: $1000 fine, 31 months jail", 
                     "Person 2: $500 fine, 45 days jail"))


                                   text
1: Person 1: $1000 fine, 31 months jail
2:    Person 2: $500 fine, 45 days jail

What I would like to do is extract each number based on the term that follows it, creating two additional variables, months and days, that hold the corresponding values:

data<-data.table(text=c("Person 1: $1000 fine, 31 months jail", 
                        "Person 2: $500 fine, 45 days jail"), 
                 months=c("31",""), 
                 days=c("","45"))


                                   text months days
1: Person 1: $1000 fine, 31 months jail     31     
2:    Person 2: $500 fine, 45 days jail          45

I have scoured Stack Overflow and have not found an answer to this, so hopefully I did not just miss one. Any help anyone could offer would be very much appreciated. I am still pretty new to text analysis.

Thank you for your time!

Recommended answer

Using stringr::str_extract() with a positive lookahead, you can do something like this:

data <- dplyr::mutate(data,
                      months = stringr::str_extract(text, "\\d+(?=\\smonths)"),
                      days = stringr::str_extract(text, "\\d+(?=\\sdays)"))

##                                   text months days
## 1 Person 1: $1000 fine, 31 months jail     31 <NA>
## 2    Person 2: $500 fine, 45 days jail   <NA>   45
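
To see what the lookahead contributes, here is a minimal sketch that compares a pattern that consumes the unit with one that only checks for it. The variable names (with_unit, digits_only, months_num) are illustrative and not part of the original answer. Because the question mentions analyzing the extracted values, the sketch also converts the result to integer, since str_extract() returns character strings.

```r
library(stringr)

text <- c("Person 1: $1000 fine, 31 months jail",
          "Person 2: $500 fine, 45 days jail")

# Without a lookahead, the unit is part of the returned match.
with_unit <- str_extract(text, "\\d+\\smonths")

# With a positive lookahead, " months" must follow the digits,
# but it is not included in the returned match.
digits_only <- str_extract(text, "\\d+(?=\\smonths)")

# str_extract() returns character; convert before doing arithmetic.
# Rows with no match stay NA.
months_num <- as.integer(digits_only)

print(with_unit)
print(digits_only)
print(months_num)
```

The first pattern would require a second step to strip the unit, which is why the lookahead form is convenient here.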

The regex above makes two assumptions about the text: that exactly one whitespace character separates the number from the unit, and that the unit is always plural. A more flexible version would be:

data<-data.table(text=c("Person 1: $1000 fine, 31 months jail", 
                        "Person 2: $500 fine, 45 days jail",
                        "Person 3: $1000 fine, 1     month 1 day jail"))

# \\s* allows any amount of whitespace; s? makes the plural "s" optional
data <- dplyr::mutate(data,
                      months = stringr::str_extract(text, "\\d+(?=\\s*months?)"),
                      days = stringr::str_extract(text, "\\d+(?=\\s*days?)"))

##                                           text months days
## 1         Person 1: $1000 fine, 31 months jail     31 <NA>
## 2            Person 2: $500 fine, 45 days jail   <NA>   45
## 3 Person 3: $1000 fine, 1     month 1 day jail      1    1
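
If adding stringr and dplyr is not an option, the same lookahead idea can be sketched in base R, since regexpr() supports lookaheads when perl = TRUE. The helper name extract_before_unit is hypothetical, and the trailing \\b is an extra assumption not in the original answer: it keeps a unit such as "day" from matching inside a longer word. The sketch returns integers, with NA where no match is found.

```r
# Hypothetical helper: returns the integer that directly precedes `unit`
# (singular or plural), or NA when the unit does not appear.
extract_before_unit <- function(text, unit) {
  # Lookahead: optional whitespace, the unit, an optional "s", a word boundary.
  pattern <- paste0("\\d+(?=\\s*", unit, "s?\\b)")
  hit <- regexpr(pattern, text, perl = TRUE)  # -1 where there is no match
  out <- rep(NA_integer_, length(text))
  # regmatches() returns only the matched elements, in order.
  out[hit > 0] <- as.integer(regmatches(text, hit))
  out
}

data <- data.frame(
  text = c("Person 1: $1000 fine, 31 months jail",
           "Person 2: $500 fine, 45 days jail",
           "Person 3: $1000 fine, 1     month 1 day jail"),
  stringsAsFactors = FALSE
)

data$months <- extract_before_unit(data$text, "month")
data$days   <- extract_before_unit(data$text, "day")

print(data)
```

Like str_extract(), regexpr() reports only the first match per string, so a text that mentions the same unit twice would yield just the first number.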
