R stringr RegExp strategy for grouping like expressions without prior knowledge


Question

I've got a list of 50K+ part numbers that I need to group by Product Type. Part numbers of the same type are typically near each other in sequence, although they're not perfectly sequential. The product descriptions within a type are always similar, but they don't follow any consistent rules. Let me illustrate with the following table.

| PartNo | Description | ProductType |
|--------|-------------|-------------|
|A000443 |Water Bottle |    Water    |
|A000445 |Contain Water|    Water    |
|A000448 |WaterBotHold |    Water    |
|HRZ55   |Hershey_Bar  | Energy Bar  |
|RRB55   |Candy Energy | Energy Bar  |
|QMU55   |Bar Protein  | Energy Bar  |

I do not know the Product Types beforehand. The stringr regular expression has to be smart enough to generate a product type from the part description. I'm a rookie just making my way through R for Data Science, and this seems achievable, although difficult.

How would you even start on this problem? What I'm actually working with is shown below. The expectation is that my stringr code will populate the ProductType column.

| PartNo | Description | ProductType |
|--------|-------------|-------------|
|A000443 |Water Bottle |             |
|A000445 |Contain Water|             |
|A000448 |WaterBotHold |             |
|HRZ55   |Hershey_Bar  |             |
|RRB55   |Candy Energy |             |
|QMU55   |Bar Protein  |             |

Here's a reproducible example to get the ball rolling.

    library(tidyverse)
    library(stringr)
    df <- tribble(
      ~PartNo, ~Description, ~ProductType,
      "A000443", "Water Bottle", "",
      "A000445", "Contain Water", "",
      "A000448", "WaterBotHold", "",
      "HRZ55", "Hershey_Bar", "",
      "RRB55", "Candy Energy", "",
      "QMU55", "Bar Protein", ""
    )

Answer

You can try stringr::str_extract. Given a pattern that joins several candidate words with |, it returns the first of those words that it finds in each string.
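As a minimal sketch of the alternation idea, assume the candidate words are already known (here `water` and `bar`, picked by hand purely for illustration). The snippet builds the `|`-separated pattern and extracts the first candidate found in each lower-cased description. The names `descriptions`, `candidates`, `pattern`, and `product_type` are made up for this example.

```r
library(stringr)

descriptions <- c("Water Bottle", "Contain Water", "WaterBotHold",
                  "Hershey_Bar", "Candy Energy", "Bar Protein")

# Hand-picked candidate words (an assumption for this sketch)
candidates <- c("water", "bar")
pattern <- str_c(candidates, collapse = "|")

# str_extract returns the first match in each string, or NA if there is none
product_type <- str_extract(str_to_lower(descriptions), pattern)
product_type
```

Descriptions that contain none of the candidates, such as "Candy Energy", come back as NA.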

Update:

The OP pointed out that the words to use as ProductType are not known in advance and should be chosen according to how often each word occurs in the Description column.

One option is to use the qdap package to count word frequencies and take the top n (say, 2) words as the product types. The solution looks like this:

    library(stringr)
    library(qdap)

    # df is the data frame defined in the Data section below

    # Find the frequencies of the different words
    freq <- freq_terms(df$Description)

    # Select the top n words (2 here) and join them into a regex pattern
    word_to_search <- paste0(freq$WORD[1:2], collapse = "|")

    # Extract the first matching word from each lower-cased description
    df$ProductType <- str_extract(tolower(df$Description), word_to_search)
    df
    #    PartNo   Description ProductType
    # 1 A000443  Water Bottle       water
    # 2 A000445 Contain Water       water
    # 3 A000448  WaterBotHold       water
    # 4   HRZ55   Hershey_Bar         bar
    # 5   RRB55  Candy Energy        <NA>    # Didn't match water or bar
    # 6   QMU55   Bar Protein         bar
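If installing qdap is not an option, the same frequency idea can be approximated with stringr and base R alone. The sketch below is not the answer's method. It splits the lower-cased descriptions on non-letters (so `Hershey_Bar` yields `hershey` and `bar`), counts the tokens with `table()`, and uses the `top_n` most frequent ones as the pattern. The tokenization rule and the names `tokens`, `word_counts`, `top_n`, and `top_words` are assumptions of this example, and the way ties at the cutoff are broken should not be relied on.

```r
library(stringr)

df <- data.frame(
  PartNo = c("A000443", "A000445", "A000448", "HRZ55", "RRB55", "QMU55"),
  Description = c("Water Bottle", "Contain Water", "WaterBotHold",
                  "Hershey_Bar", "Candy Energy", "Bar Protein"),
  stringsAsFactors = FALSE
)

# Split lower-cased descriptions on runs of non-letters and drop empty tokens
tokens <- unlist(str_split(str_to_lower(df$Description), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Count each token and keep the top_n most frequent ones
word_counts <- sort(table(tokens), decreasing = TRUE)
top_n <- 2
top_words <- names(word_counts)[seq_len(min(top_n, length(word_counts)))]

# Use the frequent words as an alternation pattern
df$ProductType <- str_extract(str_to_lower(df$Description),
                              str_c(top_words, collapse = "|"))
df
```

On the six sample rows, `water` and `bar` each occur twice and every other token once, so they end up as the two top words. Raising `top_n` covers more rows, but among words that occur equally often the selection becomes arbitrary.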

Data:

    df <- read.table(text =
    "PartNo  Description
    A000443 'Water Bottle'
    A000445 'Contain Water'
    A000448 WaterBotHold
    HRZ55   Hershey_Bar
    RRB55   'Candy Energy'
    QMU55   'Bar Protein'",
    stringsAsFactors = FALSE, header = TRUE)
