Extracting specific data from text column in R

Problem description

I have a data set of medicine names in a single column. I am trying to extract the name, strength and unit of each medicine from this data. The terms MG and ML mark the strength. For example, consider the following medicine names.

 Medicine name
----------------------
 FALCAN 150 MG tab
 AUGMENTIN 500MG tab
 PRE-13 0.5 ML PFS inj
 NS.9%w/v 250 ML, Glass Bottle

I want to extract the following columns from this data set:

Name     | Strength |Unit
---------| ---------|------
FALCAN   | 150      |MG
AUGMENTIN| 500      |MG
PRE-13   | 0.5      |ML
NS.9%w/v | 250      |ML

I have tried grepl and similar commands but could not find a good solution. I have more than 12,000 entries to process. The data does not follow a fixed pattern, and in a few places there is no space between the strength and MG, as in 300MG.

Recommended answer

You can achieve this with multiple regular expressions. Although I am not a regex champion, I use them for the same purpose as you describe here.

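meds <- c('FALCAN 150 MG tab',
          'AUGMENTIN 500MG tab',
          'PRE-13 0.5 ML PFS inj',
          'NS.9%w/v 250 ML, Glass Bottle')

library(stringr)

# Name: match up to a space followed by three digit/dot characters,
# then keep the part before that space and trim it
trimws(str_extract(str_extract(meds, '.* [0-9.]{3}'), '.* '))

# Strength: match three digit/dot characters followed by MG or ML
# (with or without a space), then keep only the number
str_extract(str_extract(meds, '[0-9.]{3}( M|M)[GL]'), '[0-9.]*')

# Unit: match MG or ML preceded by a space or a digit,
# then keep only the unit itself
str_extract(str_extract(meds, '( M|[0-9]M)[GL]'), 'M[GL]')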
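
The patterns above assume the strength is written with exactly three digit or dot characters (`{3}`), which fits the sample but would likely miss or truncate values such as 75 MG or 1000MG. As a rough sketch, not part of the original answer, a single pattern with capture groups and a variable-length number could look like the following. The `EXAMPLE 75 MG tab` entry and the names `pattern`, `m` and `parsed` are made up for illustration.

```r
library(stringr)

meds <- c('FALCAN 150 MG tab',
          'AUGMENTIN 500MG tab',
          'PRE-13 0.5 ML PFS inj',
          'NS.9%w/v 250 ML, Glass Bottle',
          'EXAMPLE 75 MG tab')   # hypothetical entry with a two-digit strength

# One pattern with three capture groups:
#   (.*?)      name, matched as short as possible
#   ([0-9.]+)  strength, any number of digits or dots
#   (M[GL])    unit, MG or ML, with optional whitespace before it
pattern <- '^(.*?)\\s+([0-9.]+)\\s*(M[GL])\\b'

# str_match returns a matrix: column 1 is the full match,
# columns 2 to 4 are the capture groups
m <- str_match(meds, pattern)

parsed <- data.frame(name = m[, 2],
                     strength = as.numeric(m[, 3]),
                     unit = m[, 4],
                     stringsAsFactors = FALSE)
parsed
```

Strings that do not fit the pattern come back as NA in all three columns, so they can still be picked out and handled separately.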

I know that these medicine notations can vary a lot, so I prefer to extract each item with its own regular expression, in contrast to the solution presented by G. Grothendieck, which expects a certain structure in the data (3 columns). That way I can tweak the pattern for each item by inspecting all the strings that generate NA values.
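
The NA inspection step might look roughly like this. It is a minimal sketch that reuses the strength and unit patterns from the answer. The third entry is a hypothetical string that those patterns do not cover, and `unmatched` is an illustrative name.

```r
library(stringr)

meds <- c('FALCAN 150 MG tab',
          'AUGMENTIN 500MG tab',
          'VITAMIN C 1 G sachet')   # hypothetical entry with an unsupported unit

# Same extraction patterns as in the answer
strength <- str_extract(str_extract(meds, '[0-9.]{3}( M|M)[GL]'), '[0-9.]*')
unit     <- str_extract(str_extract(meds, '( M|[0-9]M)[GL]'), 'M[GL]')

# Collect the strings for which either extraction returned NA,
# so the patterns can be adjusted to cover them
unmatched <- meds[is.na(strength) | is.na(unit)]
unmatched
```

Repeating this after each adjustment to a pattern shows which of the remaining entries still need attention.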
