Using re.search multiple times inside a for loop to extract different field values in Python


Question

I want to retrieve all percentage values, as well as integers/floats followed by a unit, from an input text whenever they are present. If only one of the two kinds is present, I want to retrieve at least that one. So far, if the extracted text contains an integer/float with a unit, it ends up in the result variable.

import re

result = []
newregex = r"[0-9\.\s]+(?:mg|kg|ml|q.s.|ui|M|g|µg)"
percentregex = r"(\d+(\.\d+)?%)"
for s in zz:
    for e in extracteddata:
        v = re.search(newregex, e, flags=re.IGNORECASE | re.MULTILINE)
        # xx is computed here but not used yet
        xx = re.search(percentregex, e, flags=re.IGNORECASE | re.MULTILINE)
        if v:
            if e.upper().startswith(s.upper()):
                result.append([s, v.group(0), e])
        else:
            if e.upper().startswith(s.upper()):
                result.append([s, e])

In the code above, newregex matches an integer or float followed by a unit, percentregex matches percentage values, and zz and extracteddata are as follows:

zz = ['HYDROCHLORIC ACID 2M', 'ROPIVACAINE HYDROCHLORIDE MONOHYDRATE', 'SODIUM CHLORIDE', 'SODIUM HYDROXIDE 2M', 'WATER FOR INJECTIONS']

extracteddata = ['Ropivacaine hydrochloride monohydrate for injection (corresponding to 2 mg Ropivacaine hydrochloride anhydrous) 2.12 mg Active ingredient Ph Eur ', 'Sodium chloride for injection 8.6 mg 28% Tonicity contributor Ph Eur ', 'Sodium hydroxide 2M q.s. pH-regulator Ph Eur, NF Hydrochloric acid 2M q.s. pH-regulator Ph Eur, NF ', 'Water for Injections to 1 ml 34% Solvent Ph Eur, USP The product is filled into polypropylene bags sealed with rubber stoppers and aluminium caps with flip-off seals. The primary container is enclosed in a blister. 1(1)']

Now I also want to add percentage data to the result variable when it is present, but I am stuck on the looping logic. I would like help using the variable 'xx' to add the percentage to the result list, if present, alongside the integer/float with its unit.

Any help on this is appreciated.

Update on my attempt:

import re

result = []
mg = []
newregex = r"[0-9\.\s]+(?:mg|kg|ml|q.s.|ui|M|g|µg)"
percentregex = r"(\d+(\.\d+)?%)"
print(type(newregex))
for s in zz:
    for e in extracteddata:
        v = re.search(newregex, e, flags=re.IGNORECASE | re.MULTILINE)
        xx = re.search(percentregex, e, flags=re.IGNORECASE | re.MULTILINE)
        if v:
            # mg.append(v.group(0))
            if e.upper().startswith(s.upper()):
                result.append([s, v.group(0), e])
        elif v is None:
            if e.upper().startswith(s.upper()):
                result.append([s, e])
        # Every branch below is unreachable: v is either a match object
        # (truthy) or None, so one of the two branches above always runs.
        elif xx:
            if v:
                if e.upper().startswith(s.upper()):
                    result.append([s, v.group(0), xx.group(0), e])
        elif v is None:
            if xx:
                if e.upper().startswith(s.upper()):
                    result.append([s, xx.group(0), e])
        elif v is None and xx is None:
            if e.upper().startswith(s.upper()):
                result.append([s, e])
        else:
            print("DOne")

Recommended answer

Here is a Python demo of what we discussed in the comments:

Modified per request

>>> import re
>>> 
>>> extracteddata = ['"Water 5.5 ml for injections 0.80 and 100 at 2.2 % ','Injections 100 and 0.80', 'Ropivacaine hydrochloride monohydrate for injection (corresponding to 2 mg Ropivacaine hydrochloride anhydrous) 2.12 mg Active ingredient Ph Eur ', 'Sodium chloride for injection 8.6 mg 28% Tonicity contributor Ph Eur ', 'Sodium hydroxide 2M q.s. pH-regulator Ph Eur, NF Hydrochloric acid 2M q.s. pH-regulator Ph Eur, NF ', 'Water for Injections to 1 ml 34% Solvent Ph Eur, USP The product is filled into polypropylene bags sealed with rubber stoppers and aluminium caps with flip-off seals. The primary container is enclosed in a blister. 1(1)']
>>> 
>>> Rx = r"(?i)(?=.*?((?:\d+(?:\.\d*)?|\.\d+)\s*(?:mg|kg|ml|q\.s\.|ui|M|g|µg)))?(?=.*?(\d+(?:\.\d+)?\s*%))?(?=.*?((?:\d+(?:\.\d*)?|\.\d+))(?![\d.])(?!\s*(?:%|mg|kg|ml|q\.s\.|ui|M|g|µg)))?.+"
>>> 
>>> for e in extracteddata:
...         match = re.search( Rx, e )
...         print("--------------------------------------------")
...         if match.group(1):
...                 print( "Unit num:  \t\t", match.group(1) )
...         if match.group(2):
...                 print( "Percentage num:  \t", match.group(2) )
...         if match.group(3):
...                 print( "Just a num:  \t\t", match.group(3) )
... 
--------------------------------------------
Unit num:                5.5 ml
Percentage num:          2.2 %
Just a num:              0.80
--------------------------------------------
Just a num:              100
--------------------------------------------
Unit num:                2 mg
--------------------------------------------
Unit num:                8.6 mg
Percentage num:          28%
--------------------------------------------
Unit num:                2M
--------------------------------------------
Unit num:                1 ml
Percentage num:          34%
Just a num:              1

Here is the regex, expanded:

 (?i)
 (?=
      .*? 
      (                             # (1 start)
           (?:
                \d+ 
                (?: \. \d* )?
             |  \. \d+ 
           )
           \s* 
           (?: mg | kg | ml | q \. s \. | ui | M | g | µg )
      )                             # (1 end)
 )?
 (?=
      .*? 
      (                             # (2 start)
           \d+ 
           (?: \. \d+ )?
           \s* %
      )                             # (2 end)
 )?
 (?=
      .*? 
      (                             # (3 start)
           (?:
                \d+ 
                (?: \. \d* )?
             |  \. \d+ 
           )
      )                             # (3 end)
      (?! [\d.] )
      (?!
           \s* 
           (?: % | mg | kg | ml | q \. s \. | ui | M | g | µg )
      )
 )?
 .+ 
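The expanded layout above can also be handed to Python directly. The following is a minimal sketch, assuming the pattern is compiled with re.VERBOSE (which ignores whitespace in the pattern and allows # comments). The name RX_VERBOSE is made up for illustration, and passing the flags as arguments rather than using the inline (?i) is a choice made here, not part of the original answer.

```python
import re

# Same pattern as the one-line version, laid out with re.VERBOSE.
# Case-insensitivity is requested via re.IGNORECASE instead of inline (?i).
RX_VERBOSE = re.compile(
    r"""
    (?= .*? (                                   # group 1: number with a unit
            (?: \d+ (?: \. \d* )? | \. \d+ )
            \s*
            (?: mg | kg | ml | q\.s\. | ui | M | g | µg )
    ) )?
    (?= .*? (                                   # group 2: percentage
            \d+ (?: \. \d+ )? \s* %
    ) )?
    (?= .*? (                                   # group 3: standalone number
            (?: \d+ (?: \. \d* )? | \. \d+ )
        )
        (?! [\d.] )
        (?! \s* (?: % | mg | kg | ml | q\.s\. | ui | M | g | µg ) )
    )?
    .+
    """,
    re.IGNORECASE | re.VERBOSE,
)

line = "Sodium chloride for injection 8.6 mg 28% Tonicity contributor Ph Eur "
m = RX_VERBOSE.search(line)
print(m.groups())  # ('8.6 mg', '28%', None)
```

Keeping the pattern in this form lets the comments live next to the groups they describe, which may make later changes to the unit list easier to review.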

As shown, it uses three optional lookahead assertions to find the first instance of a number with a unit, of a percentage, and of a standalone number.
The three captured values are distinct and do not overlap.
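Because the lookaheads capture only the first instance of each kind, a line with several unit values or several percentages yields just one of each. If every occurrence is wanted, one possible approach is to run the two sub-patterns separately with re.findall. This is a sketch, not the answer's method: the names UNIT_RX and PERCENT_RX and the sample text are made up for illustration, while the sub-patterns are copied from the regex above.

```python
import re

# Sub-patterns taken from groups 1 and 2 of the answer's regex.
UNIT_RX = re.compile(
    r"(?:\d+(?:\.\d*)?|\.\d+)\s*(?:mg|kg|ml|q\.s\.|ui|M|g|µg)",
    re.IGNORECASE,
)
PERCENT_RX = re.compile(r"\d+(?:\.\d+)?\s*%")

# Hypothetical sample line with two unit values and two percentages.
text = "Water 5.5 ml for injections 0.80 and 100 at 2.2 % plus 3 mg at 10%"

# The patterns contain no capturing groups, so findall returns whole matches.
units = UNIT_RX.findall(text)
percents = PERCENT_RX.findall(text)

print(units)     # ['5.5 ml', '3 mg']
print(percents)  # ['2.2 %', '10%']
```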

Testing each group for a non-empty value shows whether that item was found in the line.
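Applied to the loop from the question, the same "test each result independently" idea might look like the sketch below. It assumes two separate re.search calls instead of the combined regex, and the helper name build_rows is hypothetical. The key change from the question's attempt is that the unit and the percentage are checked with two independent if statements rather than an if/elif chain, so finding one never hides the other.

```python
import re

UNIT_RX = re.compile(
    r"(?:\d+(?:\.\d*)?|\.\d+)\s*(?:mg|kg|ml|q\.s\.|ui|M|g|µg)",
    re.IGNORECASE,
)
PERCENT_RX = re.compile(r"\d+(?:\.\d+)?\s*%")


def build_rows(names, lines):
    """Return [name, unit value?, percentage?, line] for each line that starts with a name."""
    rows = []
    for name in names:
        for line in lines:
            if not line.upper().startswith(name.upper()):
                continue
            row = [name]
            unit = UNIT_RX.search(line)
            percent = PERCENT_RX.search(line)
            # Independent checks: either, both, or neither may be present.
            if unit:
                row.append(unit.group(0))
            if percent:
                row.append(percent.group(0))
            row.append(line)
            rows.append(row)
    return rows


# Shortened versions of the question's data, for illustration only.
zz = ["SODIUM CHLORIDE", "WATER FOR INJECTIONS"]
extracteddata = [
    "Sodium chloride for injection 8.6 mg 28% Tonicity contributor Ph Eur ",
    "Water for Injections to 1 ml 34% Solvent Ph Eur, USP",
]

rows = build_rows(zz, extracteddata)
for row in rows:
    print(row[:-1])
# ['SODIUM CHLORIDE', '8.6 mg', '28%']
# ['WATER FOR INJECTIONS', '1 ml', '34%']
```

Rows built this way vary in length, as in the question. If downstream code needs to tell a unit value from a percentage reliably, a dict with fixed keys may be a more convenient row type.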
