Modify a pattern to find numbers


Problem description

I have this pattern to extract numbers from strings.

import re

ptns = { 'clean1': re.compile(r'[-&\s]+', re.UNICODE)
        , 'clean2': re.compile(r'\bABCS?(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*', re.UNICODE)
        , 'data' : re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE) }
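As a quick check of what the original data pattern accepts, the sketch below runs it on a few tokens copied from the example dataset further down. Splitting the text into whitespace-separated tokens first is an assumption made for illustration; the question does not show how the patterns are applied.

```python
import re

# Original 'data' pattern from the question: a run of 4-6 digits that starts
# at a word boundary and is followed by an uppercase letter, '/', '_' or the end.
data = re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE)

# Tokens copied from the example dataset (tokenising is an assumption here).
tokens = ['19652', '76192/T85921', '73620T80840', '160879-P15989', 'S22786']

results = {tok: data.findall(tok) for tok in tokens}
for tok, found in results.items():
    print(tok, '->', found)
# 19652 -> ['19652']
# 76192/T85921 -> ['76192']
# 73620T80840 -> ['73620']
# 160879-P15989 -> []      (a hyphen after the digits is not accepted)
# S22786 -> []
```

The hyphenated token is the one the original pattern misses, since '-' is not in the lookahead class.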

I want to add some conditions to the pattern without touching the existing ones. The numbers are always preceded by one of these words: ABCDEFGS, ABCDEFG, ABC, JUSTIF. Sometimes the word has an optional 'S' at the end, as in ABCDEFGS.

I want to extract all numbers that have 4, 5 or 6 digits from this text. The conditions and cases to add to the pattern are:

- Attached to ABC followed by '.' (sometimes there is a single number, sometimes a list of numbers)
- After ABC followed by '.' and a space (sometimes there is a single number, sometimes a list of numbers)
- After ABCDEFGS followed by a space
- After ABCDEFG followed by a space (line 4)
- After JUSTIF followed by '.' and a space
- After ABC but inside parentheses ( ); see the example below.

Example dataset and expected result:

Column                                                                             | New_column
-----------------------------------------------------------------------------------+---------------------------------
Hoy es día  ABCDEFGS 05327 - 05771 - 06045 todas las mañanas                       | [05327, 05771, 06045]
 todas las mañanas ABCDEFG 6661 & ABCDEFG 11440 Se viste                           | [6661, 11440]
escuela ABCDEFG 19652 matemáticas Hoy es día                                       | [19652]
y comienza ABCDEFG 76192/T85921 el camino hacia                                    | [76192]
Marcos se ABCDEFG 13462 S22786 camino                                              | [13462]
encuentra con su ABC. 19390 / 19351 viste, desayuna                                | [19390, 19351]
escuela ABC.5498/5499/5470/5471 DEFINE AND DESIGN IMPROVE                          | [5498, 5499, 5470, 5471]
l camino hacia la ABC.20974 Marcos se                                              | [20974]
todas las mañanas ABC 160879-P15989/ 160878-P20181/160878-P20182 AND 160879-P20183 | [160879, 160878, 160878, 160879]
ABC. 5498/5499/5470/5471 l camino hacia la                                         | [5498, 5499, 5470, 5471]
todas las JUSTIF. 103383/L25469   todas                                            | [103383]
las (ABC 38770) OR CFM56-5B1/3 (ABC 37147)  camino                                 | [38770, 37147]
hacia la (POST ABC 161104)  hacia la                                               | [161104]
DEFINE AND DESIG ABC/KE: 73620T80840 DEFINE                                        | [73620]
 DEFINE AND DESIGN IMPROVE ABC (39729)  IMPROVE                                    | [39729]

Recommended answer

Per your request, I modified the three patterns that are used to clean the data and match the numbers:

In the data pattern, I replaced \b with (?:^|(?<=/)) so that a number must either be at the beginning of the string or be preceded by a slash /.

ptns = { 'clean1': re.compile(r'[/-]\s|\s[/-]|[&\s.():]+|\b(?:AND|OR)\b', re.UNICODE)
       , 'clean2': re.compile(r'\bABCS?[/\s]+KE|\b(?:ABCS?|ABCDEFGS?|JUSTIF|FOR)(?=\s*\d)', re.UNICODE)  
       , 'data'  : re.compile(r'(?:^|(?<=/))(\d{4,6})(?=[A-Z/_-]|$)', re.UNICODE) }   

The patterns:

  • clean1: convert the following patterns into a SPACE

  • [/-]\s|\s[/-]: a slash or hyphen that is followed or preceded by a space

example:  'ABC- 72981' --> 'ABC 72981'
          'ABC 160879-P15989/' <-- no change since no SPACE around hyphen

  • \b(?:AND|OR)\b: allows AND or OR to link a sequence of numbers

    example: '160878-P20181/160878-P20182 AND 160879-P20183' --> '160878-P20181/160878-P20182 160879-P20183'
    

  • [&\s.():]+: the hyphen was removed from this class because it needs to be processed separately; parentheses ( ), dot . and colon : were added

    example:   'ABC. 19390'   --> 'ABC 19390'
               '(ABC 38770)'  --> 'ABC 38770'
               'ABC/KE: 73620T80840' --> 'ABC/KE 73620T80840'
    

  • clean2: convert the following into ABC

    • \bABCS?[/\s]+KE: ABC (or ABCS) followed by spaces or slashes and then KE. This part might be moved to the clean1 pattern if the same rule also applies to JUSTIF, ABCDEFGS?, etc.

    • \b(?:ABCS?|ABCDEFGS?|JUSTIF|FOR)(?=\s*\d): matches ABC, ABCS, ABCDEFG, ABCDEFGS, JUSTIF or FOR when it is followed by optional whitespace and then a digit

  • data: added the hyphen - to the lookahead, so that a matched run of 4-6 digits may also be followed by a hyphen
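To make the three patterns described above more concrete, the following sketch applies them one at a time to a few strings from the dataset. The helper name clean, the order of the two substitutions (clean1 first, then clean2) and the token '160878-20182' are assumptions made for illustration; they are not taken from the answer.

```python
import re

ptns = {
    'clean1': re.compile(r'[/-]\s|\s[/-]|[&\s.():]+|\b(?:AND|OR)\b', re.UNICODE),
    'clean2': re.compile(r'\bABCS?[/\s]+KE|\b(?:ABCS?|ABCDEFGS?|JUSTIF|FOR)(?=\s*\d)', re.UNICODE),
    'data': re.compile(r'(?:^|(?<=/))(\d{4,6})(?=[A-Z/_-]|$)', re.UNICODE),
}

def clean(text):
    """Hypothetical helper: clean1 matches become a space, clean2 matches become 'ABC'."""
    spaced = ptns['clean1'].sub(' ', text)
    return ptns['clean2'].sub('ABC', spaced)

# clean1 + clean2: separators disappear and every keyword is normalised to 'ABC'.
cleaned_ke = clean('ABC/KE: 73620T80840')
cleaned_justif = clean('JUSTIF. 103383/L25469')
cleaned_paren = clean('(ABC 38770)').split()
print(cleaned_ke)        # ABC 73620T80840
print(cleaned_justif)    # ABC 103383/L25469
print(cleaned_paren)     # ['ABC', '38770']

# data: numbers at the start of a token or right after '/', optionally
# followed by a hyphen.
slash_list = ptns['data'].findall('5498/5499/5470/5471')
hyphen_tok = ptns['data'].findall('160879-P15989')
print(slash_list)        # ['5498', '5499', '5470', '5471']
print(hyphen_tok)        # ['160879']

# Made-up token (not in the dataset) comparing the old \b prefix with the
# new one: \b also accepts digits that come right after a hyphen.
old_prefix = re.compile(r'\b(\d{4,6})(?=[A-Z/_-]|$)', re.UNICODE)
with_boundary = old_prefix.findall('160878-20182')
with_new_prefix = ptns['data'].findall('160878-20182')
print(with_boundary)     # ['160878', '20182']
print(with_new_prefix)   # ['160878']
```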

    The rest of the code should be kept as is, see below:

    udf_find_number = udf(lambda x: find_number(x, ptns), ArrayType(StringType()))
    
    df.withColumn('new_column', udf_find_number('column')).show(truncate=False)
    +----------------------------------------------------------------------------------+--------------------------------+
    |Column                                                                            |new_column                      |
    +----------------------------------------------------------------------------------+--------------------------------+
    |Hoy es día  ABCDEFGS 05327 - 05771 - 06045 todas las mañanas                      |[05327, 05771, 06045]           |
    | todas las mañanas ABCDEFG 6661 & ABCDEFG 11440 Se viste                          |[6661, 11440]                   |
    |escuela ABCDEFG 19652 matemáticas Hoy es día                                      |[19652]                         |
    |y comienza ABCDEFG 76192/T85921 el camino hacia                                   |[76192]                         |
    |Marcos se ABCDEFG 13462 S22786 camino                                             |[13462]                         |
    |encuentra con su ABC. 19390 / 19351 viste, desayuna                               |[19390, 19351]                  |
    |escuela ABC.5498/5499/5470/5471 DEFINE AND DESIGN IMPROVE                         |[5498, 5499, 5470, 5471]        |
    |l camino hacia la ABC.20974 Marcos se                                             |[20974]                         |
    |todas las mañanas ABC 160879-P15989/ 160878-P20181/160878-P20182 AND 160879-P20183|[160879, 160878, 160878, 160879]|
    |ABC. 5498/5499/5470/5471 l camino hacia la                                        |[5498, 5499, 5470, 5471]        |
    |todas las JUSTIF. 103383/L25469   todas                                           |[103383]                        |
    |las (ABC 38770) OR CFM56-5B1/3 (ABC 37147)  camino                                |[38770, 37147]                  |
    |hacia la (POST ABC 161104)  hacia la                                              |[161104]                        |
    |DEFINE AND DESIG ABC/KE: 73620T80840 DEFINE                                       |[73620]                         |
    | DEFINE AND DESIGN IMPROVE ABC (39729)  IMPROVE                                   |[39729]                         |
    +----------------------------------------------------------------------------------+--------------------------------+
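The snippet above relies on a find_number function that is not shown in this answer. Purely as an illustration, here is one hypothetical way such a function could be written in plain Python (no Spark): it cleans the text, splits it into tokens, and collects numbers that directly follow the normalised keyword 'ABC'. The body of find_number is an assumption, not the answerer's original code, and it does not handle None input.

```python
import re

ptns = {
    'clean1': re.compile(r'[/-]\s|\s[/-]|[&\s.():]+|\b(?:AND|OR)\b', re.UNICODE),
    'clean2': re.compile(r'\bABCS?[/\s]+KE|\b(?:ABCS?|ABCDEFGS?|JUSTIF|FOR)(?=\s*\d)', re.UNICODE),
    'data': re.compile(r'(?:^|(?<=/))(\d{4,6})(?=[A-Z/_-]|$)', re.UNICODE),
}

def find_number(text, ptns):
    """Hypothetical sketch: return the 4-6 digit numbers that follow a keyword."""
    # Turn separators into spaces, then normalise every keyword to 'ABC'.
    cleaned = ptns['clean2'].sub('ABC', ptns['clean1'].sub(' ', text))
    numbers = []
    collecting = False
    for token in cleaned.split():
        if token == 'ABC':
            collecting = True        # numbers may follow this keyword
        elif collecting:
            found = ptns['data'].findall(token)
            if found:
                numbers.extend(found)
            else:
                collecting = False   # first token without a number ends the run
    return numbers

# Rows copied from the example dataset.
rows = [
    'Marcos se ABCDEFG 13462 S22786 camino',
    'todas las mañanas ABC 160879-P15989/ 160878-P20181/160878-P20182 AND 160879-P20183',
    'las (ABC 38770) OR CFM56-5B1/3 (ABC 37147)  camino',
    'DEFINE AND DESIG ABC/KE: 73620T80840 DEFINE',
]
extracted = [find_number(row, ptns) for row in rows]
for row, nums in zip(rows, extracted):
    print(nums)
# ['13462']
# ['160879', '160878', '160878', '160879']
# ['38770', '37147']
# ['73620']
```

Stopping at the first token that yields no number is what keeps S22786 and CFM56-5B1/3 out of the results in this sketch.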
    

    Let me know if this fixes the problems.
