如何从文本中找到数字 [英] How to find number from text

查看:67
本文介绍了如何从文本中找到数字的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是我的数据框中 pyspark 列(字符串)的一个小示例.

This is a small example of a pyspark column (String) in my dataframe.

 column                                                                                            |   new_column
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy es día de ABC/KE98789T983456 clase.                                                           |    98789
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Como ABC/KE 34562Z845673 todas las mañanas                                                        |    34562
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Hoy tiene ABC/KE 110330/L63868 clase de matemáticas,                                              |    110330
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC 898456/L56784 levanta con sueño.                                                    |    898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
Marcos se ABC898456 levanta con sueño.                                                            |    898456
------------------------------------------------------------------------------------------------- |--------------------------------------------------
comienza ABC - KE 60014 -T60058                                                                   |    60014
------------------------------------------------------------------------------------------------- |--------------------------------------------------
inglés y FOR 102658/L61144 ciencia. Se viste, desayuna                                            |    102658
------------------------------------------------------------------------------------------------- |--------------------------------------------------
y comienza FOR ABC- 72981 / KE T79581: el camino hacia la                                         |    72981
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su        |    [101665,103035,101926,105484,103036,103247]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
escuela ABCS 206048/206049/206050/206051/205225-FG-matemáticas-                                   |    [206048,206049,206050,206051,205225]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
encuentra ABCS 111553/L00847 & 111558/L00895 - matemáticas                                        |    [111553, 111558]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna              |    [163278,164567,164568 ]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T                 |    [71729,71781,71782,71730,71783,71784]
------------------------------------------------------------------------------------------------- |--------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593                     |    [2646]
-----------------------------------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345                              |    [6048,206049,6050,206051,205225]
-----------------------------------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS)           |   [109038]
-----------------------------------------------------------------------------------------------------------------------------------------------------
y comienza FOR ABC- 2981 / KE T79581: el camino hacia la 9856                                     |   [2981]

我想从该文本中提取包含以下内容的所有数字:4、5 或 6 位数字.提取它们的条件和案例:

I want to extract all numbers that contain: 4, 5 or 6 digits from this text. Condition and cases to extract them:

- Attached to ABC/KE (first line in the example above).
- after ABC/KE + space (second and third line).
- after ABC + space (line 4)
- after ABC without space (line 5)
- after ABC - KE + space
- after for word
- after ABC- + space
- after ABC + space
- after ABCS (line 10 and 11)

失败案例:

                                    Column                                                    |   new_column
------------------------------------------------------------------------------------------------------------------------
FOR ABC/KE 109038_L35674_DEFINE AND DESIGN IMPROVEMENTS OF 1618 FROM 118(PDS4 BRACKETS)           |   [1618]   ==> should be [109038]
------------------------------------------------------------------------------------------------------------------------
ciencia ABC/KE2646/L61175:E/F-levanta con sueño L61/62LAV AT Z5CTR/XC D3-1593                     |   [1593]  ==> should be [2646]
------------------------------------------------------------------------------------------------------------------------
escuela ABCS 6048/206049/6050/206051/205225-FG-matemáticas- MSN 2345                              |    [6048,206049,6050,206051,205225, 2345]  ==> should be [6048,206049,6050,206051,205225]

我希望我恢复了案例,你可以看到我上面的例子和期望的输出.我该怎么做 ?谢谢

I hope that I resumed the cases, you can see my example above and the expect output. How can I do it ? Thank you

推荐答案

一种使用正则表达式清理数据并设置一个值为 ABC 的单独锚点的方法来识别潜在的开始比赛.在 str.split() 之后,遍历结果数组以标记并检索此锚点后面的连续匹配数字.

One way using regexes to clean out the data and set up a lone anchor with value of ABC to identify the start of a potential match. after str.split(), iterate through the resulting array to flag and retrieve consecutive matching numbers that follow this anchor.

在数据模式中添加了下划线 _ (\b(\d{4,6})(?=[AZ/_]|$)) 以便它现在允许下划线作为锚点跟随匹配的 4-6 位子字符串.这修复了第一行、第 2 行和第 3 行应该使用现有的正则表达式模式.

Added underscore _ into the data pattern (\b(\d{4,6})(?=[A-Z/_]|$)) so that it now allows underscore as an anchor to follow the matched substring of 4-6 digit. this fixed the first line, line 2 and 3 should be working with the existing regex patterns.

import re
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

(1) 使用正则表达式模式清理原始数据,这样我们就只有一个锚ABC 来识别潜在匹配的开始:

(1) Use regex patterns to clean out the raw data so that we have only one anchor ABC to identify the start of a potential match:

  • clean1:使用 [-&\s]+ 将 '&'、'-' 和空格转换为 SPACE '',它们用于连接一串数字

  • clean1: use [-&\s]+ to convert '&', '-' and whitespaces to a SPACE ' ', they are used to connect a chain of numbers

example: `ABC - KE`  -->  `ABC KE`
         `103035 - 101926 - 105484` -> `103035 101926 105484`
         `111553/L00847 & 111558/L00895` -> `111553/L00847 111558/L00895`

  • clean2:将符合以下三个子模式的文本转换为'ABC'

  • clean2: convert text matching the following three sub-patterns into 'ABC '

    + ABCS?(?:[/\s]+KE|(?=\s*\d))
      + ABC followed by an optional `S`
        + followed by at least one slash or whitespace and then `KE` --> `[/\s]+KE`
          example: `ABC/KE 110330/L63868` to `ABC  110330/L63868`
        + or followed by optional whitespaces and then at least one digit  --> (?=\s*\d)
          example: ABC898456 -> `ABC 898456`
    + \bFOR\s+(?:[A-Z]+\s+)*
      + `FOR` words
          example: `FOR DEF HJK 12345` -> `ABC 12345`
    

  • data:\b(\d{4,6})(?=[AZ/_]|$) 是一个匹配实际的正则表达式数字:4-6 位数字后跟 [AZ/] 或 end_of_string

  • data: \b(\d{4,6})(?=[A-Z/_]|$) is a regex to match actual numbers: 4-6 digits followed by [A-Z/] or end_of_string

    (2) 创建一个 dict 来保存所有 3 个模式:

    (2) Create a dict to save all 3 patterns:

    ptns = {
        'clean1': re.compile(r'[-&\s]+', re.UNICODE)
      , 'clean2': re.compile(r'\bABCS?(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*', re.UNICODE)
      , 'data'  : re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE)
    }
    

    (3) 创建一个函数来查找匹配的数字并将它们保存到数组中

    (3) Create a function to find matched numbers and save them into an array

    def find_number(s_t_r, ptns, is_debug=0):
      try:
        arr = re.sub(ptns['clean2'], 'ABC ', re.sub(ptns['clean1'], ' ', s_t_r.upper())).split()
        if is_debug: return arr
        # f: flag to identify if a chain of matches is started, default is 0(false)
        f = 0
        new_arr = []
        # iterate through the above arr and start checking numbers when anchor is detected and set f=1
        for x in arr:
          if x == 'ABC':
            f = 1
          elif f:
            new = re.findall(ptns['data'], x)
            # if find any matches, else reset the flag
            if new:
              new_arr.extend(new)
            else:
              f = 0
        return new_arr
      except Exception as e:
        # only use print in local debugging
        print('ERROR:{}:\n  [{}]\n'.format(s_t_r, e))
        return []
    

    (4) 定义udf函数

    (4) defind the udf function

    udf_find_number = udf(lambda x: find_number(x, ptns), ArrayType(StringType()))
    

    (5) 获取 new_column

    (5) get the new_column

    df.withColumn('new_column', udf_find_number('column')).show(truncate=False)
    +------------------------------------------------------------------------------------------+------------------------------------------------+
    |column                                                                                    |new_column                                      |
    +------------------------------------------------------------------------------------------+------------------------------------------------+
    |Hoy es da de ABC/KE98789T983456 clase.                                                    |[98789]                                         |
    |Como ABC/KE 34562Z845673 todas las ma?anas                                                |[34562]                                         |
    |Hoy tiene ABC/KE 110330/L63868 clase de matem篓垄ticas,                                    |[110330]                                        |
    |Marcos se ABC 898456/L56784 levanta con sue?o.                                            |[898456]                                        |
    |Marcos se ABC898456 levanta con sue?o.                                                    |[898456]                                        |
    |comienza ABC - KE 60014 -T60058                                                           |[60014]                                         |
    |ingl篓娄s y FOR 102658/L61144 ciencia. Se viste, desayuna                                  |[102658]                                        |
    |y comienza FOR ABC- 72981 / KE T79581: el camino hacia la                                 |[72981]                                         |
    |escuela. Se FOR ABC 101665 - 103035 - 101926 - 105484 - 103036 - 103247 - encuentra con su|[101665, 103035, 101926, 105484, 103036, 103247]|
    |escuela ABCS 206048/206049/206050/206051/205225-FG-matem篓垄ticas-                         |[206048, 206049, 206050, 206051, 205225]        |
    |encuentra ABCS 111553/L00847 & 111558/L00895 - matem篓垄ticas                              |[111553, 111558]                                |
    |ciencia ABC 163278/P20447 AND RETROFIT ABCS 164567/P21000 - 164568/P21001 - desayuna      |[163278, 164567, 164568]                        |
    |ABC/KE 71729/T81672 - 71781/T81674 71782/T81676 71730/T81673 71783/T81677 71784/T         |[71729, 71781, 71782, 71730, 71783, 71784]      |
    +------------------------------------------------------------------------------------------+------------------------------------------------+
    

    (6) 调试代码,使用 find_number(row.column, ptns, 1) 检查前两个正则表达式模式如何/是否按预期工作:

    (6) code for debugging, use find_number(row.column, ptns, 1) to check how/if the first two regex patterns work as expected:

    for row in df.limit(10).collect():
      print('{}:\n    {}\n'.format(row.column, find_number(row.column, ptns)))
    

    一些注意事项:

    • clean2模式中,ABCS和ABS的处理方式相同.如果它们不同,只需删除S"并在模式末尾添加一个新的替代 ABCS\s*(?=\d)

    • in clean2 pattern, ABCS and ABS are treated the same way. if they are different, just remove the 'S' and add a new alternative ABCS\s*(?=\d) to the end of the pattern

    re.compile(r'\bABC(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*|ABCS\s*(?=\d)')
    

  • 当前模式 clean1 只处理 '-', '&'和空格作为连续连接符,您可以添加更多字符或单词,例如and"、or",例如:

  • current pattern clean1 only treats '-', '&' and whitespaces as consecutive connector, you might add more characters or words like 'and', 'or', for example:

    re.compile(r'[-&\s]+|\b(?:AND|OR)\b')
    

  • FOR words 是\bFOR\s+(?:[AZ]+\s+)*,这可能会根据单词等中是否允许数字进行调整.

  • FOR words is \bFOR\s+(?:[A-Z]+\s+)*, this might be adjusted based on if numbers are allowed in words etc.

    这是在 Python-3 上测试的.使用 Python-2,unicode 可能有问题,您可以使用 参考

    This was tested on Python-3. using Python-2, there might be issue with unicode, you can fix it by using the method in the first answer of reference

    这篇关于如何从文本中找到数字的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆