Modify a pattern to find numbers
Question
I have these patterns to extract numbers from strings:
import re
ptns = {
    'clean1': re.compile(r'[-&\s]+', re.UNICODE),
    'clean2': re.compile(r'\bABCS?(?:[/\s-]+KE|(?=\s*\d))|\bFOR\s+(?:[A-Z]+\s+)*', re.UNICODE),
    'data': re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE),
}
I want to add some conditions to the patterns without touching the existing ones. The numbers should always be introduced by one of these words: ABCDEFGS, ABCDEFG, ABC, JUSTIF. Sometimes the word ends with an optional 'S', as in ABCDEFGS.
I want to extract all numbers that have 4, 5 or 6 digits from this text. The conditions and cases to add to the pattern are:
- Attached to ABC, then '.' (sometimes there is a single number, sometimes a list of numbers)
- Attached to ABC, a space, then '.' (sometimes there is a single number, sometimes a list of numbers)
- After ABCDEFGS, then a space
- After ABCDEFG, then a space (line 4)
- After JUSTIF, then '.' and a space
- After ABC but between ( ). See the example below.
Dataset example and expected result:
Column | New_column
————————————————————————————————————————
Hoy es día ABCDEFGS 05327 - 05771 - 06045 todas las mañanas | [05327, 05771, 06045]
————————————————————————————————————————
todas las mañanas ABCDEFG 6661 & ABCDEFG 11440 Se viste | [6661, 11440 ]
————————————————————————————————————————
escuela ABCDEFG 19652 matemáticas Hoy es día | [19652]
————————————————————————————————————————
y comienza ABCDEFG 76192/T85921 el camino hacia | [76192]
————————————————————————————————————————
Marcos se ABCDEFG 13462 S22786 camino | [13462]
————————————————————————————————————————
encuentra con su ABC. 19390 / 19351 viste, desayuna | [19390, 19351]
————————————————————————————————————————
escuela ABC.5498/5499/5470/5471 DEFINE AND DESIGN IMPROVE | [5498,5499,5470,5471]
————————————————————————————————————————
l camino hacia la ABC.20974 Marcos se | [20974]
————————————————————————————————————————
todas las mañanas ABC 160879-P15989/ 160878-P20181/160878-P20182 AND 160879-P20183 | [160879, 160878, 160878, 160879]
————————————————————————————————————————
ABC. 5498/5499/5470/5471 l camino hacia la | [5498,5499,5470,5471]
————————————————————————————————————————
todas las JUSTIF. 103383/L25469 todas | [103383]
————————————————————————————————————————
las (ABC 38770) OR CFM56-5B1/3 (ABC 37147) camino | [38770, 37147]
————————————————————————————————————————
hacia la (POST ABC 161104) hacia la | [161104]
————————————————————————————————————————
DEFINE AND DESIG ABC/KE: 73620T80840 DEFINE | [73620 ]
————————————————————————————————————————
DEFINE AND DESIGN IMPROVE ABC (39729) IMPROVE | [39729]
————————————————————————————————————————
Recommended answer
Per your request, I modified the three patterns that are used to clean the data and match the numbers.
In the data pattern, I replaced \b with (?:^|(?<=/)) so that a number must either be at the beginning of the string or be preceded by a slash /.
import re
ptns = {
    'clean1': re.compile(r'[/-]\s|\s[/-]|[&\s.():]+|\b(?:AND|OR)\b', re.UNICODE),
    'clean2': re.compile(r'\bABCS?[/\s]+KE|\b(?:ABCS?|ABCDEFGS?|JUSTIF|FOR)(?=\s*\d)', re.UNICODE),
    'data': re.compile(r'(?:^|(?<=/))(\d{4,6})(?=[A-Z/_-]|$)', re.UNICODE),
}
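To make the effect of the change to the data pattern concrete, here is a small sketch that runs the old and the new version on a few tokens. The names old_data and new_data are introduced only for this comparison, and the token 'P-15989' is a made-up case; the other two tokens come from the dataset.

```python
import re

# Old and new versions of the 'data' pattern, copied from the question and the answer.
old_data = re.compile(r'\b(\d{4,6})(?=[A-Z/_]|$)', re.UNICODE)
new_data = re.compile(r'(?:^|(?<=/))(\d{4,6})(?=[A-Z/_-]|$)', re.UNICODE)

tokens = [
    '5498/5499/5470/5471',  # slash-separated list: both versions find all four numbers
    '160879-P15989',        # hyphen after the number: only the new lookahead accepts it
    'P-15989',              # hypothetical token: \b matches after '-', ^ and (?<=/) do not
]

for token in tokens:
    print(token, old_data.findall(token), new_data.findall(token))
```

With the new pattern, a number is only picked up when it starts the token or directly follows a slash, and a trailing hyphen no longer blocks the match.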
Patterns:
- clean1: convert the following patterns into a SPACE
  - [/-]\s|\s[/-]: a slash or hyphen followed by a space or preceded by a space
    example: 'ABC- 72981' --> 'ABC 72981'
             'ABC 160879-P15989/' <-- no change since there is no SPACE around the hyphen
  - \b(?:AND|OR)\b: allows AND or OR to link a sequence of numbers
    example: '160878-P20181/160878-P20182 AND 160879-P20183' --> '160878-P20181/160878-P20182 160879-P20183'
  - [&\s.():]+: the hyphen was removed from this character class because it needs to be processed separately (first alternative); parentheses ( and ), dot . and colon : were added
    example: 'ABC. 19390' --> 'ABC 19390'
             '(ABC 38770)' --> 'ABC 38770'
             'ABC/KE: 73620T80840' --> 'ABC/KE 73620T80840'
- clean2: convert the following into ABC
  - \bABCS?[/\s]+KE: ABC (or ABCS) followed by one or more spaces or slashes and then KE. This part might be moved to the clean1 pattern if the same rule also applies to JUSTIF, ABCDEFGS?, etc.
  - \b(?:ABCS?|ABCDEFGS?|JUSTIF|FOR)(?=\s*\d): matches ABC, ABCS, ABCDEFG, ABCDEFGS, JUSTIF or FOR followed by optional whitespace and then a digit
- data: added the hyphen - to the lookahead as an anchor that may follow the matched 4-6 digit substring
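The two cleaning steps can be tried on their own. The sketch below applies clean1 and then clean2 to a few fragments from the dataset. The normalize helper is a hypothetical name introduced here, and the trailing space in the 'ABC ' replacement plus the whitespace collapsing are assumptions added so that the output is easy to compare with the examples above.

```python
import re

clean1 = re.compile(r'[/-]\s|\s[/-]|[&\s.():]+|\b(?:AND|OR)\b', re.UNICODE)
clean2 = re.compile(r'\bABCS?[/\s]+KE|\b(?:ABCS?|ABCDEFGS?|JUSTIF|FOR)(?=\s*\d)', re.UNICODE)

def normalize(text):
    """Hypothetical helper: clean1 matches become a space, clean2 matches become 'ABC'."""
    spaced = clean1.sub(' ', text)
    # The trailing space is an assumption: it keeps a directly attached number as its own token.
    tagged = clean2.sub('ABC ', spaced)
    # Collapse runs of whitespace and trim the ends for readable output.
    return ' '.join(tagged.split())

samples = [
    'ABC. 19390 / 19351',
    '(ABC 38770)',
    'ABC/KE: 73620T80840',
    'JUSTIF. 103383/L25469',
    'ABCDEFGS 05327 - 05771',
]

for sample in samples:
    print(repr(sample), '-->', repr(normalize(sample)))
```

After these two steps every keyword variant has been reduced to the single marker ABC, so the later matching step only has to look at the tokens that follow it.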
The rest of the code should be kept, see below:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
udf_find_number = udf(lambda x: find_number(x, ptns), ArrayType(StringType()))
df.withColumn('new_column', udf_find_number('column')).show(truncate=False)
+----------------------------------------------------------------------------------+--------------------------------+
|Column |new_column |
+----------------------------------------------------------------------------------+--------------------------------+
|Hoy es día ABCDEFGS 05327 - 05771 - 06045 todas las mañanas |[05327, 05771, 06045] |
| todas las mañanas ABCDEFG 6661 & ABCDEFG 11440 Se viste |[6661, 11440] |
|escuela ABCDEFG 19652 matemáticas Hoy es día |[19652] |
|y comienza ABCDEFG 76192/T85921 el camino hacia |[76192] |
|Marcos se ABCDEFG 13462 S22786 camino |[13462] |
|encuentra con su ABC. 19390 / 19351 viste, desayuna |[19390, 19351] |
|escuela ABC.5498/5499/5470/5471 DEFINE AND DESIGN IMPROVE |[5498, 5499, 5470, 5471] |
|l camino hacia la ABC.20974 Marcos se |[20974] |
|todas las mañanas ABC 160879-P15989/ 160878-P20181/160878-P20182 AND 160879-P20183|[160879, 160878, 160878, 160879]|
|ABC. 5498/5499/5470/5471 l camino hacia la |[5498, 5499, 5470, 5471] |
|todas las JUSTIF. 103383/L25469 todas |[103383] |
|las (ABC 38770) OR CFM56-5B1/3 (ABC 37147) camino |[38770, 37147] |
|hacia la (POST ABC 161104) hacia la |[161104] |
|DEFINE AND DESIG ABC/KE: 73620T80840 DEFINE |[73620] |
| DEFINE AND DESIGN IMPROVE ABC (39729) IMPROVE |[39729] |
+----------------------------------------------------------------------------------+--------------------------------+
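The find_number function called by the UDF is not shown in this answer. The following is only a sketch of what such a helper might look like, assuming it applies clean1, then clean2, splits the text on whitespace, and collects data matches from the tokens that follow an ABC marker until a token without a match ends the run. The token-walking logic and the 'ABC ' replacement with a trailing space are assumptions made here, not the original implementation.

```python
import re

ptns = {
    'clean1': re.compile(r'[/-]\s|\s[/-]|[&\s.():]+|\b(?:AND|OR)\b', re.UNICODE),
    'clean2': re.compile(r'\bABCS?[/\s]+KE|\b(?:ABCS?|ABCDEFGS?|JUSTIF|FOR)(?=\s*\d)', re.UNICODE),
    'data': re.compile(r'(?:^|(?<=/))(\d{4,6})(?=[A-Z/_-]|$)', re.UNICODE),
}

def find_number(text, ptns):
    """Hypothetical reconstruction: return the 4-6 digit numbers that follow a keyword."""
    if text is None:
        return []
    # Step 1: turn separators (clean1) into spaces.
    cleaned = ptns['clean1'].sub(' ', text)
    # Step 2: reduce every keyword variant (clean2) to the marker 'ABC'.
    cleaned = ptns['clean2'].sub('ABC ', cleaned)
    numbers = []
    in_chain = False  # True while we are reading tokens that follow an ABC marker
    for token in cleaned.split():
        if token == 'ABC':
            in_chain = True
        elif in_chain:
            found = ptns['data'].findall(token)
            if found:
                numbers.extend(found)
            else:
                in_chain = False  # first token without a number ends the chain
    return numbers

print(find_number('escuela ABC.5498/5499/5470/5471 DEFINE AND DESIGN IMPROVE', ptns))
print(find_number('Marcos se ABCDEFG 13462 S22786 camino', ptns))
```

Under these assumptions the sketch reproduces the new_column values in the table above for the sample rows; the real helper may handle edge cases differently.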
Let me know if this fixes the problems.