How to keep numbers in text
Question
I have a PySpark DataFrame and I would like to improve the regex below. I want to add a condition or modify the regex to:
- recover every number that has a / or a letter attached at its end.
Example for case 1:
column_example | new_column
------------------------------------------------------|----------------- |
mundo por el número de NJU/LOK 12345T98789-hablantes | 12345
hispanohablantes ZES/UJ86758/L87586:residentes en | 86758
Example 2:
- I should not accept numbers that follow the word ABC.
Column example:
My_column | new_column
------------------------------------------------------|---------------------
mundo por el número de ABC 8567 hablantes | []
------------------------------------------------------|---------------------
con dominio nativo ABC 987480 millones de personas | []
------------------------------------------------------|---------------------
hispanohablantes residentes en ABC98754 otros países | []
My current code is:
ptn = re.compile(r'^(?:MOD)?[0-9]{4,6}$')
array_filter = udf(lambda arr: [ x.lstrip('MOD') for x in arr if re.match(ptn, x) ] if type(arr) is list else arr, ArrayType(StringType()))
How can I do it? Thank you.
Recommended answer
One way without using udf, for Spark before version 2.4.0:
from pyspark.sql.functions import split, regexp_replace

df.withColumn('new_column'
   , split(
       regexp_replace(
           regexp_replace('My_column', r'.*?(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])', '$1\0')
         , '\0?[^\0]*$'
         , ''
       )
     ,'\0')
   ) \
  .show(truncate=False)
+-----------------------------------------------------------------------+--------------+
|My_column |new_column |
+-----------------------------------------------------------------------+--------------+
|23458/ mundo por el nmero de NJU/LOK 12345T98789 hablantes |[23458, 12345]|
|con dominio nativo ABC 987480 millones ZES/UJ86758/L87586:residentes en|[86758] |
|hispanohablantes residentes en ABC98754/ otros pases |[] |
+-----------------------------------------------------------------------+--------------+
Where:
- use regexp_replace to replace the text matching the pattern
  .*?(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])
  with $1\0. This removes all unrelated text before NUMBER_NEEDED (captured in $1), which must not be preceded by ABC\s{0,5} or by \d but must be followed by [A-Z/], and it puts a NULL char \0 at the end of each matched $1.
- use split(text, '\0') to convert the resulting text into an array. Note that the last item of the array is irrelevant and should be excluded.
- use another regexp_replace(text, '\0?[^\0]*$', '') to remove the trailing unrelated text before running the split() function above.
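The three steps above can be traced on a single string in plain Python, without Spark. This is only a sketch under assumptions: the helper name extract_numbers is hypothetical, and because Python's re module does not accept the variable-width lookbehind, whitespace after ABC is collapsed first and a fixed-width (?<!ABC) is used instead. Spark's own functions are not called here.

```python
import re

NUL = '\x00'  # delimiter, assumed to be absent from the input text

# Same shape as the Spark pattern, but with a fixed-width lookbehind
# (an assumption swapped in so that Python's re accepts it).
PATTERN = re.compile(r'.*?(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])', re.DOTALL)

def extract_numbers(text):
    """Hypothetical helper mirroring: replace -> trim tail -> split."""
    # Pre-step (not in the Spark version): drop whitespace right after ABC
    text = re.sub(r'(?<=ABC)\s+', '', text)
    # Step 1: keep each wanted number plus a NUL, dropping the text before it
    marked = PATTERN.sub(lambda m: m.group(1) + NUL, text)
    # Step 2: remove the last NUL and the unrelated text after it
    trimmed = re.sub(r'\x00?[^\x00]*$', '', marked)
    # Step 3: split on NUL; an empty string means nothing matched
    return trimmed.split(NUL) if trimmed else []

print(extract_numbers('23458/ mundo por el nmero de NJU/LOK 12345T98789 hablantes'))
# ['23458', '12345']
print(extract_numbers('hispanohablantes residentes en ABC98754/ otros pases'))
# []
```

In plain Python, splitting an empty string yields [''], so the sketch returns an empty list explicitly when nothing matched.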
Notes:
- (?<!ABC\s{0,5}) allows 0 to 5 whitespace characters between ABC and NUMBER_NEEDED. Since the regex negative lookbehind does not support (?<!ABC\s*), increase 5 to a larger number if your text might contain more spaces in between. By the way, (?<!ABC\s{0,5}) is fine in PySpark but invalid in Python's re module, which allows only fixed-width patterns in a lookbehind.
- prepend (?s) to the pattern to enable dotall mode if any texts contain line breaks.
- I assumed that the NULL char \0 does not appear in your original texts. Since it will never be part of a match, you can remove all of them (regexp_replace(text, '\0', '')) before running the 3 functions above.
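Two of the notes above can be checked directly in Python. The sketch below only tests what the standard re module accepts and how (?s) changes matching; it does not involve Spark, and the helper name compiles is hypothetical.

```python
import re

def compiles(pattern):
    """Hypothetical helper: True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable-width lookbehind: accepted by Spark's (Java) regex, rejected by Python's re
variable_width = r'(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])'
# Fixed-width lookbehind: accepted by Python's re
fixed_width = r'(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])'

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True

# Without (?s), "." stops at a line break, so the leading ".*?" cannot cross it
text = 'line one\n12345/'
print(re.match(r'.*?(\d{4,6})/', text))               # None
print(re.match(r'(?s).*?(\d{4,6})/', text).group(1))  # 12345
```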
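Another way, using udf:

import re
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

# Python's re needs a fixed-width lookbehind, so whitespace after ABC is removed before matching
ptn = re.compile(r'(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])')

# Return every matching number as an array of strings; NULL input yields an empty array
find_number = udf(lambda x: re.findall(ptn, re.sub(r'(?<=ABC)\s+', '', x)) if x else [], ArrayType(StringType()))

df.withColumn('new_column', find_number('My_column')).show()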
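Because the udf wraps ordinary Python, its regex logic can be tried on plain strings before involving Spark. A minimal sketch, using sample rows taken from the question and from the output table above; the function name find_numbers_plain is hypothetical and simply mirrors the lambda.

```python
import re

ptn = re.compile(r'(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])')

def find_numbers_plain(text):
    """Hypothetical plain-Python mirror of the lambda passed to udf()."""
    if not text:
        return []
    # Remove whitespace right after ABC so the fixed-width lookbehind can see ABC
    cleaned = re.sub(r'(?<=ABC)\s+', '', text)
    return ptn.findall(cleaned)

samples = [
    'mundo por el número de NJU/LOK 12345T98789-hablantes',
    'hispanohablantes ZES/UJ86758/L87586:residentes en',
    'mundo por el número de ABC 8567 hablantes',
    'con dominio nativo ABC 987480 millones ZES/UJ86758/L87586:residentes en',
    None,
]
for sample in samples:
    print(find_numbers_plain(sample))
# ['12345']
# ['86758']
# []
# ['86758']
# []
```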