How to keep numbers in text

Views: 102  Published: 2020/10/16 21:57:53  Tags: regex, dataframe, pyspark


Question

I have a PySpark DataFrame and would like to improve the regex below. I want to add a condition or modify the regex to:

recover every number that is directly followed by a / or a letter.

Example for case 1:

column_example                                        |   new_column
------------------------------------------------------|-----------------
mundo por el número de NJU/LOK 12345T98789-hablantes  |   12345
hispanohablantes ZES/UJ86758/L87586:residentes en     |   86758

Example 2:

Numbers that come after the word ABC should not be accepted.

Example column:

My_column                                             |         new_column
------------------------------------------------------|---------------------
mundo por el número de ABC 8567 hablantes             |           []
------------------------------------------------------|---------------------
con dominio nativo ABC 987480 millones de personas    |           []
------------------------------------------------------|---------------------
hispanohablantes residentes en ABC98754 otros países  |           []

My current code is:

ptn = re.compile(r'^(?:MOD)?[0-9]{4,6}$')
array_filter = udf(lambda arr: [ x.lstrip('MOD') for x in arr if re.match(ptn, x) ] if type(arr) is list else arr, ArrayType(StringType()))

How can I do this? Thank you.

Recommended answer

One way without using a udf, for Spark versions before 2.4.0:

from pyspark.sql.functions import split, regexp_replace

df.withColumn('new_column'
   , split(
       regexp_replace(
           regexp_replace('My_column', r'.*?(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])', '$1\0')
         , '\0?[^\0]*$'
         , ''
       )
     ,'\0')
   ) \
  .show(truncate=False)
+-----------------------------------------------------------------------+--------------+
|My_column                                                              |new_column    |
+-----------------------------------------------------------------------+--------------+
|23458/ mundo por el número de NJU/LOK 12345T98789 hablantes            |[23458, 12345]|
|con dominio nativo ABC 987480 millones ZES/UJ86758/L87586:residentes en|[86758]       |
|hispanohablantes  residentes en ABC98754/ otros países                 |[]            |
+-----------------------------------------------------------------------+--------------+

Where:

Use regexp_replace to replace the text matching the following pattern:

.*?(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])

with $1\0. This removes all unrelated text before NUMBER_NEEDED (captured in $1), which must not be preceded by ABC\s{0,5} or by \d and must be followed by [A-Z/]. It also puts a NULL char \0 after each matched $1.

Use split(text, '\0') to convert the resulting text into an array. Note that the last item of that array would be the unrelated trailing text, which should be excluded.

For that reason, use another regexp_replace(text, '\0?[^\0]*$', '') to remove the trailing unrelated text before running the split() function.
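The mark, trim, and split steps can be traced in plain Python without a Spark session. This is only a sketch under assumptions: extract_numbers is a hypothetical helper, not part of the answer, and because Python's re module rejects (?<!ABC\s{0,5}), whitespace after ABC is removed first so that a fixed-width (?<!ABC) can stand in for it. Empty strings left by the split are dropped so that a row without matches yields an empty list.

```python
import re

NUL = "\x00"  # marker character, assumed absent from the real text

# Fixed-width stand-in for the Spark pattern (assumption, see lead-in).
MARK = re.compile(r".*?(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])")
# Everything from the last marker (if any) to the end of the string.
TAIL = re.compile(r"\x00?[^\x00]*$")


def extract_numbers(text):
    """Hypothetical helper mirroring regexp_replace -> regexp_replace -> split."""
    text = text.replace(NUL, "")                          # make sure the marker is unused
    text = re.sub(r"(?<=ABC)\s+", "", text)               # "ABC 123" -> "ABC123"
    marked = MARK.sub(lambda m: m.group(1) + NUL, text)   # keep each number, append marker
    trimmed = TAIL.sub("", marked)                        # drop the unrelated trailing text
    return [item for item in trimmed.split(NUL) if item]  # split on the marker


rows = [
    "23458/ mundo por el número de NJU/LOK 12345T98789 hablantes",
    "con dominio nativo ABC 987480 millones ZES/UJ86758/L87586:residentes en",
    "hispanohablantes  residentes en ABC98754/ otros países",
]
for row in rows:
    print(extract_numbers(row))
# ['23458', '12345']
# ['86758']
# []
```

The three rows are the ones shown in the answer's output table, so the printed lists can be compared against the new_column values there.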

Notes:

(?<!ABC\s{0,5}) allows 0 to 5 whitespace characters between ABC and NUMBER_NEEDED. Since regex negative lookbehind does not support (?<!ABC\s*), adjust 5 to a larger number if your text might contain more spaces in between. By the way, (?<!ABC\s{0,5}) is fine in PySpark but invalid in the Python re module, which allows only fixed-width lookbehind patterns.
Prepend (?s) to enable dotall mode if any text contains line breaks.
I assumed that the NULL char \0 does not appear in your original texts. Since it will never be part of a match, you can remove all of them (regexp_replace(text, '\0', '')) before running the three functions above.
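The note about Python's re module is easy to confirm. The sketch below only checks whether each pattern compiles; compiles is a hypothetical helper name introduced here for illustration.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Bounded but variable-width lookbehind, as used in the Spark version.
variable_width = r"(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])"
# Fixed-width lookbehind, acceptable to Python's re module.
fixed_width = r"(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])"

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
```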
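Another way, using a udf with the Python re module:

import re
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

# lookbehinds must be fixed-width in Python re, so ABC is tested without trailing whitespace
ptn = re.compile(r'(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])')

# remove whitespace after ABC first, then collect every match; null or empty input gives []
find_number = udf(lambda x: re.findall(ptn, re.sub(r'(?<=ABC)\s+', '', x)) if x else [], ArrayType(StringType()))

df.withColumn('new_column', find_number('My_column')).show()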
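The re.sub call inside the udf matters because the fixed-width (?<!ABC) cannot see past whitespace. A minimal sketch of the difference, run without Spark; the sample string and the name find_number_py are made up for illustration and are not from the original post:

```python
import re

ptn = re.compile(r"(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])")


def find_number_py(x):
    """Same logic as the lambda wrapped by the udf, as a plain function."""
    return re.findall(ptn, re.sub(r"(?<=ABC)\s+", "", x)) if x else []


sample = "residentes en ABC 98754/ otros"  # hypothetical input with a space after ABC

# Without the normalization the number is preceded by a space, not by ABC,
# so the lookbehind lets it through.
print(re.findall(ptn, sample))   # ['98754']

# With the whitespace removed, ABC sits directly before the digits and blocks the match.
print(find_number_py(sample))    # []
print(find_number_py(None))      # []
```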

