How to keep numbers in text


Question

I have a PySpark DataFrame and would like to improve the regex below. I want to add a condition or modify the regex to:


  • recover every number that has a / or a letter attached at its end.

Example for case 1:

column_example                                        |   new_column
------------------------------------------------------|-----------------
mundo por el número de NJU/LOK 12345T98789-hablantes  |   12345
hispanohablantes ZES/UJ86758/L87586:residentes en     |   86758

Case 2:


  • numbers that come after the word ABC should not be accepted.

Column example:

My_column                                             |         new_column
------------------------------------------------------|---------------------
mundo por el número de ABC 8567 hablantes             |           []
------------------------------------------------------|---------------------
con dominio nativo ABC 987480 millones de personas    |           []
------------------------------------------------------|---------------------
hispanohablantes residentes en ABC98754 otros países  |           []

My current code is:

ptn = re.compile(r'^(?:MOD)?[0-9]{4,6}$')
array_filter = udf(lambda arr: [ x.lstrip('MOD') for x in arr if re.match(ptn, x) ] if type(arr) is list else arr, ArrayType(StringType()))

How can I do this? Thank you.

Recommended answer

One way to do this without a udf, for Spark before version 2.4.0:

from pyspark.sql.functions import split, regexp_replace

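df.withColumn('new_column'
   , split(
       regexp_replace(
           # step 1: drop the text before each wanted number and append \0 to the number
           regexp_replace('My_column', r'.*?(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])', '$1\0')
           # step 2: remove the last \0 and the unrelated text that follows it
         , '\0?[^\0]*$'
         , ''
       )
       # step 3: split on \0 to get the array of numbers
     ,'\0')
   ) \
  .show(truncate=False)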
+-----------------------------------------------------------------------+--------------+
|My_column                                                              |new_column    |
+-----------------------------------------------------------------------+--------------+
|23458/ mundo por el nmero de NJU/LOK 12345T98789 hablantes             |[23458, 12345]|
|con dominio nativo ABC 987480 millones ZES/UJ86758/L87586:residentes en|[86758]       |
|hispanohablantes  residentes en ABC98754/ otros pases                  |[]            |
+-----------------------------------------------------------------------+--------------+

Where:


  • use regexp_replace to replace the text matching the following pattern

.*?(?<!ABC\s{0,5})(?<!\d)(\d{4,6})(?=[A-Z/])


with $1\0. This removes all unrelated text before each NUMBER_NEEDED (captured in $1), which must not be preceded by ABC\s{0,5} or by \d, and must be followed by [A-Z/]. A NULL char \0 is appended to each matched $1.


  • use split(text, '\0') to convert the resulting text into an array. Note that the last item of the array is unrelated text and should be excluded

  • use another regexp_replace(text, '\0?[^\0]*$', '') to remove that trailing unrelated text before running the split() function above
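The three steps above can be traced outside Spark with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the answer's Spark code: the names `extract_numbers`, `STEP1`, `STEP2` and `NUL` are hypothetical, and the ABC guard is simplified to `(?<!ABC)` (no whitespace allowed between ABC and the number) because Python's re module only accepts fixed-width lookbehinds.

```python
import re

NUL = '\x00'  # separator; assumed never to occur in the input text

# Simplified guard (assumption): (?<!ABC) only covers "ABC" placed directly
# before the number, since Python's re requires fixed-width lookbehinds.
STEP1 = re.compile(r'.*?(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])')
STEP2 = re.compile(r'\x00?[^\x00]*$')


def extract_numbers(text):
    # Step 1: replace "unrelated text + number" with "number + NUL".
    marked = STEP1.sub(lambda m: m.group(1) + NUL, text)
    # Step 2: drop the last NUL and everything after it
    # (this removes the whole text when step 1 matched nothing).
    trimmed = STEP2.sub('', marked)
    # Step 3: split on NUL; discard the empty string left when nothing matched.
    return [n for n in trimmed.split(NUL) if n]


rows = [
    '23458/ mundo por el nmero de NJU/LOK 12345T98789 hablantes',
    'con dominio nativo ABC 987480 millones ZES/UJ86758/L87586:residentes en',
    'hispanohablantes  residentes en ABC98754/ otros pases',
]
for row in rows:
    print(extract_numbers(row))
```

For the three sample rows this prints `['23458', '12345']`, `['86758']` and `[]`, which lines up with the new_column values shown in the output table. The final filter on empty strings is a choice made in this sketch; a plain split of an empty string yields one empty item.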

Notes:


  • (?<!ABC\s{0,5}) tests for 0 to 5 whitespace characters between ABC and the NUMBER_NEEDED. Because regex negative lookbehind does not support (?<!ABC\s*), increase 5 to a larger number if your text might contain more spaces in between. By the way, (?<!ABC\s{0,5}) works in PySpark but is invalid in Python's re module, which only allows fixed-width lookbehind patterns

  • prepend (?s) to the pattern to enable dotall mode if any of the texts contain line breaks

  • I assumed that the NULL char \0 does not appear in your original texts. Since it would never be part of a match, you can remove all occurrences (regexp_replace(text, '\0', '')) before running the 3 functions above.

Alternatively, using a udf with Python's re module:

import re
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

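# Python's re only allows fixed-width lookbehind, so the guard tests for a bare ABC
ptn = re.compile(r'(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])')

# remove whitespace right after ABC first, so that (?<!ABC) also covers "ABC 1234"
find_number = udf(lambda x: re.findall(ptn, re.sub(r'(?<=ABC)\s+', '', x)) if x else [], ArrayType(StringType()))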

df.withColumn('new_column', find_number('My_column')).show()
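The regex logic inside the udf can be checked without a Spark session. The sketch below is a plain-Python illustration; `find_numbers`, `PTN` and `variable_width_ok` are hypothetical names introduced here. It first confirms that Python's re rejects the variable-width lookbehind used in the Spark version, then applies the same workaround as the udf: strip the whitespace after ABC so that a fixed-width (?<!ABC) is enough.

```python
import re

# Python's re rejects a variable-width lookbehind such as (?<!ABC\s{0,5}).
try:
    re.compile(r'(?<!ABC\s{0,5})\d{4,6}')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Same pattern as in the udf: fixed-width lookbehinds only.
PTN = re.compile(r'(?<!ABC)(?<!\d)(\d{4,6})(?=[A-Z/])')


def find_numbers(text):
    # Mirror the udf: empty or missing input yields an empty list.
    if not text:
        return []
    # Remove whitespace right after ABC so (?<!ABC) also blocks "ABC   1234".
    normalized = re.sub(r'(?<=ABC)\s+', '', text)
    return PTN.findall(normalized)


print(variable_width_ok)
print(find_numbers('mundo por el número de NJU/LOK 12345T98789-hablantes'))
print(find_numbers('hispanohablantes ZES/UJ86758/L87586:residentes en'))
print(find_numbers('residentes en ABC   98754/ otros'))
```

This prints `False`, then `['12345']`, `['86758']` and `[]`. Normalizing the text before matching removes the need to pick an upper bound such as 5 for the number of spaces, at the cost of one extra substitution per row.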
