Fuzzy smart number parsing in Python


Problem description

I want to parse decimal numbers regardless of their format, which is unknown. The language of the original text is also unknown and may vary. In addition, the source string can contain extra text before or after the number, such as a currency or a unit.

I'm currently using the following:

import re

# NOTE: Do not use, this algorithm is buggy. See below.
def extractnumber(value):

    if (isinstance(value, int)): return value
    if (isinstance(value, float)): return value

    result = re.sub(r'&#\d+', '', value)
    result = re.sub(r'[^0-9\,\.]', '', result)

    if (len(result) == 0): return None

    numPoints = result.count('.')
    numCommas = result.count(',')

    result = result.replace(",", ".")

    if ((numPoints > 0 and numCommas > 0) or (numPoints == 1) or (numCommas == 1)):
        decimalPart = result.split(".")[-1]
        integerPart = "".join ( result.split(".")[0:-1] )
    else:
        integerPart = result.replace(".", "")

    result = int(integerPart) + (float(decimalPart) / pow(10, len(decimalPart) ))

    return result

This kind of works...

>>> extractnumber("2")
2
>>> extractnumber("2.3")
2.3
>>> extractnumber("2,35")
2.35
>>> extractnumber("-2 000,5")
-2000.5
>>> extractnumber("EUR 1.000,74 €")
1000.74

>>> extractnumber("20,5 20,8") # Testing failure...
ValueError: invalid literal for int() with base 10: '205 208'

>>> extractnumber("20.345.32.231,50") # Returns false positive
2034532231.5

So my method seems very fragile to me, and it returns lots of false positives.

Is there any library or smart function that can handle this? Ideally 20.345.32.231,50 should not pass, but numbers in other locale formats such as 1.200,50 or 1 200'50 would be extracted, regardless of the amount of other text and characters (including newlines) around them.

(Updated implementation according to the accepted answer: https://github.com/jjmontesl/cubetl/blob/master/cubetl/text/functions.py#L91)

Recommended answer

You can do this with a suitably fancy regular expression. Here's my best attempt at one. I use named capturing groups because, with a pattern this complex, numbered groups would be much more confusing to use in backreferences.

First, the regexp pattern:

_pattern = r"""(?x)       # enable verbose mode (which ignores whitespace and comments)
    ^                     # start of the input
    [^\d+,\-.]*           # prefixed junk: anything but digits, signs, ',' and '.'
    (?P<number>           # capturing group for the whole number
        (?P<sign>[+-])?       # sign group (optional)
        (?P<integer_part>         # capturing group for the integer part
            \d{1,3}               # leading digits in an int with a thousands separator
            (?P<sep>              # capturing group for the thousands separator
                [ ,.]                 # the allowed separator characters
            )
            \d{3}                 # exactly three digits after the separator
            (?:                   # non-capturing group
                (?P=sep)              # the same separator again (a backreference)
                \d{3}                 # exactly three more digits
            )*                    # repeated 0 or more times
        |                     # or
            \d+                   # simple integer (just digits with no separator)
        )?                    # integer part is optional, to allow numbers like ".5"
        (?P<decimal_part>     # capturing group for the decimal part of the number
            (?P<point>            # capturing group for the decimal point
                (?(sep)               # conditional pattern, only tested if sep matched
                    (?!                   # a negative lookahead
                        (?P=sep)              # backreference to the separator
                    )
                )
                [.,]                  # the accepted decimal point characters
            )
            \d+                   # one or more digits after the decimal point
        )?                    # the whole decimal part is optional
    )
    [^\d]*                # suffixed junk
    $                     # end of the input
"""
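
The two less common constructs in this pattern, the named backreference (?P=sep) and the conditional group (?(sep)...), may be easier to follow in isolation. The following is a minimal sketch rather than part of the original answer; the names _grouping, _decimal, is_grouped and has_valid_decimal are made up for illustration, and both patterns are reduced versions of the full one above.

```python
import re

# Hypothetical demo patterns, reduced from the full pattern above.

# (?P=sep) must repeat whatever the "sep" group captured, so every
# thousands separator within one number has to be the same character.
_grouping = re.compile(r"^\d{1,3}(?P<sep>[ ,.])\d{3}(?:(?P=sep)\d{3})*$")

# (?(sep)...) is only tested when "sep" took part in the match; here it
# rejects a decimal point that is identical to the thousands separator.
_decimal = re.compile(
    r"^(?:\d{1,3}(?P<sep>[ ,.])\d{3}|\d+)"
    r"(?P<point>(?(sep)(?!(?P=sep)))[.,])\d+$"
)


def is_grouped(text):
    """True if text is an integer with consistent thousands separators."""
    return _grouping.match(text) is not None


def has_valid_decimal(text):
    """True if the decimal point differs from the thousands separator."""
    return _decimal.match(text) is not None
```

With these, is_grouped("1.234.567") holds while is_grouped("1.234,567") does not, because the second separator differs from the first. Likewise has_valid_decimal("1.234,5") holds and has_valid_decimal("1.234.5") does not.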

And here's a function to use it:

import re

def parse_number(text):
    match = re.match(_pattern, text)
    if match is None or not (match.group("integer_part") or
                             match.group("decimal_part")):    # failed to match
        return None                      # consider raising an exception instead

    num_str = match.group("number")      # get all of the number, without the junk
    sep = match.group("sep")
    if sep:
        num_str = num_str.replace(sep, "")     # remove thousands separators

    if match.group("decimal_part"):
        point = match.group("point")
        if point != ".":
            num_str = num_str.replace(point, ".")  # regularize the decimal point
        return float(num_str)

    return int(num_str)

Some numeric strings with exactly one comma or period followed by exactly three digits (like "1,234" and "1.234") are ambiguous. This code parses both as integers with a thousands separator (1234) rather than as floating-point values (1.234), regardless of which separator character is used. If you want a different outcome for those numbers (e.g. if you'd prefer to get a float out of 1.234), you could handle them with a special case.

Some test output:

>>> test_cases = ["2", "2.3", "2,35", "-2 000,5", "EUR 1.000,74 €",
                  "20,5 20,8", "20.345.32.231,50", "1.234"]
>>> for s in test_cases:
    print("{!r:20}: {}".format(s, parse_number(s)))


'2'                 : 2
'2.3'               : 2.3
'2,35'              : 2.35
'-2 000,5'          : -2000.5
'EUR 1.000,74 €'    : 1000.74
'20,5 20,8'         : None
'20.345.32.231,50'  : None
'1.234'             : 1234
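
The special case mentioned above for ambiguous strings such as "1,234" and "1.234" could be sketched roughly as follows. This is an assumption about one possible policy, not part of the original answer: the helper name resolve_ambiguous and its decimal_chars parameter are hypothetical, and the helper only recognises the ambiguous shape (one to three digits, a single period or comma, then exactly three digits).

```python
import re

# Hypothetical helper, not part of the original answer. It matches only the
# ambiguous shape: 1-3 digits, one '.' or ',', exactly 3 digits, plus junk.
_ambiguous = re.compile(
    r"^\D*?(?P<sign>[+-])?(?P<whole>\d{1,3})(?P<sep>[.,])(?P<frac>\d{3})\D*$"
)


def resolve_ambiguous(text, decimal_chars="."):
    """Return a number for ambiguous input, or None if text is not ambiguous.

    Separators listed in decimal_chars are read as a decimal point;
    any other separator is read as a thousands separator.
    """
    match = _ambiguous.match(text)
    if match is None:
        return None  # not the ambiguous shape; fall back to the general parser
    sign = match.group("sign") or ""
    whole, frac = match.group("whole"), match.group("frac")
    if match.group("sep") in decimal_chars:
        return float(f"{sign}{whole}.{frac}")
    return int(f"{sign}{whole}{frac}")
```

A caller might try resolve_ambiguous(text) first and fall back to parse_number(text) whenever it returns None. With the default decimal_chars, "1.234" becomes the float 1.234 while "1,234" stays the integer 1234.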
