Why can't I find this string in RegEx?

Problem description

import pdfplumber

lines = []
total_check = 0

# file is the path to the PDF being parsed
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # print each extracted line to inspect the raw text
        for line in text.split('\n'):
            print(line)

Output data:

Totaalbedrag excl. btw € 25,00

When I try to retrieve the VAT figure from the data variable:

KVK_re = re.compile(r'(excl. btw .+)')
KVK_re.search(data).group(0)

Output: AttributeError: 'NoneType' object has no attribute 'group'

KVK_re = re.compile(r'(excl. btw .+)')
KVK_re.search(r'excl. btw € 25,00').group(0)

Output: 'excl. btw € 25,00'

How is it possible that the search finds € 25,00 when I paste the printed output in as a literal string, but fails when I pass the data variable?

Please help me!

Solution

In most cases, when a pattern containing a literal space fails to match text that looks identical, the cause is an invisible character or a non-breaking space in the input.
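One way to check this diagnosis is to look at the actual code points in the extracted line instead of its printed form. The following is a minimal sketch: the sample string is a made-up stand-in for what the PDF extraction might return (the non-breaking space in it is an assumption, not something the question confirms), and describe_non_ascii is a hypothetical helper name.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, repr, Unicode name) for every non-ASCII character."""
    return [
        (i, repr(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# Hypothetical sample: prints like the expected line, but hides a no-break space
sample = "Totaalbedrag excl.\xa0btw € 25,00"

# repr() shows escapes such as \xa0 that print() renders as a plain-looking space
print(repr(sample))
for index, char, name in describe_non_ascii(sample):
    print(index, char, name)
```

Running the same helper on the real data variable should show whether the character between "excl." and "btw" is an ordinary space or something else.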

If the input contains non-breaking spaces (\xA0), you can replace the literal spaces in the pattern with \s to match any whitespace, or with [ \xA0] to match either a regular or a non-breaking space.
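The difference between the three spellings can be sketched as follows. The input string is hypothetical: it assumes a single no-break space between "excl." and "btw", which may not be what the real PDF contains.

```python
import re

# Hypothetical input: a no-break space (\xa0) sits between "excl." and "btw"
data = "Totaalbedrag excl.\xa0btw € 25,00"

literal_space = re.compile(r"excl\. btw .+")
any_whitespace = re.compile(r"excl\.\sbtw\s.+")
space_or_nbsp = re.compile(r"excl\.[ \xA0]btw[ \xA0].+")

# The literal space does not match \xa0, so search() returns None
print(literal_space.search(data))
# Both alternatives accept the no-break space
print(repr(any_whitespace.search(data).group(0)))
print(repr(space_or_nbsp.search(data).group(0)))
```

Note that \s only covers \xA0 for str patterns in their default Unicode mode; with the re.ASCII flag or a bytes pattern it matches ASCII whitespace only, and the explicit [ \xA0] class is the safer choice.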

In this case there appears to be a combination of spaces and invisible characters, so you can use \W, which matches any non-word character, in place of the literal space:

r'excl\.\W+btw\W.+'
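A short sketch of that pattern in use is shown here. The input is invented for illustration (a zero-width space plus a regular space after "excl." and a no-break space after "btw"), and find_excl_btw is a hypothetical helper that also guards against the AttributeError from the question by checking the match before calling group().

```python
import re

EXCL_BTW_RE = re.compile(r"excl\.\W+btw\W.+")

def find_excl_btw(text):
    """Return the matched text, or None when the pattern is absent."""
    match = EXCL_BTW_RE.search(text)
    return match.group(0) if match else None

# Hypothetical input: zero-width space and regular space after "excl.",
# no-break space after "btw"
data = "Totaalbedrag excl.\u200b btw\xa0€ 25,00"

print(repr(find_excl_btw(data)))
# A line without the total yields None instead of raising AttributeError
print(find_excl_btw("no total on this line"))
```

The zero-width space is not whitespace, so \s+ would not match this input while \W+ does. The trade-off is that \W also accepts punctuation, which makes the pattern looser than a whitespace-only one.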
