How to return a string if re.findall finds no match

Question

I am writing a script that takes scanned PDF files and converts them into lines of text to enter into a database. I use re.findall with a list of regular expressions to pull certain values out of the strings that Tesseract extracts. My problem is that when a regular expression finds no match, I want the result to contain "Error" so I can see that something went wrong.

I have tried a handful of if/else statements, but I can't get any of them to notice the None value.

from wand.image import Image as Img
import ghostscript
from PIL import Image
import pytesseract
import re
import os

def get_text_from_pdf(pendingpdf,pendingimg):
    with Img(filename=pendingpdf, resolution=300) as img:
        img.compression_quality = 99
        img.save(filename=pendingimg)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    extractedtext = pytesseract.image_to_string(Image.open(pendingimg))
    os.unlink(pendingimg)
    return extractedtext

def get_results(vendor,extracted_string,results):
    for v in vendor:
        pattern = re.compile(v)
        for match in re.findall(pattern,extracted_string):
            if type(match) is str:
                results.append(match)
            else:
                results.append("Error")
    return results

pendingpdf = r'J:\TBHscan07022019090315001.pdf'
pendingimg = 'Test1.jpg'
# Raw strings keep regex escapes such as \w and \d from being read as string escapes.
aggind = [r"^(\w+)(?:.+)\n+3600",
          r"Ticket: (nonsensewordstothrowerror)",
          r"Ticket: \d+\s([0-9|/]+)",
          r"Product: (\w+.+)\n",
          r"Quantity: ([\d\.]+)",
          r"Truck (\w+)"]
vendor = aggind
extracted_string = get_text_from_pdf(pendingpdf,pendingimg)
results = []

# Reuse the text extracted above instead of running OCR a second time.
print(get_results(vendor,extracted_string,results))

Recommended answer

You can do it in a single line:

results += re.findall(pattern, extracted_string) or ["Error"]
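
The reason this works is that re.findall returns an empty list, never None, when nothing matches, and an empty list is falsy. The sketch below illustrates that with a made-up sample string; the text and patterns are assumptions for illustration, not the asker's real OCR output.

```python
import re

# Hypothetical sample text standing in for the OCR output.
text = "Ticket: 12345 07/02/2019\nQuantity: 21.5\n"

# re.findall returns an empty list (not None) when nothing matches.
no_match = re.findall(r"Truck (\w+)", text)
print(no_match)          # []
print(no_match is None)  # False

# An empty list is falsy, so `or` falls through to the fallback list.
results = []
results += re.findall(r"Quantity: ([\d.]+)", text) or ["Error"]
results += re.findall(r"Truck (\w+)", text) or ["Error"]
print(results)           # ['21.5', 'Error']
```

This is also why an `is None` check, or a test placed inside the `for match in ...` loop, never fires: with no matches the loop body simply does not run.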

By the way, you get no benefit from compiling the pattern inside the vendor loop, because each compiled pattern is used only once.
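
If the same patterns were applied to many extracted strings, compiling them once outside the function is where it could plausibly pay off. A minimal sketch, assuming a hypothetical module-level PATTERNS list and made-up page text:

```python
import re

# Hypothetical pattern list, compiled a single time up front.
PATTERNS = [re.compile(p) for p in (r"Quantity: ([\d.]+)", r"Truck (\w+)")]

def get_results(extracted_string):
    """Collect matches for every precompiled pattern, or "Error" if none."""
    results = []
    for pattern in PATTERNS:
        # Compiled patterns expose findall directly.
        results += pattern.findall(extracted_string) or ["Error"]
    return results

# Made-up page texts; the second one has no truck line.
pages = ["Quantity: 3.5\nTruck A12\n", "Quantity: 7\n"]
all_results = [get_results(page) for page in pages]
print(all_results)  # [['3.5', 'A12'], ['7', 'Error']]
```

Even then the gain is likely small, since the re module keeps its own cache of recently compiled patterns.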

Your function could also return the whole search result using a single list comprehension:

return [m for v in vendor for m in re.findall(v, extracted_string) or ["Error"]]
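
Placed in a complete function, that comprehension might look like the following sketch. The two-argument signature and the sample text are assumptions; the original function also takes a results list.

```python
import re

def get_results(vendor, extracted_string):
    """Return every match per pattern, or "Error" for a pattern with none."""
    return [m for v in vendor
              for m in re.findall(v, extracted_string) or ["Error"]]

# Hypothetical patterns and text, loosely modelled on the question.
vendor = [r"Ticket: (\d+)", r"Ticket: (nonsensewordstothrowerror)", r"Truck (\w+)"]
sample = "Ticket: 1001\nTruck T7\nTicket: 1002\n"
print(get_results(vendor, sample))  # ['1001', '1002', 'Error', 'T7']
```

Note that the first pattern contributes two entries, so the position of "Error" in the output does not line up with the position of the failing pattern in vendor.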

It is a bit odd to both modify and return the results list that is passed in as a parameter. This may produce unexpected side effects when you use the function.
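
One way such a side effect can show up is sketched below. Both function names are hypothetical; the first mirrors the question's signature and the second builds a fresh list on each call.

```python
import re

def get_results_mutating(vendor, extracted_string, results):
    # Mirrors the question's signature: extends the caller's list in place.
    for v in vendor:
        results += re.findall(v, extracted_string) or ["Error"]
    return results

def get_results_pure(vendor, extracted_string):
    # Builds and returns a new list; nothing the caller owns is changed.
    return [m for v in vendor
              for m in re.findall(v, extracted_string) or ["Error"]]

vendor = [r"Truck (\w+)"]
shared = []
first = get_results_mutating(vendor, "Truck A1", shared)
second = get_results_mutating(vendor, "Truck B2", shared)
print(first is second)  # True: both names refer to the same list
print(first)            # ['A1', 'B2']

print(get_results_pure(vendor, "Truck A1"))  # ['A1']
print(get_results_pure(vendor, "Truck B2"))  # ['B2']
```

With the mutating version, results from one document leak into the next unless the caller remembers to pass a new empty list every time.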

Your "Error" flag may appear several times in the result list, and since each pattern may return multiple matches, it will be hard to determine which pattern failed to find a value.
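
If knowing which pattern failed matters, one possible approach (not part of the original answer) is to key the matches by pattern. The helper name find_by_pattern and the sample data are hypothetical.

```python
import re

def find_by_pattern(vendor, extracted_string):
    """Map each pattern to its list of matches; an empty list means no match."""
    return {v: re.findall(v, extracted_string) for v in vendor}

# Hypothetical patterns and text, loosely modelled on the question.
vendor = [r"Ticket: (\d+)", r"Ticket: (nonsensewordstothrowerror)", r"Truck (\w+)"]
sample = "Ticket: 1001\nTruck T7\n"

matches = find_by_pattern(vendor, sample)
# Patterns whose match list is empty are the ones that failed.
failed = [v for v, found in matches.items() if not found]
print(failed)  # ['Ticket: (nonsensewordstothrowerror)']
```

This assumes the patterns in vendor are unique; duplicate patterns would collapse into a single dictionary key.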

If you only want to signal an error when none of the vendor patterns match, you could apply the or ["Error"] trick to the whole result:

return [m for v in vendor for m in re.findall(v, extracted_string)] or ["Error"]
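
As a self-contained sketch of that variant, again with an assumed two-argument signature and made-up inputs:

```python
import re

def get_results(vendor, extracted_string):
    """Return all matches, or ["Error"] only when no pattern matched at all."""
    return [m for v in vendor
              for m in re.findall(v, extracted_string)] or ["Error"]

# Hypothetical patterns; the second input simulates an unreadable scan.
vendor = [r"Quantity: ([\d.]+)", r"Truck (\w+)"]
print(get_results(vendor, "Quantity: 4.25\n"))  # ['4.25']
print(get_results(vendor, "unreadable scan"))   # ['Error']
```

Here a single missing field (the truck in the first call) goes unreported; the flag appears only when the whole extraction comes back empty.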
