How to extract data from invoices in tabular format
Question
I'm trying to extract data from PDF/image invoices using computer vision. For that I used pytesseract, which is OCR-based.
This is the sample invoice.
The code I used is below:
from PIL import Image
import pytesseract

# Load the invoice image and run OCR over the whole page
img = Image.open("invoice-sample.jpg")
text = pytesseract.image_to_string(img)
print(text)
Using pytesseract, I got the output below:
http://mrsinvoice.com
’ Invoice
Your Company LLC Address 123, State, My Country P 111-222-333, F 111-222-334
BILLTO:
fofin Oe Invoice # 00001
Alpha Bravo Road 33 Invoice Date 32/12/2001
P: 111-292-333, F: 111-222-334
client@example.net Nomecof Reps Bob
Contact Phone 101-102-103
SHIPPING TO:
eine ce Payment Terms ash on Delivery
Office Road 38
P: 111-333-222, F: 122-222-334 Amount Due: $4,170
office@example.net
NO PRODUCTS / SERVICE QUANTITY / RATE / UNIT AMOUNT
HOURS: PRICE
1 tye 2 $20 $40
2__| Steering Wheel 5 $10 $50
3 | Engine oil 10 $15 $150
4 | Brake Pad 24 $1000 $2,400
Subtotal $275
Tax (10%) $27.5
Grand Total $202.5
‘THANK YOU FOR YOUR BUSINESS
The problem is that I want to extract the text and separate it into distinct fields such as vendor name, invoice number, item name, and item quantity.
Expected output:
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
I also tried the invoice2data Python library, but it has many limitations. I also tried regex and OpenCV's Canny edge detection to detect the text boxes separately, but failed to achieve the expected outcome.
Please help.
Recommended answer
You will need to do more processing, especially because the BILL TO and SHIPPING TO blocks are not aligned with the invoice table. But you can use the following code as a base.
import cv2
import pytesseract
from pytesseract import Output
import pandas as pd

# Binarize the image; with THRESH_OTSU the threshold is chosen automatically
# and the 127 passed here is ignored.
img = cv2.imread("aF0Dc.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Word-level OCR data: text plus bounding box, block, paragraph and line numbers
custom_config = r'-l eng --oem 1 --psm 6 '
d = pytesseract.image_to_data(thresh, config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# Keep only real words. conf may arrive as a string or a number depending on
# the pytesseract version, so compare it numerically.
conf = pd.to_numeric(df.conf, errors='coerce')
df1 = df[(conf != -1) & (df.text.str.strip() != '')]
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Process blocks from top to bottom of the page
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num'] == block]
    # Estimate the average character width from the longer words
    sel = curr[curr.text.str.len() > 3]
    # sel = curr
    char_w = (sel.width / sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left'] / char_w > prev_left + 1:
            added = int((ln['left']) / char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)
Result
bhttps//mrsinvoice.com
Lp
I |
Your Company LLC Address 123, State, My Country P 111-222-333, F 111-222-334
BILL TO:
P: 111-222-333, F: 111-222-334 m .
dlent@ccomplent
Contact Phone 101-102-103
john Doe office ayment Terms ash on Delivery
Office Road 38
P: 111-833-222, F: 122-222-334 Amount Due: $4,170
office@example.net
NO PRODUCTS / SERVICE QUANTITY / RATE / UNIT AMOUNT
HOURS, PRICE
1 | tyre 2 $20 $40
2 | Steering Wheet 5 $10 $50
3 | Engine ol 40 $15 $150
4 | Brake Pad 2a $1000 $2,400
Subtotal $275
Tax (10%) $275
Grand Total $302.5
‘THANK YOU FOR YOUR BUSINESS
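
As one possible next step (not part of the code above), the reconstructed lines can be turned into structured records with regular expressions. The sketch below is a minimal illustration that assumes the row and label layout seen in the two OCR outputs on this page. The names ROW_RE, FIELD_PATTERNS, parse_line_items and parse_header_fields are hypothetical helpers introduced here, and the patterns would need adjusting for other invoice templates.

```python
import re

# Assumed row layout, e.g. "2__| Steering Wheel 5 $10 $50":
# row number, optional "|" / "_" debris, description, quantity, rate, amount.
ROW_RE = re.compile(
    r"^\s*(?P<no>\d+)[\s_|]+"
    r"(?P<item>.+?)\s+"
    r"(?P<qty>\d+)\s+"
    r"\$(?P<rate>[\d,]+(?:\.\d+)?)\s+"
    r"\$(?P<amount>[\d,]+(?:\.\d+)?)\s*$"
)

# Assumed labels, taken from the sample invoice text only.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*#\s*(\S+)"),
    "invoice_date": re.compile(r"Invoice\s+Date\s+(\S+)"),
    "amount_due": re.compile(r"Amount\s+Due:?\s*\$([\d,]+(?:\.\d+)?)"),
}


def to_number(s):
    """Convert a string such as '2,400' to a float."""
    return float(s.replace(",", ""))


def parse_line_items(ocr_text):
    """Return one dict per line that matches the assumed table-row layout."""
    items = []
    for line in ocr_text.splitlines():
        m = ROW_RE.match(line)
        if m is None:
            continue  # headers, totals, and rows damaged by OCR are skipped
        items.append({
            "no": int(m.group("no")),
            "item": m.group("item").strip(),
            "quantity": int(m.group("qty")),
            "rate": to_number(m.group("rate")),
            "amount": to_number(m.group("amount")),
        })
    return items


def parse_header_fields(ocr_text):
    """Return the labelled header fields that could be found, as strings."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(ocr_text)
        if m:
            fields[name] = m.group(1)
    return fields
```

Calling parse_line_items(text) on a reconstructed block gives a list of dicts that can be passed directly to pd.DataFrame(...) to get the table. Rows where OCR misread a digit (such as the quantity "2a" in the result above) do not match the pattern and are skipped, so in practice it is worth logging unmatched lines and checking the parsed amounts against the subtotal.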