Looking for a little Python machine learning advice

Question

I'm interested in dabbling with Python and machine learning/automatic data entry. However, as my research has progressed, I've realised there are many different techniques, each with its own strengths.

I've decided I might get further if I learn in the opposite direction, i.e. pick a problem/task and learn by solving it.

I occasionally have to process data from faxed invoices, and I'm hoping to make a program that can enter these for me once I've scanned them in.

The faxes basically consist of 2 identical tables. Each row denotes a separate worker. The 1st column is the worker's name (a choice of 6), the 2nd is an address, and the rest of the columns are tick boxes that denote different jobs. There is also an invoice ID in a box at the top of the page.

I'm hoping someone can briefly explain how they would go about this. Would they use an SVM for text recognition, or another technique? And how could I make a program understand that a tick in the 5th box along means 'cleaned=yes', and that the number in the top-left box is the ID? I've done a bit of research but can't get my head around how to start. How is it possible to isolate parts of a fax, e.g. the top table and its cells, from the rest of the page when you can't guarantee absolute placement/size because of the faxing and scanning? Or do I have to collect hundreds of faxes plus the typed-up data for them, compare the two, and let the program slowly learn for itself that the difference between fax A and fax B is a tick here, and that the ID number is usually there?

Any advice is welcome!

Recommended answer

Broadly speaking, you can divide this process into 2 phases:

  1. Determining the location of the text. This sits at the intersection of ML and computer vision, because before the text recognition part you need to find where the text is located. It's not an easy task: you have to find lines, boxes, etc. Look at the OpenCV library, for example; it may be useful for CV-related tasks. If all of your documents have exactly the same form (the location of fields relative to the scanned sheet itself) and you can scan them perfectly, without distortions (rotations, offsets), you can try to search for text in static areas, where the fields are.
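The "static areas" idea can be sketched for the tick boxes from the question. This is only an illustration under strong assumptions: the scan is already deskewed, it is a grayscale image held as a list of rows of 0-255 values, and the box coordinates and field names in `FIELDS` are made up. In practice you would more likely load the scan with OpenCV and work on arrays, and you would need to keep the printed cell borders out of each box so they are not counted as ink.

```python
# Hypothetical layout: each tick box is given as fractions of the page size,
# (left, top, right, bottom). These numbers are invented for illustration.
FIELDS = {
    "cleaned": (0.6, 0.2, 0.7, 0.3),
    "painted": (0.7, 0.2, 0.8, 0.3),
}


def crop(page, box):
    """Return the part of `page` covered by a fractional box."""
    height, width = len(page), len(page[0])
    left, top, right, bottom = box
    rows = page[round(top * height):round(bottom * height)]
    return [row[round(left * width):round(right * width)] for row in rows]


def ink_fraction(cell, dark_below=128):
    """Fraction of pixels in `cell` darker than the `dark_below` threshold."""
    pixels = [value for row in cell for value in row]
    if not pixels:
        return 0.0
    return sum(value < dark_below for value in pixels) / len(pixels)


def read_ticks(page, fields, min_ink=0.05):
    """Map each field name to True if its box holds enough dark pixels."""
    return {
        name: ink_fraction(crop(page, box)) >= min_ink
        for name, box in fields.items()
    }


# Synthetic 100x100 white page with a diagonal stroke in the "cleaned" box.
page = [[255] * 100 for _ in range(100)]
for i in range(10):
    page[20 + i][60 + i] = 0

record = read_ticks(page, FIELDS)
print(record)
```

The mapping from "5th box along" to `cleaned=yes` is then just the lookup table; no learning is needed for that part. The `min_ink` and `dark_below` thresholds are guesses that would have to be tuned on real scans.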

When you have found the text, you have to break the contents of each field into words, then the words into characters; you can then feed these characters to your recognizer (the ML part) and get a label for each character. This is almost impossible (nowadays) for handwritten text, so recognizing handwritten text is hard in the general case. Even if the fields contain only printed text, I recommend that you avoid implementing this step yourself and use a dedicated OCR library, such as tesseract.
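Because the question says the name column only ever holds one of six names, imperfect OCR output for that cell could be snapped to the closest known name instead of being trusted character by character. The sketch below is one possible post-processing step, not part of the original answer: the names in `KNOWN_WORKERS` are invented, the input string stands in for whatever text an OCR library returns, and the `cutoff` value is a guess. It uses only `difflib` from the standard library.

```python
import difflib

# Hypothetical list of the six worker names that can appear on the form.
KNOWN_WORKERS = [
    "Alice Smith",
    "Bob Jones",
    "Carol White",
    "David Brown",
    "Emma Green",
    "Frank Hall",
]


def snap_to_known_name(ocr_text, names=KNOWN_WORKERS, cutoff=0.6):
    """Return the known name most similar to `ocr_text`, or None if none is close."""
    # Compare case-insensitively, but return the name as originally spelled.
    by_lowercase = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(
        ocr_text.strip().lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[matches[0]] if matches else None


# "B0b Jcnes" imitates typical OCR confusions (0 for o, c for o).
print(snap_to_known_name("B0b Jcnes"))
print(snap_to_known_name("zzzzqqqq"))
```

Returning `None` for text that matches nothing lets the program flag that row for manual entry rather than guessing.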
