Algorithms or Patterns for reading text

Problem description

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.

These companies email the prices to our client each day, and of course the emails are all formatted differently. It is impossible to get any of the companies to change their format - they will not do it.

Some look sort of like this:


    This is example text that could be many lines long...

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59

Others look sort of like this:


    PRODUCT       PRICE    + / -
    ------------  -------- -------
    Location 1
    1             2007.30 +048.20
    2             2022.50 +048.20

    Maybe some multiline text here about a holiday or something...

    Location 2
    1             2017.30 +048.20
    2             2032.50 +048.20

Currently we have individual parsers written for each company's email format. But these formats change slightly, and pretty frequently. We can't count on the prices being on the same row or column each time.

It's trivial for us to look at the emails and determine which price goes with which product at which location. But not so much for our code. So I'm trying to find a more flexible solution and would like your suggestions about what approaches to take. I'm open to anything from regex to neural networks - I'll learn whatever I need to make this work, I just don't know what I need to learn. Is this a lex/parsing problem? More similar to OCR?

The code doesn't have to figure out the formats all on its own. The emails fall into a few main 'styles' like the ones above. We really just need the code to be flexible enough that a new product line or extra whitespace or something doesn't make the file unparsable.

Thanks for any suggestions about where to start.

Recommended answer

I think this problem would be suitable for a proper parser generator. Regular expressions are too difficult to test and debug if they go wrong. However, I would go for a parser generator that is simple to use, as if it were part of the language.

For this type of task I would go with pyparsing: it has the power of a full parser (it builds recursive-descent parsers from plain Python expressions), without a difficult grammar syntax to define, and it comes with very good helper functions. The code is easy to read too.

from pyparsing import Group, OneOrMore, SkipTo, Word, alphas, nums

aaa = """    This is example text that could be many lines long...
             another line

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """

# Skip everything up to the next "Location" marker. In place of "Location"
# this could be any type of match, such as a regular expression.
result = (
    SkipTo("Location").suppress()
    + OneOrMore(Word(alphas) + Word(nums))  # "Location 1", "Product 1", ...
    + OneOrMore(Word(nums + "$."))          # "$20.99", "$21.99", ...
)

# One group per location block.
all_results = OneOrMore(Group(result))

parsed = all_results.parseString(aaa)

for block in parsed:
    print(block)

This returns a list of lists.

['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']

You can group things as you want, but for simplicity I have just returned lists. Whitespace is ignored by default, which makes things a lot simpler.
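As a rough illustration of grouping, here is a minimal sketch that turns the first email style into a nested dictionary keyed by location and product number. The names `location`, `product`, `price`, `block` and `parse_prices` are made up for this example, and it assumes each block really does start with the word "Location" and that every price is prefixed with "$".

```python
from pyparsing import Group, OneOrMore, SkipTo, Suppress, Word, nums

sample = """    This is example text that could be many lines long...

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """

# "Location 1" -> "1"; the keyword itself is dropped from the results.
location = Suppress("Location") + Word(nums)
# "Product 1" -> "1"
product = Suppress("Product") + Word(nums)
# "$20.99" -> 20.99; the parse action converts the matched text to a float.
price = Suppress("$") + Word(nums + ".").setParseAction(lambda t: float(t[0]))

# One block = location number, a group of product numbers, a group of prices.
block = Group(
    SkipTo("Location").suppress()
    + location
    + Group(OneOrMore(product))
    + Group(OneOrMore(price))
)


def parse_prices(text):
    """Return {location: {product: price}} for every block found in text."""
    prices = {}
    for loc, products, values in OneOrMore(block).parseString(text):
        # Pair products with prices by position within the block.
        prices[loc] = dict(zip(products, values))
    return prices


if __name__ == "__main__":
    print(parse_prices(sample))
```

Pairing products and prices by position is itself an assumption about the layout; if a company starts leaving prices out, that pairing step is where a check on the two group lengths would belong.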

I do not know if there are equivalents in other languages.
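The same idea seems to carry over to the second email style in the question. The sketch below is only an assumption about how that could look: `row`, `block` and `parse_table_style` are hypothetical names, and the grammar assumes each data row is a product number, a decimal price and a signed decimal change. Header lines and free text between blocks are skipped rather than parsed.

```python
from pyparsing import Group, OneOrMore, Regex, SkipTo, Suppress, Word, nums

sample2 = """    PRODUCT       PRICE    + / -
    ------------  -------- -------
    Location 1
    1             2007.30 +048.20
    2             2022.50 +048.20

    Maybe some multiline text here about a holiday or something...

    Location 2
    1             2017.30 +048.20
    2             2032.50 +048.20
"""

# "Location 1" -> "1"
location = Suppress("Location") + Word(nums)
# A data row: product number, price, signed change.
row = Group(Word(nums) + Regex(r"\d+\.\d+") + Regex(r"[+-]\d+\.\d+"))
# Skip headers and free text, then read one location and its rows.
block = Group(SkipTo("Location").suppress() + location + Group(OneOrMore(row)))


def parse_table_style(text):
    """Return a list of (location, product, price, change) tuples."""
    records = []
    for loc, rows in OneOrMore(block).parseString(text):
        for product, price, change in rows:
            records.append((loc, product, float(price), float(change)))
    return records


if __name__ == "__main__":
    for record in parse_table_style(sample2):
        print(record)
```

Because rows are recognised by their shape rather than by their line or column position, extra blank lines or a notice between locations should not break the parse, as long as the free text does not itself contain the word "Location".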
