pyparsing one query format to another one

Question

I am at a loss. I have been trying to get this to work for days now. But I am not getting anywhere with this, so I figured I'd consult you guys here and see if someone is able to help me!

I am using pyparsing in an attempt to parse one query format to another one. This is not a simple transformation but actually takes some brains :)

The current query is the following:

("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments] 
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title] 
OR breast cancer[Body - All Words] OR breast cancer[Title] 
OR breast cancer[Abstract] OR breast cancer[Journal]) 
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption] 
OR prevention[Section Title] OR prevention[Body - All Words] 
OR prevention[Title] OR prevention[Abstract])

And using pyparsing I have been able to get the following structure:

[[[['"', 'breast', 'neoplasms', '"'], ['MeSH', 'Terms']], 'or',
[['breast', 'cancer'], ['Acknowledgments']], 'or', [['breast', 'cancer'],
['Figure/Table', 'Caption']], 'or', [['breast', 'cancer'], ['Section', 
'Title']], 'or', [['breast', 'cancer'], ['Body', '-', 'All', 'Words']], 
'or', [['breast', 'cancer'], ['Title']], 'or', [['breast', 'cancer'], 
['Abstract']], 'or', [['breast', 'cancer'], ['Journal']]], 'and', 
[[['prevention'], ['Acknowledgments']], 'or', [['prevention'], 
['Figure/Table', 'Caption']], 'or', [['prevention'], ['Section', 'Title']], 
'or', [['prevention'], ['Body', '-', 'All', 'Words']], 'or', 
[['prevention'], ['Title']], 'or', [['prevention'], ['Abstract']]]]

But now, I am at a loss. I need to format the above output to a lucene search query. Here is a short example on the transformations required:

"breast neoplasms"[MeSH Terms] --> [['"', 'breast', 'neoplasms', '"'], 
['MeSH', 'Terms']] --> mesh terms: "breast neoplasms"

But I am stuck right there. I also need to be able to make use of the special words AND and OR.

So a final query might be: mesh terms: "breast neoplasms" and prevention

Who can help me and give me some hints on how to solve this? Any kind of help would be appreciated.

Since I am using pyparsing, I am bound to Python. I have pasted the code below so that you can play around with it and don't have to start from scratch!

Thanks a lot for your help!

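from pyparsing import (Forward, Group, Literal, OneOrMore, Suppress, Word,
                       ZeroOrMore, alphanums, oneOf)

def PubMedQueryParser():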
    word = Word(alphanums +".-/&§")
    complex_structure = Group(Literal('"') + OneOrMore(word) + Literal('"')) + Suppress('[') + Group(OneOrMore(word)) + Suppress(']')
    medium_structure = Group(OneOrMore(word)) + Suppress('[') + Group(OneOrMore(word)) + Suppress(']')
    easy_structure = Group(OneOrMore(word))
    parse_structure = complex_structure | medium_structure | easy_structure
    operators = oneOf("and or", caseless=True)
    expr = Forward()
    atom = Group(parse_structure) + ZeroOrMore(operators + expr)
    atom2 = Group(Suppress('(') + atom + Suppress(')')) + ZeroOrMore(operators + expr) | atom
    expr << atom2
    return expr

Recommended answer

Well, you have gotten yourself off to a decent start. But from here, it is easy to get bogged down in details of parser-tweaking, and you could be in that mode for days. Let's step through your problem beginning with the original query syntax.

When you start out with a project like this, write a BNF of the syntax you want to parse. It doesn't have to be super rigorous; in fact, here is a start at one based on what I can see from your sample:

word :: Word('a'-'z', 'A'-'Z', '0'-'9', '.-/&§')
field_qualifier :: '[' word+ ']'
search_term :: (word+ | quoted_string) field_qualifier?
and_op :: 'and'
or_op :: 'or'
and_term :: or_term (and_op or_term)*
or_term :: atom (or_op atom)*
atom :: search_term | ('(' and_term ')')

That's pretty close - we have a slight problem with some possible ambiguity between word and the and_op and or_op expressions, since 'and' and 'or' do match the definition of a word. We'll need to tighten this up at implementation time, to make sure that "cancer or carcinoma or lymphoma or melanoma" gets read as 4 different search terms separated by 'or's, not just one big term (which I think is what your current parser would do). We also get the benefit of recognizing precedence of operators - maybe not strictly necessary, but let's go with it for now.

Converting to pyparsing is simple enough:

LBRACK,RBRACK,LPAREN,RPAREN = map(Suppress,"[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
word = Word(alphanums + '.-/&')

field_qualifier = LBRACK + OneOrMore(word) + RBRACK
search_term = ((Group(OneOrMore(word)) | quotedString)('search_text') +
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

To address the ambiguity of 'or' and 'and', we put a negative lookahead at the beginning of word:

word = ~(and_op | or_op) + Word(alphanums + '.-/&')
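
To see what the lookahead changes, here is a small check you can run, assuming pyparsing is installed. The names plain_word and guarded_word are made up just for this comparison and are not part of the parser above:

```python
from pyparsing import CaselessKeyword, OneOrMore, Word, alphanums

and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')

# Without the lookahead, 'or' is just another word.
plain_word = Word(alphanums + '.-/&')
# With the lookahead, a word may not start where a keyword matches.
guarded_word = ~(and_op | or_op) + Word(alphanums + '.-/&')

text = "cancer or carcinoma"
plain_result = OneOrMore(plain_word).parseString(text).asList()
guarded_result = OneOrMore(guarded_word).parseString(text).asList()
print(plain_result)    # ['cancer', 'or', 'carcinoma']
print(guarded_result)  # ['cancer']

# Keywords only match whole words, so 'organ' and 'android' stay ordinary words.
keyword_prefix_result = OneOrMore(guarded_word).parseString("organ android").asList()
print(keyword_prefix_result)  # ['organ', 'android']
```

The guarded version stops at 'or', which leaves the operator for or_term to consume.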

To give some structure to the results, wrap in Group classes:

field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)
search_term = Group(Group(OneOrMore(word) | quotedString)('search_text') +
                          Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = Group(atom + ZeroOrMore(or_op + atom))
and_term = Group(or_term + ZeroOrMore(and_op + or_term))
expr << and_term

Now parsing your sample text with:

res = expr.parseString(test)
from pprint import pprint
pprint(res.asList())

Gives:

[[[[[[['"breast neoplasms"'], ['MeSH', 'Terms']],
     'or',
     [['breast', 'cancer'], ['Acknowledgments']],
     'or',
     [['breast', 'cancer'], ['Figure/Table', 'Caption']],
     'or',
     [['breast', 'cancer'], ['Section', 'Title']],
     'or',
     [['breast', 'cancer'], ['Body', '-', 'All', 'Words']],
     'or',
     [['breast', 'cancer'], ['Title']],
     'or',
     [['breast', 'cancer'], ['Abstract']],
     'or',
     [['breast', 'cancer'], ['Journal']]]]],
  'and',
  [[[[['prevention'], ['Acknowledgments']],
     'or',
     [['prevention'], ['Figure/Table', 'Caption']],
     'or',
     [['prevention'], ['Section', 'Title']],
     'or',
     [['prevention'], ['Body', '-', 'All', 'Words']],
     'or',
     [['prevention'], ['Title']],
     'or',
     [['prevention'], ['Abstract']]]]]]]

Actually, pretty similar to the results from your parser. We could now recurse through this structure and build up your new query string, but I prefer to do this using parsed objects, created at parse time by defining classes as token containers instead of Groups, and then adding behavior to the classes to get our desired output. The distinction is that our parsed object token containers can have behavior that is specific to the kind of expression that was parsed.
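
For comparison, the recursive route could look roughly like the sketch below. It is plain Python with no pyparsing needed, and it assumes the nested-list shape printed above: a search term is a list whose first item is a list of words, and every other list alternates operands and operator strings. The function name to_query and the shortened sample are only illustrative:

```python
def to_query(node):
    """Recursively turn the nested-list parse output into a query string."""
    # Leaf: [[text words...]] or [[text words...], [field words...]]
    if isinstance(node[0], list) and all(isinstance(w, str) for w in node[0]):
        text = ' '.join(node[0])
        if len(node) > 1:
            field = ' '.join(w.lower() for w in node[1])
            return '%s: %s' % (field, text)
        return text
    # Otherwise: operands at even positions, operator strings at odd positions.
    operands = [to_query(child) for child in node[0::2]]
    if len(operands) == 1:
        return operands[0]  # collapse wrapper groups that hold a single child
    joinstr = ' %s ' % node[1].upper()
    return '(%s)' % joinstr.join(operands)

# A shortened sample with the same nesting as the pprint output above.
nested = [[[[[[['"breast neoplasms"'], ['MeSH', 'Terms']],
             'or',
             [['breast', 'cancer'], ['Title']]]]],
           'and',
           [[[[['prevention'], ['Abstract']]]]]]]

query = to_query(nested)
print(query)
```

This works, but every rule about the structure lives in one function that has to guess what kind of node it is looking at, which is the main reason I prefer the class-based approach.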

We'll begin with a base abstract class, ParsedObject, that will take the parsed tokens as its initializing structure. We'll also add an abstract method, queryString, which we'll implement in all the deriving classes to create your desired output:

class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''

Now we can derive from this class, and any subclass can be used as a parse action in defining the grammar.

When we do this, Groups that were added for structure kind of get in our way, so we'll redefine the original parser without them:

search_term = (Group(OneOrMore(word) | quotedString)('search_text') +
               Optional(field_qualifier)('field'))
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

Now we implement the class for search_term, using self.tokens to access the parsed bits found in the input string:

class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            return '%s: %s' % (' '.join(f.lower() 
                                        for f in self.tokens.field[0]),text)
        else:
            return text
search_term.setParseAction(SearchTerm)

Next we'll implement the and_term and or_term expressions. Both are binary operators differing only in their resulting operator string in the output query, so we can just define one class and let them provide a class constant for their respective operator strings:

class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        return joinstr.join(t.queryString() for t in self.tokens[0::2])
class OrOperation(BinaryOperation):
    op = "OR"
class AndOperation(BinaryOperation):
    op = "AND"
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)

Note that pyparsing is a little different from traditional parsers - our BinaryOperation will match "a or b or c" as a single expression, not as the nested pairs "(a or b) or c". So we have to rejoin all of the terms using the stepping slice [0::2].
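
The slice is easy to check on a plain list standing in for the parsed tokens. The list contents here are made up for illustration:

```python
# Flat token list, as produced by matching "a or b or c" in one pass.
tokens = ['a', 'or', 'b', 'or', 'c']

operands = tokens[0::2]   # every second item, starting at index 0
operators = tokens[1::2]  # the items in between

joined = ' OR '.join(operands)
print(operands)   # ['a', 'b', 'c']
print(operators)  # ['or', 'or']
print(joined)     # a OR b OR c
```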

Finally, we add a parse action to reflect any nesting by wrapping all exprs in ()'s:

class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()
expr.setParseAction(Expr)

For your convenience, here is the entire parser in one copy/pastable block:

from pyparsing import *

LBRACK,RBRACK,LPAREN,RPAREN = map(Suppress,"[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
word = ~(and_op | or_op) + Word(alphanums + '.-/&')
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)

search_term = (Group(OneOrMore(word) | quotedString)('search_text') + 
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

# define classes for parsed structure
class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''

class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            return '%s: %s' % (' '.join(f.lower() 
                                        for f in self.tokens.field[0]),text)
        else:
            return text
search_term.setParseAction(SearchTerm)

class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        return joinstr.join(t.queryString() 
                                for t in self.tokens[0::2])
class OrOperation(BinaryOperation):
    op = "OR"
class AndOperation(BinaryOperation):
    op = "AND"
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)

class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()
expr.setParseAction(Expr)


test = """("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments]  
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title]  
OR breast cancer[Body - All Words] OR breast cancer[Title]  
OR breast cancer[Abstract] OR breast cancer[Journal])  
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption]  
OR prevention[Section Title] OR prevention[Body - All Words]  
OR prevention[Title] OR prevention[Abstract])"""

res = expr.parseString(test)[0]
print(res.queryString())

Prints the following:

((mesh terms: "breast neoplasms" OR acknowledgments: breast cancer OR 
  figure/table caption: breast cancer OR section title: breast cancer OR 
  body - all words: breast cancer OR title: breast cancer OR 
  abstract: breast cancer OR journal: breast cancer) AND 
 (acknowledgments: prevention OR figure/table caption: prevention OR 
  section title: prevention OR body - all words: prevention OR 
  title: prevention OR abstract: prevention))

I'm guessing you'll need to tighten up some of this output - those lucene tag names look very ambiguous - I was just following your posted sample. But you shouldn't have to change the parser much, just adjust the queryString methods of the attached classes.
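
As one possible tightening, here is a sketch of a formatter that SearchTerm.queryString could delegate to. It rests on two assumptions that you would have to check against your own index: that the fields are named in lower_snake_case, and that multi-word text should be searched as a quoted phrase (in Lucene's classic query syntax, a field prefix only applies to the term directly after it). The helper name lucene_term is hypothetical:

```python
import re

def lucene_term(field_words, text_words):
    """Hypothetical formatter: build field:text from the parsed word lists."""
    # Assumption: index fields are lower_snake_case, e.g. 'figure_table_caption'.
    field = re.sub(r'[^a-z0-9]+', '_', ' '.join(field_words).lower()).strip('_')
    text = ' '.join(text_words)
    # Quote multi-word text so the whole phrase stays attached to the field.
    if ' ' in text and not text.startswith('"'):
        text = '"%s"' % text
    return '%s:%s' % (field, text)

print(lucene_term(['Figure/Table', 'Caption'], ['breast', 'cancer']))
print(lucene_term(['Body', '-', 'All', 'Words'], ['prevention']))
print(lucene_term(['MeSH', 'Terms'], ['"breast neoplasms"']))
```

If your field names do not follow a mechanical rule like this, a lookup dict from the PubMed qualifier to the Lucene field name would be the safer choice.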

As an added exercise to the poster: add support for NOT boolean operator in your query language.
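
If you want a starting point for that exercise, one possible approach is sketched below. It is not the only way to do it. The names not_op, negated, not_term and NotOperation are additions of mine, and I am assuming NOT should bind more tightly than OR and AND:

```python
from pyparsing import (CaselessKeyword, Forward, Group, OneOrMore, Optional,
                       Suppress, Word, ZeroOrMore, alphanums, quotedString)

LBRACK, RBRACK, LPAREN, RPAREN = map(Suppress, "[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
not_op = CaselessKeyword('not')
# 'not' joins the keywords that a plain word is not allowed to match.
word = ~(and_op | or_op | not_op) + Word(alphanums + '.-/&')
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)
search_term = (Group(OneOrMore(word) | quotedString)('search_text') +
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
negated = not_op + atom        # new rule: NOT applies to a single atom
not_term = negated | atom
or_term = not_term + ZeroOrMore(or_op + not_term)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens

class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            field = ' '.join(f.lower() for f in self.tokens.field[0])
            return '%s: %s' % (field, text)
        return text

class NotOperation(ParsedObject):
    def queryString(self):
        # tokens are ['not', operand]
        return 'NOT %s' % self.tokens[1].queryString()

class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        return joinstr.join(t.queryString() for t in self.tokens[0::2])

class OrOperation(BinaryOperation):
    op = "OR"

class AndOperation(BinaryOperation):
    op = "AND"

class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()

search_term.setParseAction(SearchTerm)
negated.setParseAction(NotOperation)
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)
expr.setParseAction(Expr)

result = expr.parseString('cancer[Title] and not prevention[Abstract]')[0].queryString()
print(result)
```

Attaching the parse action to negated rather than to not_term means plain atoms pass through untouched, so the existing classes do not need to know that NOT exists.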
