How best to parse a simple grammar?

Question

Ok, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a question on a broader scale.

I am parsing pre-requisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them.

From the text, I would like to generate a graph of course pre-requisite relationships. (That part will be easy, after I have parsed the data.)

Some example inputs and outputs:

"CS 2110" => ("CS", 2110) # 0

"CS 2110 and INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1
"CS 2110, INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1
"CS 2110, 3300, 3140" => [("CS", 2110), ("CS", 3300), ("CS", 3140)] # 1

"CS 2110 or INFO 3300" => [[("CS", 2110)], [("INFO", 3300)]] # 2

"MATH 2210, 2230, 2310, or 2940" => [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]] # 3  

  0. If the entire description is just a course, it is output directly.
  1. If the courses are conjoined ("and"), they are all output in the same list.
  2. If the courses are disjoined ("or"), they are in separate lists.
  3. Here, we have both "and" and "or".

One caveat that makes it easier: it appears that the nesting of "and"/"or" phrases is never greater than as shown in example 3.

What is the best way to do this? I started with PLY, but I couldn't figure out how to resolve the reduce/reduce conflicts. The advantage of PLY is that it's easy to manipulate what each parse rule generates:

def p_course(p):
    'course : DEPT_CODE COURSE_NUMBER'
    p[0] = (p[1], int(p[2]))

With pyparsing, it's less clear how to modify the output of parseString(). I was considering building upon @Alex Martelli's idea of keeping state in an object and building up the output from that, but I'm not sure exactly how that is best done.

    def addCourse(self, str, location, tokens):
        self.result.append((tokens[0][0], tokens[0][1]))

    def makeCourseList(self, str, location, tokens):
        dept = tokens[0][0]
        new_tokens = [(dept, tokens[0][1])]
        new_tokens.extend((dept, tok) for tok in tokens[1:])

        self.result.append(new_tokens)

For instance, to handle "or" cases:

    def __init__(self):
        self.result = []
        # ...
        self.statement = (course_data + Optional(OR_CONJ + course_data)).setParseAction(self.disjunctionCourses)

    def disjunctionCourses(self, str, location, tokens):
        if len(tokens) == 1:
            return tokens

        print("disjunction tokens: %s" % tokens)

How does disjunctionCourses() know which smaller phrases to disjoin? All it gets is tokens, but what has been parsed so far is stored in result, so how can the function tell which data in result corresponds to which elements of tokens? I guess I could search through the tokens, then find an element of result with the same data, but that feels convoluted...
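One way to sidestep the bookkeeping problem described above is to have every rule return the value it built, so the enclosing rule receives its children's results directly rather than looking them up in shared state. Below is a minimal hand-written sketch in plain Python (it uses neither pyparsing nor PLY). The function names `parse_course_list` and `parse_statement` are hypothetical, and it assumes department codes are uppercase letters and course numbers are all digits.

```python
def parse_course_list(tokens, pos, dept):
    """Rule: one run of courses, up to the next "or".

    Returns (courses, next_pos, dept), so the caller gets the built value
    directly instead of searching for it in a shared result list.
    """
    courses = []
    while pos < len(tokens) and tokens[pos] != "or":
        tok = tokens[pos]
        if tok.isalpha() and tok.isupper():
            dept = tok  # department code, e.g. "CS"
        elif tok.isdigit() and dept is not None:
            courses.append((dept, int(tok)))
        # Anything else (such as "and") is skipped.
        pos += 1
    return courses, pos, dept


def parse_statement(text):
    """Rule: course_list ("or" course_list)*; returns a list of alternatives."""
    tokens = text.replace(",", " ").split()
    alternatives, pos, dept = [], 0, None
    while pos < len(tokens):
        # The department carries over, so "MATH 2210 or 2940" still works.
        courses, pos, dept = parse_course_list(tokens, pos, dept)
        if courses:
            alternatives.append(courses)
        pos += 1  # step over the "or", if there is one
    return alternatives
```

Here `parse_statement("CS 2110 or INFO 3300")` returns `[[("CS", 2110)], [("INFO", 3300)]]`: the "or" rule knows which phrases to disjoin because each one arrives as a return value.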

Also, there are many descriptions that include misc text, like:

"CS 2110 or permission of instructor"
"INFO 3140 or equivalent experience"
"PYSCH 2210 and sophomore standing"

But it isn't critical that I parse that text.
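Since the free text does not need to be parsed, one option is to discard it during tokenization, so the grammar only ever sees course-related tokens. The sketch below uses the standard `re` module. The token shapes (2 to 5 uppercase letters for a department code, 4 digits for a course number) are assumptions drawn from the examples, and `TOKEN_RE` and `course_tokens` are hypothetical names.

```python
import re

# Assumed token shapes: 2-5 uppercase letters for a department code,
# 4 digits for a course number, plus the connectives "and" / "or".
TOKEN_RE = re.compile(r"\b(?:[A-Z]{2,5}|\d{4}|and|or)\b")


def course_tokens(description):
    """Return only the tokens the grammar cares about, dropping free text."""
    tokens = TOKEN_RE.findall(description)
    # A connective left dangling at the end (e.g. "or permission of ...")
    # no longer joins anything, so trim it.
    while tokens and tokens[-1] in ("and", "or"):
        tokens.pop()
    return tokens
```

With this, `course_tokens("CS 2110 or permission of instructor")` gives `['CS', '2110']`. An uppercase abbreviation inside the free text would be mistaken for a department code, so the pattern may need tightening against the real catalog.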

What is a better way to approach this problem?

Recommended answer

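def parse(astr):
    # Commas carry no meaning here, so drop them before splitting on whitespace.
    tokens = astr.replace(',', '').split()
    dept = None
    number = None
    result = []  # list of alternatives, one per "or" group
    option = []  # courses in the current alternative
    for tok in tokens:
        if tok == 'and':
            # "and" joins courses within the current alternative; skip the
            # word itself (a substring replace would also mangle "standing").
            continue
        if tok == 'or':
            # "or" closes the current alternative and starts a new one.
            result.append(option)
            option = []
            continue
        if tok.isalpha():
            # A new department code; wait for its course number.
            dept = tok
            number = None
        else:
            number = int(tok)
        if dept and number:
            option.append((dept, number))
    if option:
        result.append(option)
    return result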

if __name__=='__main__':
    tests=[ ("CS 2110" , [[("CS", 2110)]]),
            ("CS 2110 and INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
            ("CS 2110, INFO 3300" , [[("CS", 2110), ("INFO", 3300)]]),
            ("CS 2110, 3300, 3140", [[("CS", 2110), ("CS", 3300), ("CS", 3140)]]),
            ("CS 2110 or INFO 3300", [[("CS", 2110)], [("INFO", 3300)]]),
            ("MATH 2210, 2230, 2310, or 2940", [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]])]

    for test,answer in tests:
        result=parse(test)
        if result==answer:
            print('GOOD: {0} => {1}'.format(test,answer))
        else:
            print('ERROR: {0} => {1} != {2}'.format(test,result,answer))
            break

This yields:

GOOD: CS 2110 => [[('CS', 2110)]]
GOOD: CS 2110 and INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, INFO 3300 => [[('CS', 2110), ('INFO', 3300)]]
GOOD: CS 2110, 3300, 3140 => [[('CS', 2110), ('CS', 3300), ('CS', 3140)]]
GOOD: CS 2110 or INFO 3300 => [[('CS', 2110)], [('INFO', 3300)]]
GOOD: MATH 2210, 2230, 2310, or 2940 => [[('MATH', 2210), ('MATH', 2230), ('MATH', 2310)], [('MATH', 2940)]]
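The question's end goal is a graph of prerequisite relationships. As a rough sketch of that last step, the list-of-lists shape shown above could be flattened into edges. `prerequisite_edges` is a hypothetical helper, and the target course ("CS", 3110) is a made-up example, not taken from the catalog.

```python
def prerequisite_edges(course, alternatives):
    """Yield (prerequisite, course, group) edges for one parsed description.

    `alternatives` is the list-of-lists shape returned by parse(); `group`
    is the index of the "or" alternative the prerequisite belongs to.
    """
    for group, option in enumerate(alternatives):
        for prereq in option:
            yield (prereq, course, group)


edges = list(prerequisite_edges(
    ("CS", 3110),  # hypothetical course whose description was parsed
    [[("CS", 2110)], [("INFO", 3300)]],
))
```

Keeping the group index on each edge preserves the "or" structure, so a graph consumer can still tell alternative prerequisites apart from required ones.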
