My first Python program -- a lexer


Problem Description



Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):

    def __init__( self, source, tokens ):
        # Normalise line endings to "\n" before scanning.
        self.source = re.sub( r"\r?\n|\r\n", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        # Replace each regex string with its compiled pattern.
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                # No token type matched at the current offset.
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

# Test Example

source = r"""
Name: "Thomas", # just a comment
Age: 37
"""

tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER' : r'[+-]?\d+',
    'T_STRING' : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR' : r'[=:,;]',
    'T_NEWLINE' : r'\n',
    'T_LWSP' : r'[ \t]+',
    'T_COMMENT' : r'(?:\#|//).*$' }

print Lexer( source, tokens )

### End ###
Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)

Solution



Thomas Mlynarczyk <th****@mlynarczyk-webdesign.de> writes:

Hello,

I started to write a lexer in Python -- my first attempt to do
something useful with Python (rather than trying out snippets from
tutorials). It is not complete yet, but I would like some feedback --
I'm a Python newbie and it seems that, with Python, there is always a
simpler and better way to do it than you think.

Hi,

Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.

>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)

# Later:

>>> mylexer.tokenise(another_source)

--
Arnaud
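
A minimal sketch of the Lexer reshaped along the lines suggested above, keeping the Python 2 style of the original code. The tokenise() name and the idea of returning the token list come from the suggestion; the remaining details are only illustrative, not from the thread:

import re

class Lexer(object):

    def __init__( self, tokens ):
        # Compile the token regexes once, when the lexer is created.
        self.tokens = dict( ( name, re.compile( regex, re.M ) )
                            for name, regex in tokens.iteritems() )

    def tokenise( self, source ):
        # Scan the whole source and return the token list instead of
        # storing it on the instance, so the same lexer can be reused.
        source = re.sub( r"\r?\n|\r\n", "\n", source )
        offset, line, result = 0, 1, []
        while offset < len( source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( source, offset )
                if not match: continue
                result.append( ( name, match, line ) )
                offset += len( match.group(0) )
                line += match.group(0).count( "\n" )
                break
            else:
                raise Exception(
                    'Syntax error in source at offset %s' % offset )
        return result

With this shape, one Lexer instance can be fed several sources: mylexer = Lexer( tokens ), then mylexer.tokenise( source ) and later mylexer.tokenise( another_source ).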


Arnaud Delobelle wrote:

Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.

>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
>>> mylexer.tokenise(another_source)

At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:

token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
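
A rough sketch of what such a just-in-time interface might look like, again in the Python 2 style of the code above. The next() name and the return-False-on-mismatch behaviour follow the description; the feed() method and the way source, offset and line are kept on the instance are assumptions made for illustration:

import re

class LazyLexer(object):

    def __init__( self, tokens ):
        self.tokens = dict( ( name, re.compile( regex, re.M ) )
                            for name, regex in tokens.iteritems() )

    def feed( self, source ):
        # Remember the source and restart scanning from the beginning.
        self.source = re.sub( r"\r?\n|\r\n", "\n", source )
        self.offset = 0
        self.line = 1

    def next( self, name ):
        # Try only the expected token type at the current position.
        match = self.tokens[name].match( self.source, self.offset )
        if not match:
            return False
        token = ( name, match, self.line )
        self.offset += len( match.group(0) )
        self.line += match.group(0).count( "\n" )
        return token

Used roughly as described above:

mylexer = LazyLexer( tokens )
mylexer.feed( source )
token = mylexer.next( 'T_IDENTIFIER' )
if not token: raise Exception( 'Identifier expected.' )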

But otherwise, upon reflection, I think you are right and it would
indeed be more appropriate to do as you suggest.

Thanks for your feedback.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)

