Python - lexical analysis and tokenization

Views: 19 | Published: 2021/9/9 19:16:04 | Tags: python, transform, lexical-analysis


Question

I'm hoping to speed up my discovery process here, as this is my first venture into the world of lexical analysis. Maybe this is even the wrong path. First, I'll describe my problem:

I've got very large properties files (on the order of 1,000 properties) which, when distilled, come down to about 15 important properties; the rest can be generated or rarely ever change.

For example:

general {
  name = myname
  ip = 127.0.0.1
}

component1 {
   key = value
   foo = bar
}

This is the type of format I want to create, so that I can tokenize something like:

property.${general.name}blah.home.directory = /blah
property.${general.name}.ip = ${general.ip}
property.${component1}.ip = ${general.ip}
property.${component1}.foo = ${component1.foo}

into:

property.mynameblah.home.directory = /blah
property.myname.ip = 127.0.0.1
property.component1.ip = 127.0.0.1
property.component1.foo = bar

Lexical analysis and tokenization sound like my best route, but this is a very simple form of it. It's a simple grammar and a simple substitution, and I'd like to make sure that I'm not bringing a sledgehammer to knock in a nail.

I could create my own lexer and tokenizer, or ANTLR is a possibility, but I don't like reinventing the wheel and ANTLR sounds like overkill.

I'm not familiar with compiler techniques, so pointers in the right direction and code would be most appreciated.

Note: I can change the input format.

Recommended answer

There's an excellent article on Using Regular Expressions for Lexical Analysis at effbot.org.

Adapting the tokenizer to your problem:

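import re

# One named group per token type; alternatives are tried left to right.
token_pattern = r"""
(?P<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
|(?P<integer>[0-9]+)
|(?P<dot>\.)
|(?P<open_variable>[$][{])
|(?P<open_curly>[{])
|(?P<close_curly>[}])
|(?P<newline>\n)
|(?P<whitespace>\s+)
|(?P<equals>[=])
|(?P<slash>[/])
"""
# 'newline' is listed before 'whitespace' because \s would also match '\n'.

# re.VERBOSE lets the pattern span several lines.
token_re = re.compile(token_pattern, re.VERBOSE)


class TokenizerException(Exception):
    pass


def tokenize(text):
    """Yield (token_name, token_value) pairs for text."""
    pos = 0
    while True:
        m = token_re.match(text, pos)
        if not m:
            break
        pos = m.end()
        tokname = m.lastgroup  # name of the group that matched
        tokvalue = m.group(tokname)
        yield tokname, tokvalue
    # Stopping before the end means some character matched no token.
    if pos != len(text):
        raise TokenizerException('tokenizer stopped at pos %r of %r' % (
            pos, len(text)))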

To test it, we do:

stuff = r'property.${general.name}.ip = ${general.ip}'
stuff2 = r'''
general {
  name = myname
  ip = 127.0.0.1
}
'''

# print() calls so that the test runs under Python 3
print(' stuff '.center(60, '='))
for tok in tokenize(stuff):
    print(tok)

print(' stuff2 '.center(60, '='))
for tok in tokenize(stuff2):
    print(tok)

This produces:

========================== stuff ===========================
('identifier', 'property')
('dot', '.')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'name')
('close_curly', '}')
('dot', '.')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'ip')
('close_curly', '}')
========================== stuff2 ==========================
('newline', '\n')
('identifier', 'general')
('whitespace', ' ')
('open_curly', '{')
('newline', '\n')
('whitespace', '  ')
('identifier', 'name')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('identifier', 'myname')
('newline', '\n')
('whitespace', '  ')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('integer', '127')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '1')
('newline', '\n')
('close_curly', '}')
('newline', '\n')
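
One possible next step, which is not part of the original answer, is to feed this token stream into a small parser and then perform the substitution from the question. The sketch below is only an illustration under a few assumptions: the names parse_sections and expand are made up here, the rule that a bare ${component1} expands to the section name is inferred from the example in the question, and values that contain spaces are not preserved. The tokenizer is repeated so that the block runs on its own.

```python
import re

# Same token pattern as in the answer, repeated so the sketch runs on its own.
TOKEN_RE = re.compile(r"""
(?P<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
|(?P<integer>[0-9]+)
|(?P<dot>\.)
|(?P<open_variable>[$][{])
|(?P<open_curly>[{])
|(?P<close_curly>[}])
|(?P<newline>\n)
|(?P<whitespace>\s+)
|(?P<equals>[=])
|(?P<slash>[/])
""", re.VERBOSE)


def tokenize(text):
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError('cannot tokenize at position %d' % pos)
        pos = m.end()
        yield m.lastgroup, m.group()


def parse_sections(text):
    """Hypothetical helper: read `name { key = value }` blocks into a dict."""
    sections = {}
    current = None  # name of the section being read, or None outside a block
    line = []       # non-whitespace tokens seen so far on the current line
    for kind, value in tokenize(text + '\n'):
        if '\n' in value:
            # End of line: store "key = value" if we are inside a section.
            kinds = [k for k, _ in line]
            if current is not None and 'equals' in kinds:
                i = kinds.index('equals')
                key = ''.join(v for _, v in line[:i])
                sections[current][key] = ''.join(v for _, v in line[i + 1:])
            line = []
        elif kind == 'whitespace':
            continue
        elif kind == 'open_curly':
            current = ''.join(v for _, v in line)
            sections[current] = {}
            line = []
        elif kind == 'close_curly':
            current = None
            line = []
        else:
            line.append((kind, value))
    return sections


def expand(line, sections):
    """Hypothetical helper: replace ${section.key} with its value and
    ${section} with the section name (rule assumed from the question)."""
    out = []
    ref = None  # pieces of the reference being read, or None outside ${...}
    for kind, value in tokenize(line):
        if kind == 'open_variable':
            ref = []
        elif kind == 'close_curly' and ref is not None:
            section, _, key = ''.join(ref).partition('.')
            values = sections[section]  # KeyError if the section is unknown
            out.append(values[key] if key else section)
            ref = None
        elif ref is not None:
            ref.append(value)
        else:
            out.append(value)
    return ''.join(out)


CONFIG = """
general {
  name = myname
  ip = 127.0.0.1
}

component1 {
   key = value
   foo = bar
}
"""

sections = parse_sections(CONFIG)
print(expand('property.${general.name}.ip = ${general.ip}', sections))
print(expand('property.${component1}.foo = ${component1.foo}', sections))
```

With the configuration from the question, the two calls should print the second and fourth lines of the desired output. For a substitution this simple, a single re.sub over ${...} might be enough; going through tokens mainly pays off if the format later grows nesting or quoting.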
