Extracting comments from Python Source Code


Question

I'm trying to write a program to extract comments from code that the user enters. I tried to use a regex, but found it difficult to write.

Then I found a post here. The answer suggests using tokenize.generate_tokens to analyze the grammar, but the documentation says:

The generate_tokens() generator requires one argument, readline, which must be a callable object which provides the same interface as the readline() method of built-in file objects (see section File Objects).

But a string object does not have a readline method.

Then I found another post here, suggesting the use of StringIO.StringIO to get a readline method. So I wrote the following code:

import tokenize
import io
import StringIO

def extract(code):
    res = []
    comment = None
    stringio = StringIO.StringIO(code)
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
        # print(toknum,tokval)
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            print tokenize.untokenize(toktype)
    return tokenize.untokenize(res)

And fed it the following input: extract('a = 1+2#A Comment')

But got:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ext.py", line 10, in extract
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio):
  File "C:\Python27\lib\tokenize.py", line 294, in generate_tokens
    line = readline()
AttributeError: StringIO instance has no __call__ method

I know I can write a new class, but is there any better solution?

Answer

Answer for more general cases (extracting from modules, functions):

Modules:

The documentation specifies that one needs to provide a callable which exposes the same interface as the readline() method of built-in file objects. This hints at the solution: create an object that provides that method.

In the case of a module, we can just open the module as a normal file and pass in its readline method. This is the key: the argument you pass is the method readline() itself.

Given a small scrpt.py file with:

# My amazing foo function.
def foo():
    """ docstring """
    # I will print
    print "Hello"
    return 0   # Return the value

# Maaaaaaain
if __name__ == "__main__":
    # this is main
    print "Main" 

We will open it as we do all files:

fileObj = open('scrpt.py', 'r')

This file object now has a method called readline (because it is a file object) which we can safely pass to tokenize.generate_tokens and create a generator.

tokenize.generate_tokens (simply tokenize.tokenize in Py3 -- note: Python 3 requires readline to return bytes, so you'll need to open the file in 'rb' mode) returns named tuples containing information about the tokenized elements. Here's a small demo:

for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
    # we can also use token.tok_name[toktype] instead of 'COMMENT'
    # from the token module 
    if toktype == tokenize.COMMENT:
        print 'COMMENT' + " " + tok

Notice how we pass the fileObj.readline method to it. This will now print:

COMMENT # My amazing foo function.
COMMENT # I will print
COMMENT # Return the value
COMMENT # Maaaaaaain
COMMENT # this is main 

So all comments, regardless of position, are detected. Docstrings, of course, are excluded: the tokenizer sees them as STRING tokens, not COMMENT tokens.
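
For reference, here is a minimal Python 3 sketch of the same loop, assuming the same scrpt.py file; per the note above, tokenize.tokenize replaces generate_tokens and wants a readline that returns bytes:

import tokenize

# Python 3: tokenize.tokenize requires a readline returning bytes,
# hence the 'rb' mode.
with open('scrpt.py', 'rb') as fileObj:
    for tok in tokenize.tokenize(fileObj.readline):
        if tok.type == tokenize.COMMENT:
            print('COMMENT' + ' ' + tok.string)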

You could achieve a similar result without open, for cases I can't really think of. Nonetheless, I'll present another way of doing it for completeness' sake. In this scenario you'll need two additional modules, inspect and StringIO (io.StringIO in Python 3):

Say you have the following function:

def bar():
    # I am bar
    print "I really am bar"
    # bar bar bar baaaar
    # (bar)
    return "Bar"

You need a file-like object which has a readline method to use it with tokenize. Well, you can create a file-like object from a str using StringIO.StringIO, and you can get a str representing the source of a function with inspect.getsource(func). In code:

funcText = inspect.getsource(bar)
funcFile = StringIO.StringIO(funcText)

Now we have a file-like object representing the function, with the wanted readline method. We can just reuse the loop we performed earlier, replacing fileObj.readline with funcFile.readline. The output we get is of a similar nature:

COMMENT # I am bar
COMMENT # bar bar bar baaaar
COMMENT # (bar)
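
A minimal Python 3 sketch of the same idea, assuming bar is defined in a module on disk (inspect.getsource needs to be able to read the source file):

import inspect
import io
import tokenize

funcText = inspect.getsource(bar)
funcFile = io.StringIO(funcText)   # io.StringIO replaces StringIO.StringIO
# generate_tokens accepts a readline that returns str
for tok in tokenize.generate_tokens(funcFile.readline):
    if tok.type == tokenize.COMMENT:
        print('COMMENT' + ' ' + tok.string)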


As an aside, if you really want to create a custom way of doing this with re, take a look at the source of the tokenize.py module. It defines patterns for comments (r'#[^\r\n]*'), names, et cetera, loops through the lines with readline, and searches within each line for the patterns. Thankfully, it's not too complex after you look at it for a while :-).
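
As a rough illustration only, here is a deliberately naive sketch built on that same pattern; unlike tokenize, it will also match '#' characters inside string literals, so treat it as a toy:

import re

# The comment pattern tokenize.py defines internally.
COMMENT_RE = re.compile(r'#[^\r\n]*')

def naive_comments(code):
    # Scans the raw text; cannot tell a real comment from a '#'
    # inside a string literal.
    return COMMENT_RE.findall(code)

print(naive_comments('a = 1 + 2  # a comment\ns = "#not a comment"'))
# ['# a comment', '#not a comment']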

You've created an object with StringIO that provides the interface, but you haven't passed that interface (readline) to tokenize.generate_tokens; instead, you passed the full object (stringio).

Additionally, your else clause will raise a TypeError, because untokenize expects an iterable of tokens as input. With the following changes, your function works fine:

def extract(code):
    res = []
    stringio = StringIO.StringIO(code)
    # pass stringio.readline (the method itself) to generate_tokens
    for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio.readline):
        if toktype != tokenize.COMMENT:
            res.append((toktype, tokval))
        else:
            # wrap the (toktype, tokval) tuple in a list
            print tokenize.untokenize([(toktype, tokval)])
    return tokenize.untokenize(res)

Supplied with input of the form expr = extract('a=1+2#A comment'), the function will print out the comment and retain the expression in expr:

>>> expr = extract('a=1+2#A comment')
#A comment
>>> expr
'a =1 +2 '

Note the spacing in the result: untokenize receives only (type, string) 2-tuples here, with no position information, so it cannot reproduce the original spacing exactly. Furthermore, as mentioned earlier, io houses StringIO in Python 3, so there the separate StringIO import is thankfully not required.
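
For completeness, a sketch of the corrected function in Python 3, assuming only the standard library; io.StringIO takes the place of StringIO.StringIO and print becomes a function:

import io
import tokenize

def extract(code):
    res = []
    # pass the readline method, not the StringIO object itself
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.COMMENT:
            print(tokenize.untokenize([(tok.type, tok.string)]))
        else:
            res.append((tok.type, tok.string))
    return tokenize.untokenize(res)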

