Removing comments... tokenize error


Problem description


In analysing a very big application (pysol), made of almost 100 source files, I needed to remove the comments.

Removing the comments that take up a whole line is straightforward...

For the embedded (end-of-line) comments, instead, I used the tokenize module.

To my surprise, the analysed output is different from the input (the last tuple element should exactly replicate the input line). The error shows up at triple-quoted strings. I don't know if this has already been corrected (I use Python 2.3) or perhaps it is a mistake on my part...

Below is the script I use to reproduce the strange behaviour:

import tokenize

Input = "pippo1"
Output = "pippo2"

f = open(Input)
fOut = open(Output, "w")

nLastLine = 0
for i in tokenize.generate_tokens(f.readline):
    if nLastLine != (i[2])[0]:   # the 3rd element of the tuple is
        nLastLine = (i[2])[0]    # (startingRow, startingCol)
        fOut.write(i[4])

f.close()
fOut.close()

The file to be used (pippo1) contains an extract:

class SelectDialogTreeData:
    img = None
    def __init__(self):
        self.tree_xview = (0.0, 1.0)
        self.tree_yview = (0.0, 1.0)
        if self.img is None:
            SelectDialogTreeData.img = (makeImage(dither=0, data="""
R0lGODlhEAAOAPIFAAAAAICAgMDAwP//AP///4AAAAAAAAAAACH5BAEAAAUALAAAAAAQAA4AAAOL
WLrcGxA6FoYYYoRZwhCDMAhDFCkBoa6sGgBFQAzCIAzCIAzCEA CFAEEwEAwEA8FAMBAEAIUAYSAY
CAaCgWAgGAQAhQBBMBAMBAPBQDAQBACFAGEgGAgGgoFgIBgEAA UBBAIDAgMCAwIDAgMCAQAFAQQD
AgMCAwIDAgMCAwEABSaiogAKAKeoqakFCQA7"""), makeImage(dither=0, data="""
R0lGODlhEAAOAPIFAAAAAICAgMDAwP//AP///4AAAAAAAAAAACH5BAEAAAUALAAAAAAQAA4AAAN3
WLrcHBA6Foi1YZZAxBCDQESREhCDMAiDcFkBUASEMAiDMAiDMA gBAGlIGgQAgZeSEAAIAoAAQTAQ
DAQDwUAwAEAAhQBBMBAMBAPBQBAABACFAGEgGAgGgoFgIAAEAA oBBAMCAwIDAgMCAwEAAApERI4L
jpWWlgkAOw=="""), makeImage(dither=0, data="""
R0lGODdhEAAOAPIAAAAAAAAAgICAgMDAwP///wAAAAAAAAAAACwAAAAAEAAOAAADTii63DowyiiA
GCHrnQUQAxcQAAEQgAAIg+MCwkDMdD0LgDDUQG8LAMGg1gPYBA DBgFbs1QQAwYDWBNQEAMHABrAR
BADBwOsVAFzoqlqdAAA7"""), makeImage(dither=0, data="""
R0lGODdhEAAOAPIAAAAAAAAAgICAgMDAwP8AAP///wAAAAAAACwAAAAAEAAOAAADVCi63DowyiiA
GCHrnQUQAxcUQAEUgAAIg+MCwlDMdD0LgDDQBE3UAoBgUCMUCD YBQDCwEWwFAUAwqBEKBJsAIBjQ
CDRCTQAQDKBQAcDFBrjf8Lg7AQA7"""))

The output of tokenize (pippo2) gives instead:

class SelectDialogTreeData:
    img = None
    def __init__(self):
        self.tree_xview = (0.0, 1.0)
        self.tree_yview = (0.0, 1.0)
        if self.img is None:
            SelectDialogTreeData.img = (makeImage(dither=0, data="""
AgMCAwIDAgMCAwEABSaiogAKAKeoqakFCQA7"""), makeImage(dither=0, data="""
jpWWlgkAOw=="""), makeImage(dither=0, data="""
BADBwOsVAFzoqlqdAAA7"""), makeImage(dither=0, data="""
CDRCTQAQDKBQAcDFBrjf8Lg7AQA7"""))

... with a big difference! Why?

Recommended answer

"qwweeeit" <qw******@yahoo.it> wrote:
I don't know if this has already been corrected (I use Python 2.3)
or perhaps it is a mistake on my part...

It's a mistake on your part. Adding a print statement to the for-loop
might help you figure it out:

nLastLine = 0
for i in tokenize.generate_tokens(f.readline):
    print i
    if nLastLine != (i[2])[0]:   # the 3rd element of the tuple is
        nLastLine = (i[2])[0]    # (startingRow, startingCol)
        fOut.write(i[4])







(hints: what happens if a token spans multiple lines? and how does
the tokenize module deal with comments?)

</F>
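
To make the hint concrete, here is a minimal sketch in the same Python 2 style as the code above: it tokenizes a small triple-quoted string and prints each token. The STRING token starts on row 1 but ends on row 3, and its last tuple element carries every physical line it spans; no other token starts on the middle rows, so a script that only writes i[4] when the starting row changes silently drops them, which is exactly the effect seen in pippo2.

import tokenize, StringIO

source = 'x = """line one\nline two\nline three"""\ny = 1\n'
for i in tokenize.generate_tokens(StringIO.StringIO(source).readline):
    # i = (type, text, (srow, scol), (erow, ecol), source line(s) of the token)
    print tokenize.tok_name[i[0]], i[2], i[3], repr(i[4])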


Thanks! If you answer my posts one more time I could consider you as
my tutor...

It was strange to have found a bug...! In any case I will not go deeper
into the matter, because your explanation is enough for me.
I corrected the problem by removing by hand the tokens spanning multiple lines
(there were only 8 cases...).

However, I haven't understood your hint about comments...
I succeeded in writing a Python script which removes comments.

Here it is (in all its cumbersome and cryptic appearance!...):

# removeCommentsTok.py
import tokenize

Input = "pippo1"
Output = "pippo2"

f = open(Input)
fOut = open(Output, "w")

nLastLine = 0
for i in tokenize.generate_tokens(f.readline):
    if i[0] == 52 and nLastLine != (i[2])[0]:
        fOut.write((i[4].replace(i[1], '')).rstrip() + '\n')
        nLastLine = (i[2])[0]
    elif i[0] == 4 and nLastLine != (i[2])[0]:
        fOut.write(i[4])
        nLastLine = (i[2])[0]

f.close()
fOut.close()

Some explanations for the guys like me...:
- 52 and 4 are the numeric token codes for COMMENT and NEWLINE respectively on my Python version (see the sketch below for a version that uses the symbolic names instead)
- the comment removal is obtained by deleting the comment text (i[1]) from the input line (i[4])
- I also right-trimmed the line to get rid of the remaining blanks.
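
A note on those magic numbers: the token and tokenize modules export symbolic constants, so the same loop can be written without hard-coding 52 and 4. A minimal sketch of that variant, keeping the same pippo1/pippo2 file names as above (the tuple unpacking is only for readability):

# removeCommentsTok.py, rewritten with symbolic token names
# (a sketch of the same idea, not the poster's original script)
import token, tokenize

f = open("pippo1")
fOut = open("pippo2", "w")

nLastLine = 0
for tok in tokenize.generate_tokens(f.readline):
    toktype, toktext, (srow, scol), (erow, ecol), line = tok
    if toktype == tokenize.COMMENT and nLastLine != srow:
        # drop the comment text from the source line, then right-trim it
        fOut.write(line.replace(toktext, '').rstrip() + '\n')
        nLastLine = srow
    elif toktype == token.NEWLINE and nLastLine != srow:
        fOut.write(line)
        nLastLine = srow

f.close()
fOut.close()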






The tokenizer sends multiline strings and comments as a single token.

######################################################################
# python comment and whitespace stripper :)
######################################################################

import keyword, os, sys, traceback
import StringIO
import token, tokenize

__credits__ = 'just another tool that I needed'
__version__ = '.7'
__author__ = 'M.E.Farmer'
__date__ = 'Jan 15 2005, Oct 24 2004'

######################################################################

class Stripper:
    """python comment and whitespace stripper :)
    """
    def __init__(self, raw):
        self.raw = raw

    def format(self, out=sys.stdout, comments=0, spaces=1,
               untabify=1, eol='unix'):
        ''' Strip comments, strip extra whitespace,
            convert EOL's from Python code.
        '''
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        # Strips the first blank line if 1
        self.lasttoken = 1
        self.temp = StringIO.StringIO()
        self.spaces = spaces
        self.comments = comments

        if untabify:
            self.raw = self.raw.expandtabs()
        self.raw = self.raw.rstrip() + ' '
        self.out = out

        self.raw = self.raw.replace('\r\n', '\n')
        self.raw = self.raw.replace('\r', '\n')
        self.lineend = '\n'

        # Gather lines
        while 1:
            pos = self.raw.find(self.lineend, pos) + 1
            if not pos: break
            self.lines.append(pos)

        self.lines.append(len(self.raw))
        # Wrap text in a file-like object
        self.pos = 0

        text = StringIO.StringIO(self.raw)

        # Parse the source.
        ## Tokenize calls the __call__
        ## function for each token till done.
        try:
            tokenize.tokenize(text.readline, self)
        except tokenize.TokenError, ex:
            traceback.print_exc()

        # Ok now we write it to a file
        # but we also need to clean the whitespace
        # between the lines and at the ends.
        self.temp.seek(0)

        # Mac CR
        if eol == 'mac':
            self.lineend = '\r'
        # Windows CR LF
        elif eol == 'win':
            self.lineend = '\r\n'
        # Unix LF
        else:
            self.lineend = '\n'

        for line in self.temp.readlines():
            if spaces == -1:
                self.out.write(line.rstrip() + self.lineend)
            else:
                if not line.isspace():
                    self.lasttoken = 0
                    self.out.write(line.rstrip() + self.lineend)
                else:
                    self.lasttoken += 1
                    if self.lasttoken <= self.spaces and self.spaces:
                        self.out.write(self.lineend)

    def __call__(self, toktype, toktext,
                 (srow, scol), (erow, ecol), line):
        ''' Token handler.
        '''
        # calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        # kill the comments
        if not self.comments:
            # Kill the comments ?
            if toktype == tokenize.COMMENT:
                return

        # handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.temp.write(self.lineend)
            return

        # send the original whitespace, if needed
        if newpos > oldpos:
            self.temp.write(self.raw[oldpos:newpos])

        # skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # send text to the temp file
        self.temp.write(toktext)
        return

######################################################################

def Main():
    import sys
    if sys.argv[1]:
        filein = open(sys.argv[1]).read()
        Stripper(filein).format(out=sys.stdout, comments=1, untabify=1,
                                eol='win')

######################################################################

if __name__ == '__main__':
    Main()

M.E.Farmer
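
For reference, a minimal sketch of driving the Stripper class above directly instead of through Main(); the module name stripper.py is only an assumption about where the code is saved, and pippo1 is the example file from this thread. Note that comments are stripped only when the comments argument is false, as in the __call__ handler above.

import sys
# assuming the code above has been saved as stripper.py
from stripper import Stripper

source = open('pippo1').read()
Stripper(source).format(out=sys.stdout, comments=0, spaces=1,
                        untabify=1, eol='unix')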

