Python kludge to read UCS-2 (UTF-16?) as ASCII


Problem description

I'm in a little over my head on this one, so please pardon my terminology in advance.

I'm running this using Python 2.7 on Windows XP.

I found some Python code that reads a log file, does some stuff, then displays something.

What, that's not enough detail? Ok, here's a simplified version:

#!/usr/bin/python

import re
import sys

class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s* 
                   .+\s+ \| (?#track)
                \s+.+\s+ \| (?#start)
                \s+.+\s+ \| (?#length)
                \s+.+\s+ \| (?#start sec)
                \s+.+\s*$   (?#end sec)
                """, line, re.X):
            lines.next()
            break

    while True:
        line = lines.next()
        m = re.match(r"""
            ^\s*
            (?P<num>\d+)
            \s*\|\s*
            (?P<start_time>[0-9:.]+)
            \s*\|\s*
            (?P<length_time>[0-9:.]+)
            \s*\|\s*
            (?P<start_sector>\d+)
            \s*\|\s*
            (?P<end_sector>\d+)
            \s*$
            """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)

    tracknums = [int(e['num']) for e in eac]
    if range(1,num_tracks+1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s", tracknums)

    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets

f = open(sys.argv[1])

mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))

print mb_toc_urlpart

The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII although that may not be precise/accurate - for e.g. Notepad++ indicates it's ANSI).

However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").

I get the following error:

Traceback (most recent call last):
  File "simple.py", line 55, in <module>
    mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
  File "simple.py", line 49, in calculate_mb_toc_numbers
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range

This log works

This log breaks

I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:

type ascii.log > scrubbed.log

and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).

One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type pipe trick above to a temporary file and then have the script run on that), but I would like to have Python "ignore" the encoding if it's possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.

I'm reading this and this but my eyes are still spinning around in their head, so while that may be my longer term strategy, I'm wondering if there's an interim hack I could use.

Solution

works.log appears to be encoded in ASCII:

>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True

breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:

>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
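For reference, the hard-coded '\xff\xfe' pair is exactly the UTF-16 little-endian byte order mark, which the codecs module exposes as a named constant. A small sketch of the same test (assuming, as above, that the only inputs are ASCII or BOM-prefixed UTF-16LE files):

```python
import codecs

def sniff_encoding(first_two_bytes):
    # codecs.BOM_UTF16_LE is the byte pair '\xff\xfe'.
    if first_two_bytes == codecs.BOM_UTF16_LE:
        # The 'utf-16' codec reads and strips the BOM itself.
        return 'utf-16'
    return 'ascii'
```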

If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:

f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart

to this:

f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))
print mb_toc_urlpart
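An alternative, not from the original answer and so just a sketch: instead of round-tripping the whole file through decode/encode, sniff the BOM and then let io.open do the decoding. The returned file object iterates over newline-terminated lines just like the plain open(sys.argv[1]) did, so the rest of the script can stay unchanged:

```python
import codecs
import io

def open_log(path):
    # Peek at the first two bytes to decide how to decode the file.
    with open(path, 'rb') as f:
        bom = f.read(2)
    encoding = 'utf-16' if bom == codecs.BOM_UTF16_LE else 'ascii'
    # The 'utf-16' codec consumes the BOM, so the first line read
    # is real data. Lines come back as unicode rather than str.
    return io.open(path, encoding=encoding)
```

The unicode lines are fine for this script, since it only matches them with re and converts captured fields with int().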
