Pitfalls in my code for detecting text file encoding with Python?

Problem Description

I know more about bicycle repair, chainsaw use and trench safety than I do Python or text encoding; with that in mind...

Python text encoding seems to be a perennial issue (my own question: Searching text files' contents with various encodings with Python?, and others I've read: 1, 2). I've taken a crack at writing some code below to guess the encoding.

In limited testing this code seems to work for my purposes* without me having to know much about the first three bytes of text encodings (the byte order marks) and the situations where those bytes aren't informative.

*My purposes are:

  1. Have a dependency-free snippet I can use with a moderate-high degree of success,
  2. Scan a local workstation for text-based log files of any encoding and identify them as files I am interested in based on their contents (which requires the files to be opened with the proper encoding), and
  3. For the challenge of getting this to work.

Question: What are the pitfalls of using what I assume to be a klutzy method of comparing and counting characters, as I do below? Any input is greatly appreciated.

def guess_encoding_debug(file_path):
    """
    DEBUG - returns many 2 value tuples
    Will return list of all possible text encodings with a count of the number of chars
    read that are common characters, which might be a symptom of success.
    SEE warnings in sister function
    """

    import codecs
    import string
    from operator import itemgetter

    READ_LEN = 1000
    ENCODINGS = ['ascii','cp1252','mac_roman','utf_8','utf_16','utf_16_le',
                 'utf_16_be','utf_32','utf_32_le','utf_32_be']

    #chars in the regular ascii printable set are BY FAR the most common
    #in most files written in English, so their presence suggests the file
    #was decoded correctly.
    nonsuspect_chars = string.printable

    #to be a list of 2 value tuples
    results = []

    for e in ENCODINGS:
        #some encodings will cause an exception with an incompatible file,
        #they are invalid encoding, so use try to exclude them from results[]
        try:
            with codecs.open(file_path, 'r', e) as f:

                #sample from the beginning of the file
                data = f.read(READ_LEN)

                nonsuspect_sum = 0

                #count the number of printable ascii chars in the
                #READ_LEN sized sample of the file
                for n in nonsuspect_chars:
                    nonsuspect_sum += data.count(n)

                #if there are more chars than READ_LEN
                #the encoding is wrong and bloating the data
                if nonsuspect_sum <= READ_LEN:
                    results.append([e, nonsuspect_sum])
        except:
            pass

    #sort results descending based on nonsuspect_sum portion of
    #tuple (itemgetter index 1).
    results = sorted(results, key=itemgetter(1), reverse=True)

    return results


def guess_encoding(file_path):
    """
    Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
    Will return one likely text encoding, though there may be others just as likely.
    WARNING: DO NOT use if your file uses any significant number of characters
             outside the standard ASCII printable characters!
    WARNING: DO NOT use for critical applications, this code will fail you.
    """

    results = guess_encoding_debug(file_path)

    #return the encoding string (second 0 index) from the first
    #result in descending list of encodings (first 0 index)
    return results[0][0]
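
For reference, here is one way to call these (a minimal sketch; 'gps_session.log' is a hypothetical path):

#illustrative usage; 'gps_session.log' is a made-up file path
print(guess_encoding('gps_session.log'))

#or inspect the full ranking from the debug variant
for enc, score in guess_encoding_debug('gps_session.log'):
    print('%s: %d' % (enc, score))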

I am assuming it would be slow compared to chardet, which I am not particularly familiar with, and also less accurate. The way it is designed, any Roman-character-based language that uses accents, umlauts, etc. will not work, at least not well. It will be hard to know when it fails. However, most text in English, including most programming code, is largely written with the characters in string.printable, on which this code depends.
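
For comparison, the chardet route is only a few lines. This is just a sketch (it assumes chardet is installed, which is exactly the dependency I am avoiding here; the path is again hypothetical):

#sketch of the equivalent check with the external chardet library
#detect() takes raw bytes, not decoded text
import chardet

with open('gps_session.log', 'rb') as f:
    raw = f.read(1000)

guess = chardet.detect(raw)
#guess is a dict like {'encoding': 'utf-8', 'confidence': 0.99}
print('%s (confidence %.2f)' % (guess['encoding'], guess['confidence']))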

External libraries may be an option in the future, but for now I want to avoid them because:

  1. This script will be run on multiple company computers, on and off the network, with various versions of Python, so the fewer complications the better. When I say 'company' I mean a small non-profit of social scientists.
  2. I am in charge of collecting the logs from GPS data processing, but I am not the systems administrator - she is not a Python programmer, and the less of her time I take the better.
  3. The installation of Python that is generally available at my company comes with a GIS software package, and is generally better left alone.
  4. My requirements aren't too strict. I just want to identify the files I am interested in and use other methods to copy them to an archive; I am not reading the full contents into memory to manipulate, append to, or rewrite them.
  5. It seems like a high-level programming language should have some way of accomplishing this on its own. While "seems like" is a shaky foundation for any endeavor, I wanted to try and see if I could get it to work (a standard-library sketch of that idea follows this list).
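
On that last point, the standard library can at least handle the byte-order-mark cases with no third-party code. A minimal sketch of that idea (not a complete detector; it returns None when no BOM is present):

#minimal BOM sniffing using only the standard library
import codecs

def sniff_bom(file_path):
    with open(file_path, 'rb') as f:
        head = f.read(4)
    #check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks,
    #because the UTF-32 LE BOM begins with the UTF-16 LE BOM
    bom_table = [
        (codecs.BOM_UTF32_LE, 'utf_32_le'),
        (codecs.BOM_UTF32_BE, 'utf_32_be'),
        (codecs.BOM_UTF8, 'utf_8_sig'),
        (codecs.BOM_UTF16_LE, 'utf_16_le'),
        (codecs.BOM_UTF16_BE, 'utf_16_be'),
    ]
    for bom, name in bom_table:
        if head.startswith(bom):
            return name
    return None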

Solution

Probably the simplest way to find out how well your code works is to take the test suites from the other existing libraries and use those as a base to create your own comprehensive test suite. Then you will know whether your code works for all of those cases, and you can also test for all of the cases you care about.
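
As a minimal starting point, here is a sketch of such a suite using only the standard library (the sample text and the encodings tested are illustrative, and it assumes guess_encoding from above is defined in the same file). Because several guesses can be equally valid for plain ASCII content, it checks that the guessed encoding round-trips the contents rather than insisting on one exact codec name:

#sketch of a self-contained round-trip test for guess_encoding
import codecs
import os
import tempfile
import unittest

SAMPLE = "The quick brown fox jumps over the lazy dog. 0123456789\n" * 40

class GuessEncodingTest(unittest.TestCase):

    def roundtrip(self, encoding):
        fd, path = tempfile.mkstemp()
        try:
            with os.fdopen(fd, 'wb') as f:
                f.write(SAMPLE.encode(encoding))
            guessed = guess_encoding(path)
            #the guess only has to decode the file correctly,
            #not match the exact name the file was written with
            with codecs.open(path, 'r', guessed) as f:
                self.assertEqual(f.read(), SAMPLE)
        finally:
            os.remove(path)

    def test_ascii(self):
        self.roundtrip('ascii')

    def test_utf_8(self):
        self.roundtrip('utf_8')

    def test_utf_16(self):
        self.roundtrip('utf_16')

if __name__ == '__main__':
    unittest.main()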
