Searching text files' contents with various encodings with Python?


Problem Description




I am having trouble with variable text encoding when opening text files to find a match in the files' contents.

I am writing a script to scan the file system for log files with specific contents in order to copy them to an archive. The names are often changed, so the contents are the only way to identify them. I need to identify *.txt files and find within their contents a string that is unique to these particular log files.

I have the code below that mostly works. The problem is the logs may have their encoding changed if they are opened and edited. In this case, Python won't match the search term to the contents because the contents are garbled when Python uses the wrong encoding to open the file.

import os
import codecs

#Filepaths to search
FILEPATH = "SomeDrive:\\SomeDirs\\"

#Text to match in file names
MATCH_CONDITION = ".txt"

#Text to match in file contents
MATCH_CONTENT = "--------Base Data Details:--------------------"

for root, dirs, files in os.walk(FILEPATH):
    for f in files:
        if MATCH_CONDITION in f:
            print "Searching: "  + os.path.join(root,f)

            #ATTEMPT A -
            #matches only text files re-encoded as ANSI,
            #UTF-8, UTF-8 no BOM

            #search_file = open(os.path.join(root,f), 'r')

            #ATTEMPT B -
            #matches text files output from Trimble software
            #"UCS-2 LE w/o BOM", also "UCS-2 Little Endian" -
            #(same file resaved using Windows Notepad),

            search_file = codecs.open(os.path.join(root,f), 'r', 'utf_16_le')


            file_data = search_file.read()

            if MATCH_CONTENT in file_data:
                print "CONTENTS MATCHED: " + f

            search_file.close()

I can open the files in Notepad++, which detects the encoding. Using the regular file.open() Python command does not automatically detect the encoding. I can use codecs.open and specify the encoding to catch a single encoding, but then I have to write excess code to catch the rest. I've read the Python codecs module documentation and it seems to be devoid of any automatic detection.

What options do I have to concisely and robustly search any text file with any encoding?

I've read about the chardet module, which seems good but I really need to avoid installing modules. Anyway, there must be a simpler way to interact with the ancient and venerable text file. Surely as a newb I am making this too complicated, right?

Python 2.7.2, Windows 7 64-bit. Probably not necessary, but here is a sample log file.

EDIT: As far as I know the files will almost surely be in one of the encodings in the code comments: ANSI, UTF-8, UTF_16_LE (as UCS-2 LE w/o BOM; UCS-2 Little Endian). There is always the potential for someone to find a way around my expectations...

EDIT: While using an external library is certainly the sound approach, I've taken a stab at writing some amateurish code to guess the encoding and solicited feedback in another question -> Pitfalls in my code for detecting text file encoding with Python?

Solution

The chardet package exists for a reason (and was ported from some older Netscape code, for a similar reason): detecting the encoding of an arbitrary text file is tricky.

There are two basic alternatives:

  1. Use some hard-coded rules to determine whether a file has a certain encoding. For example, you could look for the UTF byte-order mark (BOM) at the beginning of the file (see the sketch after this list). This breaks for encodings that overlap significantly in their use of different bytes, or for files that don't happen to use the "marker" bytes that your detection rules use.

  2. Take a database of files in known encodings and count up the distributions of different bytes (and byte pairs, triplets, etc.) in each of the encodings. Then, when you have a file of unknown encoding, take a sample of its bytes and see which pattern of byte usage is the best match. This breaks when you have short test files (which makes the frequency estimates inaccurate), or when the usage of the bytes in your test file doesn't match the usage in the file database you used to build up your frequency data.
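
To make point 1 concrete, a minimal BOM-sniffing sketch using just the standard library might look like this (the sniff_bom name is arbitrary); it can only succeed when a BOM is actually present, which is exactly the limitation described above:

import codecs

def sniff_bom(path):
    """Return a codec name if the file starts with a known BOM, else None."""
    with open(path, 'rb') as fh:          # read raw bytes, no decoding
        head = fh.read(4)
    # Check the longer UTF-32 BOMs first: the UTF-16 LE BOM is a prefix of the UTF-32 LE one.
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'                # decodes and strips the BOM
    if head.startswith(codecs.BOM_UTF32_LE) or head.startswith(codecs.BOM_UTF32_BE):
        return 'utf-32'
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'                   # this codec picks the right endianness itself
    return None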

The reason Notepad++ can do character detection (as well as web browsers, word processors, etc.) is that these programs all have one or both of these methods built into them. Python doesn't build this into its interpreter -- it's a general-purpose programming language, not a text editor -- but that's just what the chardet package does.
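
If installing it were an option, basic chardet usage looks roughly like this sketch (the file path is just a placeholder):

import chardet

some_log = "SomeDrive:\\SomeDirs\\example.txt"   # placeholder path

with open(some_log, 'rb') as fh:                 # chardet wants raw bytes
    raw = fh.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'UTF-16', 'confidence': 0.87}
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'])
    print "Detected %s (confidence %.2f)" % (guess['encoding'], guess['confidence'])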

I would say that because you know some things about the text files you're handling, you might be able to take a few shortcuts. For example, are your log files all in either encoding A or encoding B? If so, then your decision is much simpler, and either the frequency-based or the rule-based approach above would probably be pretty straightforward to implement on your own. But if you need to detect arbitrary character sets, I'd highly recommend building on the shoulders of giants.
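
Since the question narrows the candidates down to ANSI, UTF-8 and UTF-16 LE without a BOM, a rough sketch of that shortcut (under those assumptions; the find_matching_encoding helper is made up, and 'mbcs' is the Windows ANSI code page, so it only exists on Windows) is to try each candidate in turn and accept the first decode that both succeeds and contains the marker string:

import codecs
import os

# Constants from the question
FILEPATH = "SomeDrive:\\SomeDirs\\"
MATCH_CONDITION = ".txt"
MATCH_CONTENT = "--------Base Data Details:--------------------"

# Most specific first; 'utf-8-sig' also swallows a UTF-8 BOM if one is present.
CANDIDATE_ENCODINGS = ['utf_16_le', 'utf-8-sig', 'mbcs']

def find_matching_encoding(path):
    """Return the first candidate encoding under which the marker appears, else None."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            with codecs.open(path, 'r', enc) as fh:
                if MATCH_CONTENT in fh.read():
                    return enc
        except (UnicodeError, LookupError):
            pass          # wrong guess (or codec unavailable); try the next candidate
    return None

for root, dirs, files in os.walk(FILEPATH):
    for f in files:
        if MATCH_CONDITION in f:
            if find_matching_encoding(os.path.join(root, f)):
                print "CONTENTS MATCHED: " + f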
