How to determine whether a stream is text or binary in Python?


Question

Is there a way to determine (test, check, or classify) whether a file (or a bytestream, or another file-like object) is text or binary, in a practical majority of cases, similar to the magic of the Unix file command?

Motivation: Although guesswork should generally be avoided, I'd like to use this capability wherever Python can make the determination. That would cover a useful number of cases, and the exceptions could be handled separately.

Preference would be given to cross-platform or pure-Python methods. One option is python-magic; however, it depends on libmagic in general, and on Cygwin on Windows.

Recommended answer

From the file man page:

The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually "binary" or non-printable).

Since you only want to determine whether it's text or binary, I would simply check whether every character in the stream is printable:

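import string
# `stream` here is the stream's contents as a string; iterating a file
# object directly would yield whole lines rather than single characters.
all(c in string.printable for c in stream)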

I don't think you will ever get this 100% right, but it should be reasonably accurate. Do you need to handle Unicode encodings, though?
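The printable-character check above can also be applied to a file-like object by sampling its first bytes. The following is only a minimal sketch under stated assumptions: the stream is opened in binary mode, and the helper name `looks_like_text` and the `sample_size` default are hypothetical choices made for illustration, not part of the original answer.

```python
import io
import string

# Byte values of every character in string.printable
# (digits, letters, punctuation, and ASCII whitespace).
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))

def looks_like_text(stream, sample_size=1024):
    """Return True if the first sample_size bytes are all printable ASCII.

    Assumes `stream` was opened in binary mode, so read() returns bytes.
    An empty stream is reported as text, because all() of nothing is True.
    """
    sample = stream.read(sample_size)
    # Iterating over bytes yields integers, so compare against byte values.
    return all(b in PRINTABLE_BYTES for b in sample)

print(looks_like_text(io.BytesIO(b"plain text\nwith two lines\n")))  # True
print(looks_like_text(io.BytesIO(b"\x89PNG\r\n\x1a\n\x00\x00")))     # False
```

Sampling keeps the check cheap on large files, at the cost of missing non-printable bytes that appear only after the sample.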

EDIT: Unicode support is a little tricky, but if you have a set of candidate encodings, you can try decoding the document with each one in turn and then check whether all of the decoded characters are printable:

# Python 3: the test inputs are bytes, as read from a binary-mode stream.
import unicodedata

encodings = 'ascii', 'utf-8', 'utf-16'

test_strings = b'\xf0\x01\x01\x00\x44', b'this is a test', b'a utf-8 test \xe2\x98\x83'

def attempt_decode(s, encodings):
    # Try each candidate encoding in order; return the first clean decode.
    for enc in encodings:
        try:
            return s.decode(enc), enc
        except UnicodeDecodeError:
            pass
    return s, 'binary'

def printable(s):
    # Reject any control character (Unicode category 'Cc').
    # Note that tab and newline are also in 'Cc', so they are rejected too.
    return not any(unicodedata.category(c) == 'Cc' for c in s)

for s in test_strings:
    result, enc = attempt_decode(s, encodings)
    if enc != 'binary' and not printable(result):
        result, enc = s, 'binary'
    print(enc + ' - ' + repr(result))

The result is:

binary - b'\xf0\x01\x01\x00D'
ascii - 'this is a test'
utf-8 - 'a utf-8 test ☃'
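The decode-then-check idea can be carried over to a file-like object. The sketch below is a variation, not the answer's own code. It assumes a binary-mode stream and uses the hypothetical names `classify_stream`, `ALLOWED_CONTROLS`, and `sample_size`. It differs from the example above in three ways: it permits tab, newline, and carriage return; it keeps trying the remaining encodings when a decode succeeds but yields control characters; and it uses an incremental decoder so that a multi-byte character cut off at the sample boundary is not mistaken for a decoding error.

```python
import codecs
import io
import unicodedata

ALLOWED_CONTROLS = "\t\n\r"  # control characters that are common in real text

def classify_stream(stream, encodings=("ascii", "utf-8", "utf-16"), sample_size=4096):
    """Return the first encoding whose decode of the sample is printable, else 'binary'.

    Assumes `stream` was opened in binary mode, so read() returns bytes.
    """
    sample = stream.read(sample_size)
    # A short read is taken to mean end of stream (true for BytesIO and
    # buffered files; raw streams may return short reads earlier).
    at_eof = len(sample) < sample_size
    for enc in encodings:
        decoder = codecs.getincrementaldecoder(enc)()
        try:
            # With final=False, an incomplete trailing sequence is buffered
            # instead of raising UnicodeDecodeError.
            text = decoder.decode(sample, final=at_eof)
        except UnicodeDecodeError:
            continue
        if all(c in ALLOWED_CONTROLS or unicodedata.category(c) != "Cc" for c in text):
            return enc
    return "binary"

print(classify_stream(io.BytesIO(b"line one\nline two\n")))
print(classify_stream(io.BytesIO("snowman \u2603\n".encode("utf-8"))))
print(classify_stream(io.BytesIO(b"\xf0\x01\x01\x00\x44")))
```

The order of `encodings` matters: pure ASCII data also decodes as UTF-8, so listing the stricter encoding first gives the more specific answer.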
