Command-line arguments as bytes instead of strings in Python 3


Question

I'm writing a Python 3 program that gets the names of the files to process from command-line arguments. I'm confused about the proper way to handle different encodings.

I think I'd rather treat filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.

I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.

I've successfully adapted my code to use bytes, and my tool can handle files whose names are invalid in the current default encoding, as long as it reaches them by recursing through the filesystem, because I convert the arguments to bytes early and use bytes when calling fs functions. When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8. I do not know what these are, and trying to encode it always fails, be it with utf8 or with the corresponding (wrong) encoding (latin1 here).

The other problem is reporting errors. I expect users of my tool to parse my stdout (hence wanting to preserve filenames), but when reporting errors on stderr I'd rather encode them in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.

So:

1) Is there a better, completely different way to do it? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)

2) How do I get the command-line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail?

3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me?

Recommended answer


"When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8."

Those are surrogate characters. The low 8 bits are the original invalid byte: \udce8 stands for the byte 0xE8, which is "è" in latin1.
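The mapping can be checked with a short experiment. This is a minimal sketch, assuming a POSIX system where arguments are decoded as UTF-8 with the `surrogateescape` error handler; the filename `caf\xe8.txt` is a made-up example, not one from the question.

```python
# Hypothetical latin1-encoded filename; 0xE8 is not valid UTF-8 here.
raw = b"caf\xe8.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# The low 8 bits of the surrogate are the original byte.
assert ord(name[3]) & 0xFF == 0xE8

# A strict encode fails, because lone surrogates are not encodable
# in UTF-8 (nor in latin1, since U+DCE8 is above 0xFF).
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# Encoding with the same error handler restores the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")
assert restored == raw
```

This is why encoding with plain utf8 or latin1 always failed in the question: only the `surrogateescape` handler knows how to turn the surrogate back into a byte.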

See PEP 383: Non-decodable Bytes in System Character Interfaces.
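Building on PEP 383, questions 2 and 3 might be handled as sketched below. `os.fsencode` is the standard-library function that reverses the decoding applied to command-line arguments on POSIX, and `errors="replace"` makes the decoder substitute U+FFFD for undecodable bytes. The helper names `display_name`, `build_parser` and `main` are illustrative, not part of the original answer, and the behaviour on Windows differs.

```python
import argparse
import os
import sys


def display_name(raw: bytes) -> str:
    """Return a printable form of a raw filename for error messages."""
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return raw.decode("utf-8", errors="replace")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # os.fsencode turns each decoded argument back into its original bytes.
    parser.add_argument("paths", nargs="*", type=os.fsencode)
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    status = 0
    for raw in args.paths:
        if os.path.exists(raw):
            # stdout carries the exact bytes so other tools can parse them.
            sys.stdout.buffer.write(raw + b"\n")
        else:
            # stderr carries a lossy but always printable name.
            print("no such file:", display_name(raw), file=sys.stderr)
            status = 1
    sys.stdout.buffer.flush()
    return status
```

A script would typically end with `sys.exit(main())`. Passing `os.fsencode` as the argparse `type` keeps the conversion to bytes at the boundary, so the rest of the program only ever sees bytes.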
