Creating UTF-16 newline characters in Python for Windows Notepad


Question

In Python 2.7 running on Ubuntu, this code:

f = open("testfile.txt", "w")
f.write("Line one".encode("utf-16"))
f.write(u"\r\n".encode("utf-16"))
f.write("Line two".encode("utf-16"))

produces the desired newline between the two lines of text when read in Gedit:

Line one
Line two

However, the same code executed on Windows 7 and read in Notepad produces unintelligible characters after "Line one", and Notepad does not recognize any newline. How can I write correct newline characters for UTF-16 on Windows to match the output I get on Ubuntu?

I am writing output for a Windows-only application that reads only UTF-16. I've spent hours trying out different tips, but nothing seems to work for Notepad. It's worth mentioning that I can successfully convert a text file to UTF-16 in Notepad itself, but I'd rather have the script save the encoding correctly in the first place.

Recommended answer

The problem is that you're opening the file in text mode, but trying to use it as a binary file.

This:

u"\r\n".encode("utf-16")

… encodes to '\r\0\n\0' (ignoring the BOM for the moment).

Then this:

f.write('\r\0\n\0')

… has its \n byte converted to \r\n by Windows text mode, giving '\r\0\r\n\0'.

And that, of course, breaks your UTF-16 encoding. Besides the fact that the two bytes \r\n will decode (as little-endian) into the valid but unassigned code point U+0A0D, you now have an odd number of bytes, meaning there is a leftover \0. So, instead of L\0 being the next character, it's \0L, which is U+4C00, and so on.
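The byte arithmetic above can be checked by hand. The following is only a sketch: the `.replace()` call stands in for the newline translation that Windows text mode performs, and 'utf-16-le' is used so that no BOM gets in the way.

```python
# Simulate what Windows text mode does to already-encoded UTF-16 bytes.
# Assumption: .replace() is a stand-in for the text-mode newline translation.
encoded = u"\r\n".encode("utf-16-le")         # 4 bytes: \r \0 \n \0
translated = encoded.replace(b"\n", b"\r\n")  # 5 bytes: \r \0 \r \n \0

# The first four bytes decode to CR followed by the unassigned U+0A0D.
first_two = translated[:4].decode("utf-16-le")

# The leftover \0 pairs up with the first byte of the next character ("L").
next_char = u"L".encode("utf-16-le")
shifted = (translated[4:] + next_char)[:2].decode("utf-16-le")

print(len(translated))        # 5
print(hex(ord(first_two[1]))) # 0xa0d
print(hex(ord(shifted)))      # 0x4c00
```

Every character after that point is shifted by one byte in the same way, which is why the rest of the file looks like gibberish in Notepad.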

On top of that, you're writing a new UTF-16 BOM for each encoded string, because every call to .encode("utf-16") prepends one. Most Windows apps will transparently handle the extra BOMs and ignore them, so in practice all you're doing is wasting two bytes per write, but it isn't actually correct.
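A quick way to see the repeated BOM is to compare the generic 'utf-16' codec with an explicit byte order. This sketch uses only the standard codecs constants; the variable names are illustrative.

```python
import codecs

first = u"Line one".encode("utf-16")      # native byte order, BOM prepended
second = u"Line two".encode("utf-16")     # BOM prepended again
no_bom = u"Line two".encode("utf-16-le")  # explicit byte order, no BOM

print(first.startswith(codecs.BOM_UTF16))      # True
print(second.startswith(codecs.BOM_UTF16))     # True
print(no_bom.startswith(codecs.BOM_UTF16_LE))  # False
print(len(second) - len(no_bom))               # 2
```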

The quick fix to the first problem is to open the file in binary mode:

f = open("testfile.txt", "wb")

This doesn't fix the multiple-BOM problem, but it fixes the broken \n problem. If you want to fix the BOM problem, you either use a stateful encoder, or you explicitly specify 'utf-16-le' (or 'utf-16-be') for all writes but the first.
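One way to get a stateful encoder is `codecs.getincrementalencoder`, which emits the BOM only on the first call. This is a minimal sketch of that option, not necessarily what the answer had in mind; the temporary directory is just a hypothetical location for the test file.

```python
import codecs
import os
import tempfile

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "testfile.txt")

# The incremental encoder remembers that the BOM has already been emitted.
encoder = codecs.getincrementalencoder("utf-16")()

with open(path, "wb") as f:  # binary mode: no newline translation
    for chunk in (u"Line one", u"\r\n", u"Line two"):
        f.write(encoder.encode(chunk))

with open(path, "rb") as f:
    raw = f.read()

print(raw.count(codecs.BOM_UTF16))                      # 1
print(raw.decode("utf-16") == u"Line one\r\nLine two")  # True
```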

But the easy fix, for both problems, is to use the io module (or, for older Python 2.x, the codecs module) to do all the hard work for you:

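import io
# newline="" writes u"\r\n" unchanged; without it, Windows would emit \r\r\n
f = io.open("testfile.txt", "w", encoding="utf-16", newline="")
f.write(u"Line one")
f.write(u"\r\n")
f.write(u"Line two")
f.close()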
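A variant of the io approach, sketched here under the assumption that you want Windows line endings regardless of the platform the script runs on: pass newline="\r\n" and write plain u"\n", letting the text layer do the translation before encoding. The temporary path is hypothetical, and the read-back at the end is only there to check the bytes.

```python
import codecs
import io
import os
import tempfile

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "testfile.txt")

# newline="\r\n" turns every u"\n" into u"\r\n" before it is encoded,
# so the translation can no longer split a UTF-16 code unit.
with io.open(path, "w", encoding="utf-16", newline="\r\n") as f:
    f.write(u"Line one\n")
    f.write(u"Line two")

with open(path, "rb") as f:
    raw = f.read()

print(raw.count(codecs.BOM_UTF16))                      # 1
print(raw.decode("utf-16") == u"Line one\r\nLine two")  # True
```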
