Why is Python re.search adding spaces to my string?

Problem description

I want Python to open a Unicode text file, read through each line, and write the line to a new Unicode text file only if it does not contain any digits. So if the input is:

1
8:00:00 --> 8:00:01
Hello World!

It should output:

Hello World!

But what I get is:

H e l l o  W o r l d !

I'm not sure why it's adding spaces between each character. What am I missing? Here is the code I'm using:

import re

nFile = open("NewFile.txt", 'w')

with open("OriginalFile.txt", 'r') as f:
    for line in f:
        if not (re.search("\d", line)):
            nFile.write(line)

Recommended answer

That was a tough one, but this seems to work.

First off, as we've already discussed in the comments, it's an encoding problem. In fact, re.search could not add spaces to the string even if it wanted to: strings are immutable, so the only way to change line is to rebind it with something like line = ....
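To illustrate how an encoding mismatch alone can produce the "spaces", here is a minimal sketch. It assumes, purely for illustration, that the text is stored as UTF-16-LE and is then read as if it were a single-byte encoding (latin-1 here); the variable names are hypothetical and the code is written for Python 3.

```python
# Illustration only: assumes UTF-16-LE bytes that get misread as latin-1.
text = "Hello World!"

# In UTF-16-LE every ASCII character takes two bytes: the character itself
# followed by a zero byte.
raw = text.encode("utf-16-le")
print(raw[:6])  # b'H\x00e\x00l\x00'

# Reading those bytes with a single-byte codec keeps the zero bytes as NUL
# characters, which many editors and terminals display as blanks.
misread = raw.decode("latin-1")
print(repr(misread[:6]))  # 'H\x00e\x00l\x00'

# Decoding with the matching codec gives the original text back.
print(raw.decode("utf-16-le"))  # Hello World!
```

The regular expression never touches the content of the line; the extra characters are already there once the file has been decoded with the wrong codec.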

The input file you linked in the comments is encoded as UTF-16-LE, which is not the default encoding used by Python. One way to read it (there might be others, feel free to comment) is to use the codecs module.

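import codecs
import re

# Decode the file as UTF-16-LE while reading, so each line is proper text.
with codecs.open("HarryPotterSubsEs2.txt", 'r', encoding="utf-16-le") as f:
    for line in f:
        # Keep only the lines that contain no digit.
        if not re.search(r"\d", line):
            print(line)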

To write the selected lines to the output file, you can open the output file the same way, or call line = line.encode("utf8") to write each line to the file as UTF-8. (For some reason, the equivalent approach did not work for reading the lines; there it caused a Unicode error. I'm not 100% sure the conversion from UTF-16 to UTF-8 is lossless in this case; again, feel free to comment.)
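As a rough sketch of the "do the same with the outfile" option, the snippet below reads with one encoding and writes with another. It is written for Python 3, where the built-in open() accepts an encoding argument, which is a swap for the codecs.open call used above. The function name copy_lines_without_digits, its parameters, and the temporary file names are hypothetical, and the sample input is the one from the question.

```python
import os
import re
import tempfile

DIGIT = re.compile(r"\d")


def copy_lines_without_digits(src_path, dst_path,
                              src_encoding="utf-16-le", dst_encoding="utf-8"):
    """Copy every line that contains no digit; return how many were written."""
    kept = 0
    # Both files are opened in text mode, so decoding and encoding happen
    # inside the file objects and the loop only ever sees proper strings.
    with open(src_path, "r", encoding=src_encoding) as src, \
            open(dst_path, "w", encoding=dst_encoding) as dst:
        for line in src:
            if not DIGIT.search(line):
                dst.write(line)
                kept += 1
    return kept


# Small self-contained demo with throwaway files (hypothetical names).
sample = "1\n8:00:00 --> 8:00:01\nHello World!\n"
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "OriginalFile.txt")
    dst_file = os.path.join(tmp, "NewFile.txt")
    with open(src_file, "w", encoding="utf-16-le") as f:
        f.write(sample)
    kept = copy_lines_without_digits(src_file, dst_file)
    with open(dst_file, "r", encoding="utf-8") as f:
        result = f.read()

print(kept)
print(repr(result))
```

If the real file starts with a byte order mark, passing src_encoding="utf-16" instead may be preferable, since that codec consumes the mark rather than leaving it at the start of the first line.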

As an alternative, you might find a way to save the file in a different encoding (preferably UTF-8), using a different text editor than Notepad...
