在Python中删除'\ xad'的最佳方法? [英] Best way to remove '\xad' in Python?

查看:515
本文介绍了在Python中删除'\ xad'的最佳方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从iso8859-15:

I'm trying to build a corpus from the .txt file found at this link. I believe the instances of \xad are supposedly 'soft-hyphens', but do not appear to be read correctly under UTF-8 encoding. I've tried encoding the .txt file as iso8859-15, using the code:

with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r', 
encoding='iso8859-15') as myfile:
data=myfile.read().replace('\n', '')

data2 = data.split(' ')

这将返回一个'words'数组,但是'\ xad'仍附加到data2中的许多条目.我已经尝试过

This returns an array of 'words', but '\xad' remains attached to many entries in data2. I've tried

data_clean = data.replace('\\xad', '')

data_clean = data.replace('\\xad|\\xad\\xad','')

,但这似乎并未删除'\ xad'的实例.有人遇到过类似的问题吗?理想情况下,我想将此数据编码为UTF-8以利用nltk库,但是由于出现以下错误,它不会使用UTF-8编码读取文件:

but this doesn't seem to remove the instances of '\xad'. Has anyone ran into a similar problem before? Ideally I'd like to encode this data as UTF-8 to avail of the nltk library, but it won't read the file with UTF-8 encoding as I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 471: invalid start byte

任何帮助将不胜感激!

其他上下文::这是一个娱乐项目,旨在能够基于txt文件生成故事.到目前为止,我生成的所有内容都充满了'\ xad',这破坏了乐趣!

Additional context: This is a recreational project with the aim of being able to generate stories based on the txt file. Everything I've generated thus far has been permeated with '\xad', which ruins the fun!

推荐答案

您的文件几乎可以肯定具有实际的其中包含U + 00AD软连字符.

Your file almost certainly has actual U+00AD soft-hyphen characters in it.

这些字符标记在将行插入页面时可以拆分单词的位置.这个想法是,如果不需要拆分单词,则软连字符是不可见的,但与

These are characters that mark places where a word could be split when fitting lines to a page. The idea is that the soft hyphen is invisible if the word doesn't need to be split, but printed the same as a U+2010 normal hyphen if it does.

由于您不希望在文本中使用流畅的文字来呈现此文本,因此您永远不会使用连字符,因此只想删除这些字符即可.

Since you don't care about rendering this text in a book with nicely flowing text, you're never going to hyphenate anything, so you just want to remove these characters.

执行此操作的方法不是摆弄编码.只需使用您认为最易读的方式从Unicode文本中删除它们即可:

The way to do this is not to fiddle with the encoding. Just remove them from the Unicode text, using whichever of these you find most readable:

data = data.replace('\xad', '')
data = data.replace('\u00ad', '')
data = data.replace('\N{SOFT HYPHEN}', '')

注意单个反斜杠.我们不是要替换文字反斜杠xad,而是要替换文字软连字符,即代码点为十六进制0xad的字符.

Notice the single backslash. We're not replacing a literal backslash, x, a, d, we're replacing a literal soft-hyphen character, that is, the character whose code point is hex 0xad.

您可以在拆分成单词之前对整个文件执行此操作,也可以在拆分后对每个单词执行一次.

You can either do this to the whole file before splitting into words, or do it once per word after splitting.

与此同时,您似乎对什么是编码以及如何使用它们感到困惑:

Meanwhile, you seem to be confused about what encodings are and what to do with them:

我尝试将.txt文件编码为iso8859-15

I've tried encoding the .txt file as iso8859-15

否,您已尝试将文件解码为ISO-8859-15.目前尚不清楚为什么首先要尝试ISO-8859-15.但是,由于字符'\xad'的ISO-8859-15编码是字节b'\xad',所以也许是正确的.

No, you've tried decoding the file as ISO-8859-15. It's not clear why you tried ISO-8859-15 in the first place. But, since the ISO-8859-15 encoding for the character '\xad' is the byte b'\xad', maybe that's correct.

理想情况下,我想将此数据编码为UTF-8以利用nltk库

Ideally I'd like to encode this data as UTF-8 to avail of the nltk library

但是NLTK不需要UTF-8字节,它需要Unicode字符串.您不需要为此进行编码.

But NLTK doesn't want UTF-8 bytes, it wants Unicode strings. You don't need to encode it for that.

此外,您不是要尝试将Unicode文本编码为UTF-8,而是要从 解码 个字节.如果那不是这些字节的数量,那么...如果幸运的话,您会收到像这样的错误;如果没有,您将得到mojibake,直到您弄糟了500GB的语料库并丢弃了原始数据. 1

Plus, you're not trying to encode your Unicode text to UTF-8, you're trying to decode your bytes from UTF-8. If that's not what those bytes are… if you're lucky, you'll get an error like this one; if not, you'll get mojibake that you don't notice until you've screwed up a 500GB corpus and thrown away the original data.1

1. UTF-8是专门设计的,因此您将尽早获得早期错误.在这种情况下,像使用UTF-8一样使用软连字符读取ISO-8859-15文本会完全引起您所看到的错误,但是像使用ISO-8859-15一样使用软连字符读取UTF-8文本将无提示地进行提示.成功,但在每个软连字符之前加一个'Â'字符.该错误通常会更有帮助.

1. UTF-8 is specifically designed so you'll get early errors whenever possible. In this case, reading ISO-8859-15 text with soft hyphens as if it were UTF-8 raises exactly the error you're seeing, but reading UTF-8 text with soft hyphens as if it were ISO-8859-15 will silently succeed, but with an extra 'Â' character before each soft hyphen. The error is usually more helpful.

这篇关于在Python中删除'\ xad'的最佳方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆