Python: Got \xa0 instead of space in CSV and cannot remove or convert

Problem description

I have an encoding-related problem in Python (IPython notebook). This kind of problem seems to be very common and simple, but I still cannot fix it.

I have a CSV file here; as you can see, it contains many '\xa0' and '\n' sequences.

I used:

import io

with io.open(train_fname) as f:
    for line in f:
        line = line.encode("ascii", "replace")

But it does not work; I always get the following output:

Imagine being able say, you know what, no sanctions, no forever hearings on IEAA regulations, no more hiding\xa0under\xa0the pretense of friendly nuclear energy. \xa0You have 2 days to; \xa0i.e. \xa0let in the inspectors, quit killing the civilians.

I tried other methods, such as

line.replace(u"\xa0", " ")

but that does not work either. I also tried opening this CSV file in my text editor (Sublime Text) with all kinds of encodings: windows-1252, utf-8 and others. Whichever encoding I pick, the editor still shows \xa0 when I view the file.

Does this mean that \xa0 is already written in this CSV file as literal text, and that this is not a Python encoding problem? If so, why can't I simply replace this string with the replace method? Which encoding does \xa0 indicate the file is using? Does it mean the file was written in utf-8 while I am trying to open it as ascii or something else?

I searched many questions, but they don't seem to provide much help. Please ask if my question is not clear. Thank you very much!

Recommended answer

The \xa0 that you see is a sequence of 4 characters: \ x a 0. All these characters are plain ASCII, so no character set problem here.
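One way to check which case a given line falls into is to test for both forms explicitly. This is a minimal sketch assuming Python 3; the helper name describe and the sample strings are made up for illustration and do not come from the question.

```python
# The literal four-character text, as it appears in the file (raw string).
escaped = r"hiding\xa0under"
# A real non-breaking space: one single character, U+00A0.
real = "hiding\xa0under"


def describe(text):
    """Report which kind of non-breaking-space marker a string contains."""
    return {
        "literal_backslash_sequence": r"\xa0" in text,
        "real_nbsp_character": "\xa0" in text,
    }


print(len(escaped), describe(escaped))
print(len(real), describe(real))
```

The first string is three characters longer than the second, because backslash, x, a and 0 are stored as four separate ASCII characters instead of one non-breaking space.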

Apparently, you are supposed to interpret these escape sequences. Your idea of replacing them with a space is good, but you have to be careful about the backslash character. When it appears in a string literal, it has to be written \\. So try this:

line.replace("\\xa0", " ")

Or:

line.replace(r"\xa0", " ")

The r in front of the string means to interpret each character literally, even a backslash.
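The difference between the two spellings can be seen in a short experiment. This is a sketch assuming Python 3, with a sample line adapted from the output shown in the question. Note also that str.replace returns a new string, so the result has to be assigned to something.

```python
line = r"no more hiding\xa0under\xa0the pretense"

# Without the r prefix, "\xa0" is one character (U+00A0). The line does not
# contain that character, so nothing is replaced.
unchanged = line.replace("\xa0", " ")

# r"\xa0" is the four characters backslash, x, a, 0, which is what the line holds.
fixed = line.replace(r"\xa0", " ")

print(unchanged == line)  # True
print(fixed)              # no more hiding under the pretense
```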

Note that the data in the CSV file is full of inconsistencies. Examples:

  • \n probably means a linebreak.
  • \\n also appears, and it probably means a linebreak also.
  • \xa0 is a nonbreaking space, encoded in ISO-8859-1.
  • \xc2\xa0 is a nonbreaking space, encoded in UTF-8.
  • \\xc2\\xa0 also appears, with the same meaning.
  • \\\\n also appears.

So to get meaningful content out of that file, you should repeatedly interpret the escape sequences until nothing changes. After that, try to interpret the resulting byte sequence as UTF-8. If it works, fine. If not, interpret it as Codepage 1252 (which is a superset of ISO-8859-1).
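The procedure described above could be sketched roughly as follows. This assumes Python 3 and that the file is read in binary mode; the function names are hypothetical, and only the escapes \\, \n, \r, \t and \xHH are handled, which is an assumption about what the file contains. A backslash that was genuinely meant as a backslash would be misinterpreted, and the UTF-8 / code page 1252 fallback is applied to the whole field, so a field mixing both encodings will not come out clean.

```python
import re

# One backslash followed by: another backslash, n, r, t, or xHH (two hex digits).
_ESCAPE = re.compile(rb"\\(\\|n|r|t|x[0-9a-fA-F]{2})")
_SIMPLE = {b"\\": b"\\", b"n": b"\n", b"r": b"\r", b"t": b"\t"}


def _unescape_once(data: bytes) -> bytes:
    """Interpret one level of escape sequences."""
    def repl(match):
        body = match.group(1)
        if body.startswith(b"x"):
            return bytes([int(body[1:], 16)])  # \xHH -> the single byte 0xHH
        return _SIMPLE[body]
    return _ESCAPE.sub(repl, data)


def unescape_fully(data: bytes) -> bytes:
    """Repeat until a pass changes nothing (each changing pass shortens the data)."""
    while True:
        new = _unescape_once(data)
        if new == data:
            return data
        data = new


def decode_best_effort(data: bytes) -> str:
    """Try UTF-8 first, then fall back to code page 1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so replace those.
        return data.decode("cp1252", errors="replace")


def clean_field(raw: bytes) -> str:
    text = decode_best_effort(unescape_fully(raw))
    return text.replace("\xa0", " ")  # real non-breaking space -> plain space


print(clean_field(rb"hiding\xa0under"))  # hiding under
print(clean_field(rb"a\\xc2\\xa0b"))     # a b
```

The first example ends up as a lone 0xA0 byte, which is not valid UTF-8, so it takes the code page 1252 path. The second needs two unescaping passes before it decodes as UTF-8.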
