Converting UTF-8 (in literal) to Umlaute

Problem description

I used a scraper to get comments from Facebook. Unfortunately, it converted the German Umlaute "Ä", "Ü", "Ö" into UTF-8 literals such as "\xc3\xb6". I have tried several approaches to convert the files back, but none of them worked.

import csv, glob

for file in glob.glob("Comments/*.csv"):
    rawfile=csv.reader(open(file,"rU", encoding = "ISO-8859-1"))
    new_tablename=file +"converted"
    new_table=csv.writer(open("%s.csv" % (new_tablename),"w"))
    for row in rawfile:
        for w in row:
            a=str(w)
            b=a.encode('latin-1').decode('utf-8')
            print(b)
        new_table.writerow(row)

Another approach was to create a dictionary mapping all the literals to the German characters, but that did not work either.

import csv, glob, re
print("Start")
converter_table=csv.reader(open("LiteralConvert.csv","rU"))
converterdic={}
for line in converter_table:
    charToFind=line[2]
    charForReplace=line[1]
    print(charToFind+" will be replaced by "+charForReplace)
    converterdic[charToFind] = charForReplace


print(converterdic)

for file in glob.glob("Comments/*.csv"):
    rawfile=csv.reader(open(file,"rU", encoding = "ISO-8859-1"))
    print("opening: "+ file)
    new_tablename=file +"converted"
    new_table=csv.writer(open("%s.csv" % (new_tablename),"w"))
    print("created clean file: " + new_tablename)
    for row in rawfile:
        for w in row:
            #print(w)
            try:
                w.translate(converterdic)
            except KeyError:
                continue
        new_table.writerow(row)

However, the first solution works fine if I just do:

s="N\xc3\xb6 kein Schnee von gestern doch der beweis daf\xc3\xbcr das L\xc3\xbcgenpresse existiert."
b = s.encode('latin-1').decode('utf-8')

print(b)

But it does not work when I read the string from a file.

Recommended answer

I have been through all the comments and the other answer, trying to understand WHERE the problem is and WHAT the core of the problem you face is. After much thought, here is my conclusion:

The frequent core of problems with encoding/decoding strings is how you interpret what you have from what you see. In this context it is VERY IMPORTANT to understand that:

If you have a string/text in Python (or in a file), you NEVER see it "as it is", and you have to decide on the encoding/decoding scheme first.

In other words, you ALWAYS look at your data through the filter of a given encoding/decoding, and if the encoding/decoding scheme changes, what you see changes without any change in what you are looking at.
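The "filter" idea can be sketched in a few lines of Python. This is only an illustration, assuming the stored data is the UTF-8 form of "Nö"; the variable names are made up for the example.

```python
# The same bytes, looked at through two different "filters" (codecs).
raw = b"N\xc3\xb6"                      # what is actually stored: 3 bytes

seen_as_utf8 = raw.decode("utf-8")      # 'Nö'
seen_as_latin1 = raw.decode("latin-1")  # 'NÃ¶'

print(seen_as_utf8)
print(seen_as_latin1)

# The stored bytes did not change; only the interpretation did.
print(raw == seen_as_utf8.encode("utf-8") == seen_as_latin1.encode("latin-1"))
```

Re-encoding each view with the codec that produced it gives back exactly the same bytes, which is why `s.encode('latin-1').decode('utf-8')` can repair a string that was decoded with the wrong codec.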

Let's say the same once again, in yet other words: to look at a string or at text in a file you MUST use some kind of tool for its VISUALIZATION ... AND ... such a tool USES some information about the ENCODING (either implicitly, by taking a default value, or explicitly, by urging you to specify which encoding it should use), so without encoding/decoding there is no visualization. Understanding this has a huge impact on how you relate what you see to what you are actually looking at. It is like 3D glasses in a cinema: wearing them does not change what is on the screen, but it changes how you see it.
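A small sketch of the same point with a file, assuming Python 3 and a throwaway temporary file created just for the demonstration: the `encoding` argument of `open()` plays the role of the tool setting that decides what you see.

```python
import os
import tempfile

# Write the UTF-8 bytes for "dafür" to a temporary file (made-up test data).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("daf\u00fcr".encode("utf-8"))
    path = tmp.name

# Same file, three ways of looking at it.
with open(path, encoding="utf-8") as f:
    as_utf8 = f.read()
with open(path, encoding="ISO-8859-1") as f:
    as_latin1 = f.read()
with open(path, "rb") as f:
    as_bytes = f.read()

os.remove(path)

print(as_utf8)    # dafür
print(as_latin1)  # dafÃ¼r
print(as_bytes)   # b'daf\xc3\xbcr'
```

The file on disk is identical in all three reads; only the decoding choice differs.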

So if you have a UTF-8 encoded string with non-ASCII characters and look at it with a tool that shows UTF-8 characters, you see the German Umlaute as they are. BUT if you look at the same string with a tool for visualizing binary strings, it will show you neither the non-ASCII characters (it is binary, so it visualizes byte by byte and cannot show non-ASCII without knowing the encoding used) nor the UTF-8 interpretation (an Umlaut is two bytes, but the tool shows byte by byte). It will show you the non-ASCII characters in the form "\xc3\xb6", BUT ... in the string/file there are NOT 8 bytes, there are only TWO bytes, '0xC3' and '0xB6'. This is why e.g. the print() command uses "\xc3\xb6" to show you what the bytes are.
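Whether a given string holds two real bytes or eight literal characters can be checked by counting rather than guessed. The sketch below contrasts the two cases and shows one possible way to convert text that really does contain written-out backslash escapes. The helper name `unescape_utf8_literals` and the regular expression are assumptions made for this example, not something from the original post.

```python
import re

two_bytes = "\u00f6".encode("utf-8")  # real UTF-8 data for 'ö'
literal = r"\xc3\xb6"                 # escapes written out as plain text

print(len(two_bytes))  # 2
print(len(literal))    # 8

# Matches one or more consecutive \xHH escapes written out as plain text.
_ESCAPES = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")

def unescape_utf8_literals(text):
    """Turn literal \\xHH runs back into the characters they encode (hypothetical helper)."""
    def _decode(match):
        hex_digits = match.group(0).replace("\\x", "")
        # Undecodable byte runs become U+FFFD instead of raising an error.
        return bytes.fromhex(hex_digits).decode("utf-8", errors="replace")
    return _ESCAPES.sub(_decode, text)

print(unescape_utf8_literals(r"N\xc3\xb6 kein Schnee"))  # Nö kein Schnee
```

Each run of escapes is decoded as a whole, so multi-byte characters such as the Umlaute are reassembled from both of their bytes; text without escapes passes through unchanged.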

I hope you now get the idea of what I am talking about (it is a kind of enlightenment experience after long hours/days/months of confusion).

Here is an excerpt from the UTF-8 table in which you can find the letter 'ö':

"""U+00F6 ö c3 b6 ö ö LATIN SMALL LETTER O WITH DIAERESIS"""

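The code point, the UTF-8 bytes and the character name for 'ö' can be reproduced with Python's standard library as a quick sanity check. This is a minimal sketch; the variable names are chosen for the example.

```python
import unicodedata

ch = "\u00f6"  # the letter 'ö'

code_point = "U+{:04X}".format(ord(ch))
utf8_hex = " ".join("{:02x}".format(b) for b in ch.encode("utf-8"))
name = unicodedata.name(ch)

print(code_point, ch, utf8_hex, name)
# U+00F6 ö c3 b6 LATIN SMALL LETTER O WITH DIAERESIS
```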