How do I correct the character encoding of a file?

Question

I have an ANSI encoded text file that should not have been encoded as ANSI as there were accented characters that ANSI does not support. I would rather work with UTF-8.

Can the data be decoded correctly or is it lost in transcoding?

What tools could I use?

Here is a sample of what I have:

Ã§ Ã©

I can tell from context (cafÃ© should be café) that these should be these two characters:

ç é


Recommended answer

A simple possibility to eliminate before getting into more complicated solutions: have you tried setting the character set to utf8 in the text editor in which you're reading the file? This could just be a case of somebody sending you a utf8 file that you're reading in an editor set to say cp1252.

Just taking the two examples, this is a case of utf8 being read through the lens of a single-byte encoding, likely one of iso-8859-1, iso-8859-15, or cp1252. If you can post examples of other problem characters, it should be possible to narrow that down more.
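
To illustrate why the two samples alone cannot tell the candidate encodings apart, here is a minimal Python sketch. Python is only my choice for illustration, and the helper name `misread` is hypothetical. It encodes the intended text as UTF-8 and then decodes the bytes with each single-byte codec:

```python
def misread(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)

CANDIDATES = ("iso-8859-1", "iso-8859-15", "cp1252")

# All three codecs map bytes 0xC3, 0xA7 and 0xA9 to the same characters,
# so "ç é" comes out garbled identically under each of them.
for codec in CANDIDATES:
    print(codec, repr(misread("ç é", codec)))

# Other characters do separate the candidates:
# "€" (UTF-8 bytes E2 82 AC) separates cp1252 from the two ISO codecs,
# "ä" (UTF-8 bytes C3 A4) separates iso-8859-1 from iso-8859-15.
for sample in ("€", "ä"):
    for codec in CANDIDATES:
        print(sample, codec, repr(misread(sample, codec)))
```

If the garbled file contains characters like these, comparing what you see against the printed forms should point to one specific codec.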

As visual inspection of the characters can be misleading, you'll also need to look at the underlying bytes: the § you see on screen might be either 0xa7 or 0xc2a7, and that will determine the kind of character set conversion you have to do.
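
As a rough sketch of what "looking at the underlying bytes" can mean in practice, the Python snippet below prints hex dumps for two situations that can both end up displaying as Ã§. The helper name `show_bytes` is hypothetical, and the byte strings are built in memory rather than read from your actual file:

```python
def show_bytes(data: bytes) -> str:
    """Render bytes as space-separated hex pairs."""
    return data.hex(" ")

# Case 1: the file holds valid UTF-8 for "ç"; an editor set to cp1252
# displays these two bytes as "Ã§", but the data itself is fine.
print(show_bytes("ç".encode("utf-8")))    # c3 a7

# Case 2: the garbled text was itself saved as UTF-8 (double encoding);
# an editor set to UTF-8 displays these four bytes as "Ã§".
print(show_bytes("Ã§".encode("utf-8")))   # c3 83 c2 a7
```

For a real file, something like `pathlib.Path("data.txt").read_bytes()` (the file name is a placeholder) would supply the bytes to inspect. Seeing `c2 a7` rather than a bare `a7` suggests the second case, where an actual conversion is needed rather than just a change of editor settings.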

Can you assume that all of your data has been distorted in exactly the same way - that it's come from the same source and gone through the same sequence of transformations, so that for example there isn't a single ç in your text, it's always Ã§? If so, the problem can be solved with a sequence of character set conversions. If you can be more specific about the environment you're in and the database you're using, somebody here can probably tell you how to perform the appropriate conversion.
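
If the data really is uniformly garbled, one possible conversion sequence looks like the Python sketch below. It assumes the wrong codec was cp1252, which is only one of the candidates mentioned above, and the function name `repair` is hypothetical:

```python
def repair(text: str, wrong_codec: str = "cp1252") -> str:
    """Undo one round of 'UTF-8 bytes decoded as wrong_codec'."""
    # Re-encoding with the wrong codec recovers the original UTF-8 bytes;
    # decoding those bytes as UTF-8 yields the intended characters.
    return text.encode(wrong_codec).decode("utf-8")

print(repair("cafÃ©"))   # café
print(repair("Ã§ Ã©"))   # ç é

try:
    repair("café")  # already correct, so the round trip is not valid UTF-8
except UnicodeError as exc:
    print("not uniformly garbled:", exc)
```

A `UnicodeError` here is useful information: it indicates that at least part of the text did not go through the assumed transformation, so a blanket conversion would not be safe.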

Otherwise, if the problem characters are only occurring in some places in your data, you'll have to take it instance by instance, based on assumptions along the lines of "no author intended to put Ã§ in their text, so whenever you see it, replace by ç". The latter option is riskier, firstly because those assumptions about the intentions of the authors might be wrong, secondly because you'll have to spot every problem character yourself, which might be impossible if there's too much text to visually inspect or if it's written in a language or writing system that's foreign to you.
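
The instance-by-instance approach could be sketched in Python roughly as follows. The replacement table only contains the two pairs from the question, and the names `REPLACEMENTS`, `replace_known` and `find_suspects` are hypothetical. The second helper lists leftover sequences starting with Ã so that they can be reviewed by hand instead of being replaced blindly:

```python
import re

# Assumed mapping, built from the two garbled pairs in the question.
REPLACEMENTS = {"Ã§": "ç", "Ã©": "é"}

def replace_known(text: str, table: dict[str, str]) -> tuple[str, dict[str, int]]:
    """Replace each known garbled sequence and count how often it occurred."""
    counts = {}
    for bad, good in table.items():
        counts[bad] = text.count(bad)
        text = text.replace(bad, good)
    return text, counts

def find_suspects(text: str) -> list[str]:
    """List distinct two-character sequences starting with 'Ã' for manual review."""
    return sorted(set(re.findall(r"Ã.", text)))

fixed, counts = replace_known("cafÃ© franÃ§ais, dÃ®ner", REPLACEMENTS)
print(fixed)                 # known pairs repaired, unknown ones left alone
print(counts)                # how many replacements each rule made
print(find_suspects(fixed))  # sequences still needing a human decision
```

Keeping the counts and the list of suspects makes the risk mentioned above more visible: every rule in the table is an assumption about what the authors intended, and anything not covered still has to be found and judged manually.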
