Detect encoding and make everything UTF-8
Question
I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.
In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid the cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?
How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:
- How do I find out what encoding the text uses?
- How do I convert it to UTF-8, whatever the old encoding is?
Would a function like this work?
function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}
I've tested it, but it doesn't work. What's wrong with it?
Recommended answer
If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
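To make the double-encoding problem concrete, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode(). The helper name to_utf8_if_needed() is hypothetical, not part of any library mentioned here.

```php
<?php
// "Fußball" as UTF-8 bytes: ß is 0xC3 0x9F.
$utf8 = "Fu\xC3\x9Fball";

// Converting from ISO-8859-1 again treats each UTF-8 byte as its own
// Latin-1 character, so ß turns into two characters (double encoding).
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // 4675c383c29f62616c6c

// Hypothetical helper: convert only when the input is not valid UTF-8.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function to_utf8_if_needed(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already UTF-8, do nothing
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
```

This only answers the "when must I do nothing" part of the question. A Latin-1 byte sequence that happens to be valid UTF-8 is indistinguishable from real UTF-8, so the check is a heuristic rather than a guarantee.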
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
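For readers curious how a string mixing UTF-8 and Latin1 could be normalized at all, the following is a rough sketch of one possible approach, not the library's actual code. It walks the bytes, keeps well-formed UTF-8 sequences, and converts every other high byte as ISO-8859-1. The function name mixed_to_utf8() is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Hypothetical illustration, not taken from forceutf8.
function mixed_to_utf8(string $text): string
{
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; ) {
        $byte = ord($text[$i]);
        if ($byte < 0x80) {            // plain ASCII, copy as-is
            $out .= $text[$i];
            $i++;
            continue;
        }
        // Expected UTF-8 sequence length, judged from the lead byte.
        if ($byte >= 0xC2 && $byte <= 0xDF) {
            $n = 2;
        } elseif ($byte >= 0xE0 && $byte <= 0xEF) {
            $n = 3;
        } elseif ($byte >= 0xF0 && $byte <= 0xF4) {
            $n = 4;
        } else {
            $n = 0;                    // cannot start a UTF-8 sequence
        }
        $seq = $n > 0 ? substr($text, $i, $n) : '';
        if ($n > 0 && strlen($seq) === $n && mb_check_encoding($seq, 'UTF-8')) {
            $out .= $seq;              // well-formed UTF-8, keep untouched
            $i += $n;
        } else {
            // Assumption: a stray high byte is ISO-8859-1.
            $out .= mb_convert_encoding($text[$i], 'UTF-8', 'ISO-8859-1');
            $i++;
        }
    }
    return $out;
}

// Latin-1 "ß" (0xDF) and UTF-8 "é" (0xC3 0xA9) in the same string.
echo bin2hex(mixed_to_utf8("Fu\xDFball F\xC3\xA9d")), "\n";
```

The trade-off is the same as with any such heuristic: Latin1 text whose bytes happen to form a valid UTF-8 sequence will be left as it is.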
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("FÃ©dération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÂÂÂ©dÃÂÂÂÂ©ration Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
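The idea of repairing text that was encoded to UTF-8 more than once can be sketched as follows. This is an illustrative assumption about one way to do it, not how fixUTF8() is implemented. The helper undo_double_encoding() is hypothetical, it assumes the mbstring extension, and it only handles ISO-8859-1 double encoding (not Windows-1252).

```php
<?php
// Hypothetical helper, not part of forceutf8.
function undo_double_encoding(string $text, int $maxRounds = 5): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        if (!mb_check_encoding($text, 'UTF-8')) {
            break; // not UTF-8 at all, nothing to undo
        }
        // Undo one layer: read as UTF-8, write as ISO-8859-1 bytes.
        $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
        // Re-encoding must give back the input, otherwise characters
        // outside Latin-1 were lost and the step is not reversible.
        $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1');
        if ($candidate === $text
            || $roundTrip !== $text
            || !mb_check_encoding($candidate, 'UTF-8')) {
            break; // the text was not double encoded (any more)
        }
        $text = $candidate;
    }
    return $text;
}

// "é" encoded to UTF-8 twice: C3 A9 became C3 83 C2 A9.
echo bin2hex(undo_double_encoding("F\xC3\x83\xC2\xA9d")), "\n"; // 46c3a964
```

A string that legitimately contains a sequence such as "Ã" followed by "©" would also be rewritten, so a helper like this is best applied only to data known to be damaged.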
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().