Detect encoding and make everything UTF-8
Problem description
I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO-8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
1) The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
2) Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.
3) In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid the cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?
Can you help me and tell me how to make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:
1) How to find out what encoding the text uses
2) How to convert it to UTF-8, whatever the old encoding is
EDIT: Would a function like this work?
function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}
I've tested it but it doesn't work. What's wrong with it?
If you apply utf8_encode() to an already UTF8 string it will return a garbled UTF8 output.
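The effect is easiest to see at the byte level. The following is a minimal sketch, assuming the mbstring extension is available. It uses mb_convert_encoding() from ISO-8859-1 in place of utf8_encode(), which is deprecated since PHP 8.2 and should behave the same way for this input:

```php
<?php
// "é" encoded as UTF-8 is the two bytes C3 A9.
$utf8 = "\xC3\xA9";

// Same idea as utf8_encode(): every input byte is treated as one
// ISO-8859-1 character and re-encoded as UTF-8.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($twice), "\n";  // c383c2a9, which displays as "Ã©"
```

Each of the two original bytes has been turned into a character of its own, which is where the typical "Ã©" garbage comes from.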
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.
I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
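If each feed item is entirely in one encoding (not mixed within a string), the basic idea can be approximated with mbstring alone. This is only a sketch and not how the library is implemented. toUtf8Simple() is a hypothetical helper name, and Windows-1252 as the fallback is an assumption about the feeds. It checks for valid UTF-8 first instead of relying on mb_detect_encoding($text, 'auto'), whose default detection order typically does not include ISO-8859-1:

```php
<?php
// Hypothetical helper, not part of the forceutf8 library.
function toUtf8Simple(string $text, string $fallback = 'Windows-1252'): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned unchanged,
    // so text that is already correct is never encoded a second time.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise assume the single-byte fallback encoding and convert.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

echo toUtf8Simple("Fu\xDFball"), "\n";      // Latin-1 input
echo toUtf8Simple("Fu\xC3\x9Fball"), "\n";  // UTF-8 input
```

Both calls should print the same UTF-8 string. A string that mixes UTF-8 and Latin1 fails the validity check as a whole and would be double-encoded by this sketch, which is the case Encoding::toUTF8() is meant to cover.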
Download:
https://github.com/neitanod/forceutf8
Update:
I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("FÃ©dération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
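To give an idea of what such a repair involves, here is a rough sketch of undoing repeated encoding with mbstring. It is not the library's implementation: undoDoubleEncoding() and its $maxPasses parameter are hypothetical, and it assumes the extra encoding passes went through Windows-1252. One layer is peeled off at a time, and it stops as soon as the result would no longer be valid UTF-8 or the step cannot be reversed exactly:

```php
<?php
// Hypothetical helper, not part of the forceutf8 library.
function undoDoubleEncoding(string $text, int $maxPasses = 5): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        // Map each character back to the single byte it was read from.
        $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

        // Accept the step only if re-encoding reproduces the input exactly;
        // this rejects text with characters outside Windows-1252.
        $roundTrips = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $text;

        if ($candidate === $text || !$roundTrips || !mb_check_encoding($candidate, 'UTF-8')) {
            break; // nothing left to undo, or undoing would damage the text
        }
        $text = $candidate;
    }
    return $text;
}

echo undoDoubleEncoding("FÃ©dÃ©ration Camerounaise de Football"), "\n";
```

Unlike the first example above, this sketch leaves a string untouched when only some of its characters are garbled, and it will wrongly "repair" text in which a sequence such as "Ã©" was intended literally.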
Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().