Detect encoding and make everything UTF-8


Problem description

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO-8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

1) The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.

2) Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.

3) In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.

What can I do to avoid the cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?

Can you help me and tell me how to make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

1) How to find out what encoding the text uses
2) How to convert it to UTF-8, whatever the old encoding is

EDIT: Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it but it doesn't work. What's wrong with it?

Solution

If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
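To make that failure mode concrete, here is a minimal sketch of what happens when a Latin-1 to UTF-8 conversion runs twice. The helper names latin1_to_utf8() and to_utf8_once() are hypothetical. The first spells out by hand the byte arithmetic that utf8_encode() performs; the second guards it with the common preg_match('//u', ...) validity check, so text that is already valid UTF-8 passes through unchanged.

```php
<?php
// Hypothetical helper: the ISO-8859-1 -> UTF-8 mapping that utf8_encode()
// performs, written out by hand so the byte arithmetic is visible.
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= $bytes[$i];                 // ASCII is unchanged
        } else {
            $out .= chr(0xC0 | ($b >> 6))       // lead byte, 0xC2 or 0xC3
                  . chr(0x80 | ($b & 0x3F));    // continuation byte
        }
    }
    return $out;
}

// Hypothetical guard: convert only when the input is not already valid UTF-8.
function to_utf8_once(string $text): string
{
    // With the /u modifier, preg_match() returns 1 only for valid UTF-8 subjects.
    return preg_match('//u', $text) === 1 ? $text : latin1_to_utf8($text);
}

$latin1 = "Fu\xDFball";               // "Fußball" in ISO-8859-1
$once   = latin1_to_utf8($latin1);    // correct UTF-8
$twice  = latin1_to_utf8($once);      // double-encoded: this is the garbled case
$safe   = to_utf8_once($once);        // the guard leaves valid UTF-8 alone

echo bin2hex($once), "\n";   // 4675c39f62616c6c
echo bin2hex($twice), "\n";  // 4675c383c29f62616c6c
echo bin2hex($safe), "\n";   // 4675c39f62616c6c
```

The guard assumes that anything which is not valid UTF-8 is Latin-1, which only holds if those are the two encodings in play.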

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.
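One way to picture how a mixed string can be normalized: walk the bytes, keep every well-formed UTF-8 sequence as it is, and re-encode any other high byte as a single Latin-1 character. The following is a rough sketch of that idea, not the forceutf8 source. The name mixed_to_utf8() is hypothetical, and the sketch treats bytes 0x80-0x9F as plain Latin-1 rather than applying a Windows-1252 mapping.

```php
<?php
// Hypothetical sketch: normalize a string that mixes UTF-8 and Latin-1 bytes.
function mixed_to_utf8(string $s): string
{
    $out = '';
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $b = ord($s[$i]);
        if ($b < 0x80) {            // ASCII: copy as is
            $out .= $s[$i];
            $i++;
            continue;
        }
        // Length a UTF-8 sequence would have if this byte were a lead byte.
        if ($b >= 0xC2 && $b <= 0xDF) {
            $len = 2;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $len = 3;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $len = 4;
        } else {
            $len = 0;               // cannot start a UTF-8 sequence
        }
        $seq = $len > 0 ? substr($s, $i, $len) : '';
        if ($len > 0 && strlen($seq) === $len && preg_match('//u', $seq) === 1) {
            $out .= $seq;           // well-formed UTF-8: keep it
            $i += $len;
        } else {
            // Otherwise assume one Latin-1 byte and encode it as UTF-8.
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
            $i++;
        }
    }
    return $out;
}

// Latin-1 "Fußball" followed by UTF-8 "Fußball" in one string.
$mixed = "Fu\xDFball / Fu\xC3\x9Fball";
echo mixed_to_utf8($mixed), "\n";
```

The approach is inherently a guess: a Latin-1 byte pair that happens to form a valid UTF-8 sequence (for example the two characters "Ã©") is taken to be UTF-8.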

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

Update:

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
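The kind of repair shown above can be pictured as repeatedly undoing one layer of Latin-1 to UTF-8 encoding for as long as the result is still valid UTF-8. This is a simplified sketch under that assumption, not the library's implementation. The name undo_double_encoding() and the $maxRounds parameter are hypothetical, and only Latin-1 (not Windows-1252) layers are handled.

```php
<?php
// Hypothetical sketch: peel off repeated Latin-1 -> UTF-8 encoding layers.
function undo_double_encoding(string $s, int $maxRounds = 5): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        // Collapse each 2-byte sequence for U+0080..U+00FF back into one byte.
        $candidate = preg_replace_callback(
            '/[\xC2\xC3][\x80-\xBF]/',
            static function (array $m): string {
                return chr(((ord($m[0][0]) & 0x03) << 6) | (ord($m[0][1]) & 0x3F));
            },
            $s
        );
        // Stop when nothing changed, or when the result is no longer valid
        // UTF-8, which means the current string was the last sensible layer.
        if ($candidate === null || $candidate === $s
            || preg_match('//u', $candidate) !== 1) {
            break;
        }
        $s = $candidate;
    }
    return $s;
}

// "Fédération" after one extra encoding pass (each é became 4 bytes).
$garbled = "F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration Camerounaise de Football";
echo undo_double_encoding($garbled), "\n";  // Fédération Camerounaise de Football
```

A string that is already correct comes back unchanged, because collapsing its sequences would produce invalid UTF-8 and the loop stops before applying the change.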

Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
