Detect encoding and make everything UTF-8

Question

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

  1. The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.

  2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.

  3. In other cases, the "ß" is saved as a "ß", so without any change. Then it is also displayed wrongly.

What can I do to avoid the cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

  1. How do I find out what encoding the text uses?
  2. How do I convert it to UTF-8, whatever the old encoding is?

Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?

Recommended answer

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
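
To make the double-encoding problem concrete, here is a minimal sketch. It uses mb_convert_encoding() from ISO 8859-1 to UTF-8, which performs the same conversion as utf8_encode(), and assumes the mbstring extension is loaded. The helper to_utf8_if_needed() is a hypothetical name of mine, not part of any library, and ISO 8859-1 as the fallback source encoding is an assumption.

```php
<?php
// "Fußball" with "ß" already stored as UTF-8 (bytes C3 9F).
$utf8 = "Fu\xC3\x9Fball";

// Same conversion utf8_encode() performs: every byte is read as ISO 8859-1.
// The two bytes of "ß" become two separate characters, i.e. garbled output.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), "\n"; // 4675c383c29f62616c6c

// Hypothetical guard: convert only when the input is not valid UTF-8.
// The fallback source encoding is an assumption, not something detected.
function to_utf8_if_needed(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already well-formed UTF-8, leave it alone
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

echo bin2hex(to_utf8_if_needed($utf8)), "\n";        // 4675c39f62616c6c
echo bin2hex(to_utf8_if_needed("Fu\xDFball")), "\n"; // 4675c39f62616c6c
```

The guard is only a heuristic: a short ISO 8859-1 string whose bytes happen to form valid UTF-8 would pass through unconverted, and it does not help with strings that mix encodings.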

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
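
For intuition about how a mixed string can be handled at all, here is a simplified sketch of the general idea. It is not the library's actual code: mixed_to_utf8() is a hypothetical function, it treats every stray high byte as ISO 8859-1 (ignoring the Windows-1252 specifics), its UTF-8 pattern is deliberately loose, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper, NOT the implementation of Encoding::toUTF8().
// Walks the input byte-wise: well-formed UTF-8 sequences are kept,
// any other byte >= 0x80 is assumed to be ISO 8859-1 and converted.
function mixed_to_utf8(string $text): string
{
    // Either one UTF-8 encoded character (loose check) or one arbitrary byte.
    $pattern = '/[\x00-\x7F]'
             . '|[\xC2-\xDF][\x80-\xBF]'
             . '|[\xE0-\xEF][\x80-\xBF]{2}'
             . '|[\xF0-\xF4][\x80-\xBF]{3}'
             . '|./s';
    preg_match_all($pattern, $text, $matches);

    $out = '';
    foreach ($matches[0] as $chunk) {
        if (strlen($chunk) === 1 && ord($chunk) >= 0x80) {
            // Stray high byte: assumed (not detected) to be ISO 8859-1.
            $chunk = mb_convert_encoding($chunk, 'UTF-8', 'ISO-8859-1');
        }
        $out .= $chunk;
    }
    return $out;
}

// First "ß" is a single ISO 8859-1 byte, second one is already UTF-8.
$mixed = "Fu\xDFball und Fu\xC3\x9Fball";
echo mixed_to_utf8($mixed), "\n";
```

The design trade-off is that text which really is ISO 8859-1 but happens to look like a UTF-8 sequence (for example "Ã" followed by "©") will be kept as the UTF-8 character instead of being converted.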

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football");

Will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
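
As a rough illustration of how such a repair can work in the simplest case, the sketch below undoes repeated ISO 8859-1 to UTF-8 encoding by decoding one layer at a time for as long as the result is still valid UTF-8. This is an assumption-laden heuristic of mine, not the library's algorithm: undo_double_encoding() and its $maxRounds parameter are hypothetical, it needs the mbstring extension, and it handles neither Windows-1252 mojibake (characters such as "ƒ") nor strings where only some characters are garbled.

```php
<?php
// Hypothetical helper, NOT the implementation of Encoding::fixUTF8().
// Peels off layers of accidental ISO 8859-1 -> UTF-8 re-encoding.
function undo_double_encoding(string $text, int $maxRounds = 5): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        // Stop on invalid UTF-8, or on any character above U+00FF,
        // which cannot come from a pure ISO 8859-1 re-encoding.
        if (!mb_check_encoding($text, 'UTF-8')
            || preg_match('/[^\x{0}-\x{FF}]/u', $text) === 1) {
            break;
        }
        // Remove one layer: map each character back to a single byte.
        $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
        // Accept the result only if it changed and is still valid UTF-8.
        if ($candidate === $text || !mb_check_encoding($candidate, 'UTF-8')) {
            break;
        }
        $text = $candidate;
    }
    return $text;
}

// Build a doubly garbled sample from correct UTF-8 ("é" is C3 A9).
$correct = "F\xC3\xA9d\xC3\xA9ration Camerounaise de Football";
$garbled = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');
$garbled = mb_convert_encoding($garbled, 'UTF-8', 'ISO-8859-1');

echo undo_double_encoding($garbled), "\n";
```

Because it is a heuristic, text that legitimately contains a sequence like "Ã©" would also be "repaired", so a function of this kind is best applied only to input already known to be garbled.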

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
