mb_detect_order()在PHP中的奇怪行为 [英] Strange behaviour of mb_detect_order() in PHP

查看:65
本文介绍了mb_detect_order()在PHP中的奇怪行为的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想检测一些文本的编码(使用PHP)。
为此,我使用mb_detect_encoding()函数。



问题是如果我改变mb_detect_order()可能编码的顺序,该函数返回不同的结果,



考虑以下示例

  $ html = << STR 
ちょっとのアクセスでちてしまったり,サーバー障害が多いレンターサーバーをぶぶああなジジジジりりりりりりりりりりりいいいいいいいいいいいいいいいいいいいいいいいいいいいいいよ;
mb_detect_order('UTF-8','EUC-JP','SJIS','eucJP-win','SJIS-win','JIS','ISO-2022-JP' -8859-1' , 'ISO-8859-2'));
$ originalEncoding = mb_detect_encoding($ str);
die($ originalEncoding); // $ originalEncoding ='UTF-8'

但是,如果您更改mb_detect_order中的编码顺序)结果将不同:

  mb_detect_order(array('EUC-JP','UTF-8','SJIS' ,ISO,ISO-8859-2,ISO-8859-2)); 
die($ originalEncoding); // $ originalEncoding ='EUC-JP'




所以我的问题是:

为什么会发生这种情况?

PHP中有没有方法正确和明确地检测文本的编码?

解决方案

这是我预期会发生的。



检测算法可能只是继续尝试,按照您在 mb_detect_order 中指定的编码,然后返回第一个根据该字节生效的编码。



<更聪明的东西需要统计方法(我认为机器学习是常用的)。



编辑: 这篇文章更智能的方法。


由于其重要性,自动字符集检测已经在主要的Internet应用程序(如Mozilla或Internet Explorer)中实现。它们是非常准确和快速的,但实施在个案的基础上应用了许多领域特定的知识。与他们的方法相反,我们针对一个可以统一应用于每个字符集的简单算法,该算法基于成熟的标准机器学习技术。我们还研究了语言和字符集检测之间的关系,并比较了基于字节的算法和基于字符的算法。我们使用朴素贝叶斯(NB)和支持向量机(SVM)。



I would like to detect encoding of some text (using PHP). For that purpose i use mb_detect_encoding() function.

The problem is that the function returns different results if i change the order of possible encodings with mb_detect_order() function.

Consider the following example

$html = <<< STR
ちょっとのアクセスで落ちてしまったり、サーバー障害が多いレンタルサーバーを選ぶとあなたのビジネス等にかなりの影響がでてしまう可能性があります。特に商売をされている個人の方、法人の方は気をつけるようにしてください
STR;
mb_detect_order(array('UTF-8','EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));
$originalEncoding = mb_detect_encoding($str);
die($originalEncoding); // $originalEncoding = 'UTF-8'

However if you change the order of encodings in mb_detect_order() the results will be different:

mb_detect_order(array('EUC-JP','UTF-8', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP','ISO-8859-1','ISO-8859-2'));        
die($originalEncoding); // $originalEncoding = 'EUC-JP'



So my questions are:
Why is that happening ?
Is there a way in PHP to correctly and unambiguously detect encoding of text ?

解决方案

That's what I would expect to happen.

The detection algorithm probably just keeps trying, in order, the encodings you specified in mb_detect_order and then returns the first one under which the bytestream would be valid.

Something more intelligent requires statistical methods (I think machine learning is commonly used).

EDIT: See e.g. this article for more intelligent methods.

Due to its importance, automatic charset detection is already implemented in major Internet applications such as Mozilla or Internet Explorer. They are very accurate and fast, but the implementation applies many domain specific knowledges in case-by-case basis. As opposed to their methods, we aimed at a simple algorithm which can be uniformly applied to every charset, and the algorithm is based on well-established, standard machine learning techniques. We also studied the relationship between language and charset detection, and compared byte-based algorithms and character-based algorithms. We used Naive Bayes (NB) and Support Vector Machine (SVM).

这篇关于mb_detect_order()在PHP中的奇怪行为的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆