Charset detection in PHP
Question
// I've added a new take on this; please see "Cheating PHP integers". Any help will be much appreciated. I've had an idea to try to hack the storage of the arrays by packing the integers into unsigned bytes (I only need 8- or 16-bit integers), which should reduce memory use considerably.
Hello,
I'm currently working on a custom charset detection library. I created a port of Mozilla's charset detection algorithm, using chardet (the Python port) as a guide. However, it is extremely memory-intensive in PHP (around 30 MB of memory if I load only the Western language detection). I've optimised all I can, short of rewriting it from scratch to load each piece individually (which would reduce memory but make it a lot slower).
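On the memory side, the byte-packing idea from the note at the top could look roughly like the following. This is only a sketch: the sample data and variable names are made up for illustration, it assumes the values really fit in 8 or 16 bits, and the offset argument to `unpack()` needs PHP 7.1 or later.

```php
<?php
// Sketch only: keep small unsigned integers in a binary string instead of
// a PHP array. $values and $wide are made-up sample data.
$values = [0, 17, 255, 128];

// 'C*' packs every argument as one unsigned 8-bit byte.
$packed = pack('C*', ...$values);

// Random access: byte $i of the string is the value at index $i.
$third = ord($packed[2]);

// 16-bit variant: 'n*' packs unsigned big-endian 16-bit integers,
// so element $i starts at byte offset $i * 2.
$wide     = [1, 300, 65535];
$packed16 = pack('n*', ...$wide);
$second   = unpack('n', $packed16, 1 * 2)[1]; // unpack() keys start at 1

printf("%d bytes, third=%d, second16=%d\n", strlen($packed), $third, $second);
```

Each value then costs one or two bytes rather than a full array slot, at the price of an `ord()` or `unpack()` call on every read, so whether it pays off depends on how often the tables are accessed.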
My question: do you know of any LGPL PHP libraries that do charset detection? This would be purely for research, to give me a slight nudge in the right direction.
I already know of mb_detect_encoding, but it is far too limited and produces far too many false positives with the text files I have (yet Python's chardet detects them perfectly).
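One likely source of such false positives is that single-byte encodings accept almost any input, so validity alone cannot tell them apart. The sketch below illustrates this with made-up sample bytes. It is not a description of what the asker's files contain.

```php
<?php
// Sample bytes (assumed for illustration): "café" in two encodings.
$utf8   = "caf\xC3\xA9"; // UTF-8
$latin1 = "caf\xE9";     // ISO-8859-1

// UTF-8 has structure, so validation can reject the Latin-1 bytes.
$utf8IsUtf8   = mb_check_encoding($utf8, 'UTF-8');
$latin1IsUtf8 = mb_check_encoding($latin1, 'UTF-8');

// ISO-8859-1 assigns a character to every byte, so even the UTF-8
// string "validates" as ISO-8859-1: a false positive.
$utf8IsLatin1 = mb_check_encoding($utf8, 'ISO-8859-1');

// Strict mode skips candidates in which the input is not valid.
$guess = mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true);

var_dump($utf8IsUtf8, $latin1IsUtf8, $utf8IsLatin1, $guess);
```

Statistical detectors such as chardet go further and score how plausible the decoded text is, which is why they can separate encodings that a pure validity check cannot.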
Recommended answer
I created a method that converts content to UTF-8 correctly. It was hard to figure out what the current encoding is, so I came up with this solution:
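<?php
// Return $content as UTF-8, converting it only when it is not already valid UTF-8.
function _convert($content) {
    // Treat the input as UTF-8 only if it validates AND survives a
    // UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    if (!mb_check_encoding($content, 'UTF-8')
        || $content !== mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {
        // No source encoding is given here, so mbstring assumes its
        // internal encoding (see mb_internal_encoding()) for the input.
        $content = mb_convert_encoding($content, 'UTF-8');
        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not convert to UTF-8');
        }
    }
    return $content;
}
?>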
As you can see, I do a round-trip conversion (UTF-8 to UTF-32 and back) to check whether the content stays the same, and convert it if it does not. Maybe you can use some of this code.
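A variation on the same idea names the source encoding explicitly instead of leaving it to mbstring's internal encoding. This is a minimal sketch, not the answerer's code: the function name `to_utf8` and the `$fallback` parameter are hypothetical, and it assumes that any input which is not valid UTF-8 is Windows-1252.

```php
<?php
// Sketch: keep valid UTF-8 as-is, otherwise convert from an assumed
// fallback encoding. to_utf8() and $fallback are hypothetical names.
function to_utf8(string $content, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content; // already valid UTF-8, leave untouched
    }
    // Not valid UTF-8: reinterpret the bytes using the fallback encoding.
    return mb_convert_encoding($content, 'UTF-8', $fallback);
}

$fromUtf8   = to_utf8("caf\xC3\xA9"); // valid UTF-8 input
$fromLatin1 = to_utf8("caf\xE9");     // single-byte input

var_dump($fromUtf8 === $fromLatin1);
```

The fallback is a guess. If the files may be in several legacy encodings, a detection step is still needed to choose it.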