Charset detection in PHP

Question

I've added a new take on this; please see "Cheating PHP integers". Any help would be much appreciated. My idea is to hack the arrays' storage by packing the integers into unsigned bytes (I only need 8- or 16-bit integers), which should reduce memory use considerably.
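The packing idea mentioned above could look roughly like the following. This is only a sketch, not the asker's actual code: the sample values and the helper name `read_u16` are made up for illustration, and it assumes PHP 7.1 or later for the offset argument of `unpack`.

```php
<?php
// Hypothetical sample data: every value fits in 16 bits.
$values = [3, 200, 65535, 0];

// pack('n*') writes each integer as an unsigned 16-bit big-endian value,
// so the whole list is held in one string at 2 bytes per entry.
$packed = pack('n*', ...$values);

// Hypothetical helper: read the entry at index $i without unpacking the rest.
function read_u16(string $packed, int $i): int
{
    // The third argument to unpack() is a byte offset into the string.
    return unpack('n', $packed, $i * 2)[1];
}

$third = read_u16($packed, 2);
```

The trade-off is that every read goes through `unpack`, so lookups cost more CPU than a plain array access in exchange for the smaller footprint.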

Hello,

I'm currently working on custom charset detection libraries. I created a port of Mozilla's charset detection algorithm, using chardet (the Python port) as a guide. However, it is extremely memory intensive in PHP (around 30 MB if I load only Western language detection). I've optimised everything I can short of rewriting it from scratch to load each piece on demand, which would reduce memory use but make it a lot slower.

My question is: do you know of any LGPL PHP libraries that do charset detection? This would be purely for research, to give me a slight guiding hand in the right direction.

I already know of mb_detect_encoding, but it's far too limited and produces far too many false positives with the text files I have (yet Python's chardet detects them perfectly).
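One plausible source of such false positives, offered here as an assumption rather than something the question states, is that single-byte encodings such as ISO-8859-1 assign a character to every byte value, so a validity check alone can never rule them out. A small sketch with made-up sample strings:

```php
<?php
// "café" encoded two ways (sample data for illustration).
$utf8   = "caf\xC3\xA9";
$latin1 = "caf\xE9";

// UTF-8 has structural rules, so the lone 0xE9 byte is rejected.
$latin1_is_utf8 = mb_check_encoding($latin1, 'UTF-8');

// ISO-8859-1 maps every byte to a character, so UTF-8 bytes pass as well.
$utf8_is_latin1 = mb_check_encoding($utf8, 'ISO-8859-1');
```

Because validity cannot separate the two, statistical detectors such as chardet rely on character frequency models instead, which is where the memory cost comes from.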

Recommended answer

I wrote a method that converts content to UTF-8 correctly. The hard part was working out what encoding the content is currently in, so I arrived at this solution:

<?php
function _convert($content) {
    // Treat the input as UTF-8 only if it validates and survives a
    // UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    if (!mb_check_encoding($content, 'UTF-8')
        || $content !== mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {

        // No source encoding is given, so mbstring assumes its internal encoding.
        $content = mb_convert_encoding($content, 'UTF-8');

        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not convert to UTF-8');
        }
    }
    return $content;
}
?>

As you can see, I do a round-trip conversion (UTF-8 to UTF-32 and back) to check whether the content is still the same, and convert it if it is not. Maybe you can use some of this code.
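A variation on the same check, sketched under the assumption that anything which is not valid UTF-8 is in one known fallback encoding. The function name `to_utf8` and the `ISO-8859-1` default are illustrative choices, not part of the answer:

```php
<?php
/**
 * Hypothetical helper: return $content as UTF-8.
 * $fallback is the assumed source encoding for input that is not valid UTF-8.
 */
function to_utf8(string $content, string $fallback = 'ISO-8859-1'): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned unchanged.
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content;
    }
    // Otherwise reinterpret the bytes as the assumed fallback encoding.
    return mb_convert_encoding($content, 'UTF-8', $fallback);
}

$from_latin1 = to_utf8("caf\xE9");      // ISO-8859-1 input
$from_utf8   = to_utf8("caf\xC3\xA9");  // already UTF-8
```

Passing the source encoding explicitly avoids depending on mbstring's internal encoding setting, at the cost of guessing wrong when the input is in some third encoding.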
