PHP: How do I detect if an input string is Arabic

Question

Is there a way to detect the language of the data being entered via the input field?

Recommended answer

Hmm, I may offer an improved version of DimaKrasun's function:

function is_arabic($string) {
    if($string === 'arabic') {
         return true;
    }
    return false;
}

OK, just kidding!

Pekka's suggestion to use the Google Translate API is a good one! But then you are relying on an external service, which always makes things more complicated.

I think Rushyo's approach is good! It's just not that easy. I wrote the following function for you. It's not tested, but it should work...

<?php
function uniord($u) {
    // Copied from the php.net comments. Returns the Unicode code point of a
    // single UTF-8 character. UCS-2 only covers the Basic Multilingual Plane,
    // which is enough for the Arabic block.
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}
function is_arabic($str) {
    // Convert to UTF-8 if the input is in another encoding.
    // mb_convert_encoding() takes the target encoding first, then the source.
    $encoding = mb_detect_encoding($str);
    if($encoding !== false && $encoding !== 'UTF-8') {
        $str = mb_convert_encoding($str, 'UTF-8', $encoding);
    }

    /*
    $str = str_split($str); <- this function is not multibyte safe: it splits by bytes, not characters, so we cannot use it
    $str = preg_split('//u',$str); <- this would probably work fine, but a bug was reported in some PHP versions where it also splits by bytes and not characters
    */
    preg_match_all('/.|\n/u', $str, $matches);
    $chars = $matches[0];
    $arabic_count = 0;
    $latin_count = 0;
    $total_count = 0;
    foreach($chars as $char) {
        //$pos = ord($char); we can't use that: it only reads the first byte, so it is not multibyte safe
        $pos = uniord($char);
        // echo $char ." --> ".$pos.PHP_EOL; // uncomment to debug

        if($pos >= 0x0600 && $pos <= 0x06FF) {
            // Arabic block, U+0600..U+06FF (1536..1791)
            $arabic_count++;
        } else if(($pos >= 65 && $pos <= 90) || ($pos >= 97 && $pos <= 122)) {
            // illustrative range only: basic ASCII letters A-Z and a-z
            $latin_count++;
        }
        $total_count++;
    }
    if($total_count === 0) {
        // empty input: avoid division by zero
        return false;
    }
    if(($arabic_count/$total_count) > 0.6) {
        // more than 60% Arabic characters, so it's probably Arabic
        return true;
    }
    return false;
}
$arabic = is_arabic('عربية إخبارية تعمل على مدار اليوم. يمكنك مشاهدة بث القناة من خلال الموقع');
var_dump($arabic);
?>

Final thoughts: as you can see, I added a Latin counter as an example. Its range is only illustrative, but this way you could detect other character sets (Hebrew, Latin, Arabic, Hindi, Chinese, etc.).

You may also want to eliminate some characters first, such as @, spaces, line breaks, slashes, etc. The PREG_SPLIT_NO_EMPTY flag for the preg_split function would be useful, but because of the bug mentioned in the code comment I didn't use it here.
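A minimal sketch of that clean-up step, assuming you only want letters (plus combining marks such as Arabic diacritics) to count towards the ratio. The helper name strip_non_letters is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): drop everything that
// is not a letter or a combining mark, so that @, spaces, digits, slashes
// and line breaks do not dilute the character ratio.
function strip_non_letters(string $str): string {
    // \p{L} = any Unicode letter, \p{M} = combining marks.
    // The /u flag makes PCRE treat pattern and input as UTF-8.
    $clean = preg_replace('/[^\p{L}\p{M}]+/u', '', $str);
    // preg_replace() returns null on invalid UTF-8 input.
    return $clean ?? '';
}

echo strip_non_letters("user@example.com / مرحبا 123\n"), PHP_EOL;
```

You could run the input through a filter like this before counting characters, so the 60% threshold is measured against letters only.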

You can also keep a counter for every character set and see which one occurs the most.
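One way to sketch the per-script counters is with PCRE's Unicode script properties instead of the code point arithmetic used above. This is a swapped-in technique, not the answer's method, and the function names count_scripts and dominant_script are made up for illustration. It assumes PHP 7.3+ (for array_key_first) with a PCRE build that has Unicode support.

```php
<?php
// Hypothetical helper: count how many characters belong to each script.
function count_scripts(string $str): array {
    $scripts = ['Arabic', 'Hebrew', 'Latin', 'Devanagari', 'Han'];
    $counts = [];
    foreach ($scripts as $script) {
        // preg_match_all() returns the number of matches, or false on error
        // (for example when the input is not valid UTF-8).
        $n = preg_match_all('/\p{' . $script . '}/u', $str);
        $counts[$script] = ($n === false) ? 0 : $n;
    }
    return $counts;
}

// Hypothetical helper: name of the most frequent script, or null if none
// of the listed scripts occurs at all.
function dominant_script(string $str): ?string {
    $counts = count_scripts($str);
    arsort($counts);                  // highest count first, keys preserved
    $top = array_key_first($counts);
    return $counts[$top] > 0 ? $top : null;
}

var_dump(dominant_script('مرحبا hi'));
```

The list of scripts is only an example; add or remove entries depending on what you expect users to type.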

And finally, you should consider cutting your string off after 200 characters or so. That should be enough to tell which character set is used.
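A small sketch of the truncation, assuming the input is already UTF-8. The helper name sample_text is hypothetical, and the default of 200 is simply the figure suggested above.

```php
<?php
// Hypothetical helper: only look at the first $limit characters.
function sample_text(string $str, int $limit = 200): string {
    // mb_substr() counts characters rather than bytes, so a multibyte
    // character is never cut in half (plain substr() could do that).
    return mb_substr($str, 0, $limit, 'UTF-8');
}

echo mb_strlen(sample_text(str_repeat('ع', 500)), 'UTF-8'), PHP_EOL;
```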

And you have to do some error handling, for things like division by zero, empty strings, etc. Please don't forget that. Any questions? Comment!
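As a rough illustration of that error handling, the sketch below computes the share of Arabic letters and returns 0.0 instead of dividing by zero. It uses PCRE script properties rather than the uniord() approach, and arabic_ratio is a hypothetical name.

```php
<?php
// Hypothetical helper: share of Arabic-script letters among all letters.
// Returns 0.0 for empty input, input without letters, or invalid UTF-8,
// so the caller never hits a division by zero.
function arabic_ratio(string $str): float {
    $letters = preg_match_all('/\p{L}/u', $str);
    if ($letters === false || $letters === 0) {
        return 0.0;
    }
    // The lookahead restricts the match to letters, so Arabic digits and
    // diacritics do not push the ratio above 1.
    $arabic = preg_match_all('/(?=\p{L})\p{Arabic}/u', $str);
    return ($arabic === false ? 0 : $arabic) / $letters;
}

var_dump(arabic_ratio('') > 0.6);
```

Comparing the result against a threshold such as 0.6 then gives the same kind of yes/no answer as is_arabic().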

If you want to detect the LANGUAGE of a string, you should split it into words and look those words up in some predefined tables. You don't need a complete dictionary; the most common words are enough and it should work fine. Tokenization and normalization are a must as well! There are libraries for that anyway, and this is not what you asked for :) I just wanted to mention it.
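To make the word-table idea concrete, here is a deliberately tiny sketch. The word lists are illustrative assumptions (five very common words per language), the names COMMON_WORDS and guess_language are hypothetical, and normalization is limited to lowercasing. A real detector would need much larger tables or one of the existing libraries.

```php
<?php
// Hypothetical, deliberately tiny tables of very common words.
// Arabic and Persian share a script, which is why counting characters
// alone cannot tell these two languages apart.
const COMMON_WORDS = [
    'arabic'  => ['من', 'في', 'على', 'إلى', 'هذا'],
    'persian' => ['از', 'در', 'به', 'که', 'است'],
    'english' => ['the', 'and', 'of', 'to', 'is'],
];

function guess_language(string $text): ?string {
    // Tokenization: lowercase, then split on anything that is not a
    // letter or combining mark.
    $words = preg_split(
        '/[^\p{L}\p{M}]+/u',
        mb_strtolower($text, 'UTF-8'),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    if ($words === false || $words === []) {
        return null;
    }
    $best = null;
    $bestHits = 0;
    foreach (COMMON_WORDS as $language => $table) {
        // array_intersect() keeps every input word found in the table,
        // so repeated words count as repeated hits.
        $hits = count(array_intersect($words, $table));
        if ($hits > $bestHits) {
            $best = $language;
            $bestHits = $hits;
        }
    }
    return $best; // null when no table word was found
}

var_dump(guess_language('The cat is on the roof of the house'));
```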
