How to detect Arabic chars using Perl regex?


Question

I'm parsing some HTML pages and need to detect any Arabic characters in them. I've tried various regexes, but with no luck.

Does anyone know a working way to do that?

Thanks.

Here is the page I'm processing: http://pastie.org/2509936

My code is:

#!/usr/bin/perl
use LWP::UserAgent;

# set inheritance
@MyAgent::ISA = qw(LWP::UserAgent);

$ua = LWP::UserAgent->new;
$q = 'http://pastie.org/2509936';   # the request needs an absolute URL, scheme included
$request = HTTP::Request->new('GET', $q);
$response = $ua->request($request);
if ($response->is_success) {
    if ($response->content =~ /[\p{Script=Arabic}]/g) {
        print "found arabic";
    } else {
        print "not found";
    }
}

Recommended answer

EDIT (as I have obviously wandered into tchrist's area of expertise): skip $response->content, which always returns a raw byte string, and use $response->decoded_content, which applies any decoding hints it gets from the response headers.
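The difference between the two accessors can be sketched without touching the network. This is a minimal illustration, not the asker's actual response: the HTTP::Response object is built by hand, the body text is made up, and has_arabic is a hypothetical helper name. It assumes the HTTP::Message distribution (which ships with LWP) is installed.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Assumption: a hand-built response standing in for the real page.
# The body holds the UTF-8 octets of a short Arabic word inside a <p> tag.
my $body_octets = encode('UTF-8', "<p>\x{645}\x{631}\x{62D}\x{628}\x{627}</p>");
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    $body_octets,
);

# Hypothetical helper: true if the string holds any Arabic-script character.
sub has_arabic {
    my ($text) = @_;
    return (defined($text) && $text =~ /\p{Script=Arabic}/) ? 1 : 0;
}

# content returns octets: every "character" is in 0..255, so no match.
my $raw_hit = has_arabic($response->content);

# decoded_content applies the charset from the Content-Type header first.
my $decoded_hit = has_arabic($response->decoded_content);

print "raw content matches:     $raw_hit\n";
print "decoded content matches: $decoded_hit\n";
```

With a live LWP::UserAgent response the same two calls apply; only the way the response object is obtained differs.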

The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8. In fairness, there are no hints in the page itself about what the encoding is (update: the server does return the header Content-Type: text/html; charset=utf-8, though).

You can see this if you examine $response->content:

use List::Util 'max';
my $max_ord = max map{ord}split //, $response->content;
print "max ord of response content is $max_ord\n";

If you get a value less than 256, then you are reading this content in as raw bytes: each byte is treated as a separate character in the range 0..255, none of which is in the Arabic script, so your strings will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex:

use Encode;
my $content = decode('utf-8', $response->content);
# now check  $content =~ /\p{Arabic}/
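A self-contained version of that decode-then-match step might look like the sketch below. The sample strings and the helper name octets_contain_arabic are assumptions made for illustration; in the real script the octets would come from $response->content.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-8 octets, decodes them to characters,
# then tests for any character in the Arabic script.
sub octets_contain_arabic {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);   # octets -> Perl character string
    return $text =~ /\p{Arabic}/ ? 1 : 0;
}

# Assumption: made-up sample data standing in for downloaded page bodies.
my $arabic_octets = encode('UTF-8', "\x{633}\x{644}\x{627}\x{645}");
my $ascii_octets  = "plain ASCII text";

my $arabic_result = octets_contain_arabic($arabic_octets);
my $ascii_result  = octets_contain_arabic($ascii_octets);

# Matching the undecoded octets directly fails, as described above.
my $undecoded_result = $arabic_octets =~ /\p{Arabic}/ ? 1 : 0;

print "decoded Arabic sample: $arabic_result\n";
print "decoded ASCII sample:  $ascii_result\n";
print "undecoded octets:      $undecoded_result\n";
```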

Sometimes (and now I am wading well outside my area of expertise) the page you are loading contains hints about how it should be decoded, and $response->content may already be decoded correctly. In that case, the decode call above is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.
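Why a second decode can be harmful is easy to demonstrate on a string that has already been decoded. The sketch below uses made-up sample text; it relies on Encode's decode refusing input that contains characters above 255, and wraps the call in eval so the failure is caught rather than fatal.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: sample octets standing in for a downloaded UTF-8 page body.
my $octets = encode('UTF-8', "\x{639}\x{631}\x{628}\x{64A}");

# Decoding once turns the octets into a 4-character string.
my $chars = decode('UTF-8', $octets);

# Decoding the already-decoded string again is an error: it now holds
# characters above 255, which cannot be interpreted as octets.
my $twice = eval { decode('UTF-8', $chars) };
my $second_decode_failed = defined($twice) ? 0 : 1;

print "characters after one decode: ", length($chars), "\n";
print "second decode failed: $second_decode_failed\n";
```

The practical rule this suggests is to decode exactly once, at the point where the bytes enter the program, and to treat everything after that as character data.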
