How to detect Arabic chars using perl regex?

Problem description

I'm parsing some HTML pages and need to detect any Arabic characters inside them. I tried various regexes, but had no luck.

Does anyone know a working way to do that?

Thanks

Here is the page I'm processing: http://pastie.org/2509936

My code is:

#!/usr/bin/perl 
use LWP::UserAgent; 
@MyAgent::ISA = qw(LWP::UserAgent); 

# set inheritance 
$ua = LWP::UserAgent->new; 
$q = 'pastie.org/2509936';; 
$request = HTTP::Request->new('GET', $q); 
$response = $ua->request($request); 
if ($response->is_success) { 
    if ($response->content=~/[\p{Script=Arabic}]/g) { 
        print "found arabic"; 
    } else { 
        print "not found"; 
    } 
}

Recommended answer

EDIT (as I have obviously wandered into tchrist's area of expertise). Skip using $response->content, which always returns a raw byte string, and use $response->decoded_content instead, which applies any decoding hints it gets from the response headers.
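
To illustrate the difference between the two accessors without touching the network, here is a minimal sketch that builds an HTTP::Response by hand. The sample body and the variable names are assumptions made up for this illustration; only the Content-Type header value is taken from the answer.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Hypothetical stand-in for the fetched page: UTF-8 encoded bytes plus
# the Content-Type header the server is said to return.
my $body = encode('UTF-8', "<p>\x{0639}\x{0631}\x{0628}\x{064A}</p>");
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    $body,
);

# content() hands back the raw bytes; decoded_content() decodes them
# using the charset named in the Content-Type header.
my $raw_found     = $response->content         =~ /\p{Script=Arabic}/ ? 1 : 0;
my $decoded_found = $response->decoded_content =~ /\p{Script=Arabic}/ ? 1 : 0;

print "content:         ", ($raw_found     ? "found arabic" : "not found"), "\n";
print "decoded_content: ", ($decoded_found ? "found arabic" : "not found"), "\n";
```

In this sketch only the decoded string matches, because the raw byte string is seen by the regex as a run of code points below 256.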

The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8 (in fairness, there are no hints on the page itself about what the encoding is [update: the server does return the header Content-Type: text/html; charset=utf-8, though]).

You can check whether this is the case by examining $response->content:

use List::Util 'max';
my $max_ord = max map{ord}split //, $response->content;
print "max ord of response content is $max_ord\n";

If you get a value less than 256, then you are reading this content in as raw bytes, and your string will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex:

use Encode;
my $content = decode('utf-8', $response->content);
# now check  $content =~ /\p{Arabic}/
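
The effect of decoding can be reproduced offline with core modules only. The following is a rough sketch, not the asker's actual page: the sample string and the has_arabic helper are hypothetical names introduced here for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use List::Util 'max';

# Hypothetical sample: an Arabic word written with escapes, then
# encoded to the UTF-8 bytes a server would send over the wire.
my $chars = "<p>\x{0645}\x{0631}\x{062D}\x{0628}\x{0627}</p>";
my $bytes = encode('UTF-8', $chars);

# Helper (hypothetical name): 1 if the string holds an Arabic-script character.
sub has_arabic {
    my ($string) = @_;
    return $string =~ /\p{Script=Arabic}/ ? 1 : 0;
}

# Same diagnostic as above: the highest code point in each string.
my $max_raw     = max map { ord } split //, $bytes;
my $max_decoded = max map { ord } split //, decode('UTF-8', $bytes);

my $raw_match     = has_arabic($bytes);
my $decoded_match = has_arabic(decode('UTF-8', $bytes));

print "raw bytes: max ord $max_raw, ",
    ($raw_match ? "found arabic" : "not found"), "\n";
print "decoded:   max ord $max_decoded, ",
    ($decoded_match ? "found arabic" : "not found"), "\n";
```

The raw byte string stays below 256 and does not match, while the decoded string contains code points in the Arabic block and does.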

Sometimes (and now I am wading well outside my area of expertise) the page you are loading contains hints about how it should be decoded, and $response->content may already be decoded correctly. In that case, the decode call above is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.
