Perl print matched content only

Problem description

I am developing a web crawler in Perl. It extracts the content of a page and then runs a pattern match to check the language of that content. The pattern matches on Unicode code point values.

Sometimes the extracted content contains text in multiple languages. The pattern match I used here prints all the text, but I want to print only the text that matches the Unicode values specified in the pattern.

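use LWP::UserAgent;
use HTML::ContentExtractor;

# the constructor takes key/value options; the user agent string goes under "agent"
my $uu         = LWP::UserAgent->new(agent => 'Mozilla 1.3');
my $extractorr = HTML::ContentExtractor->new();

# create response object to get the url ($url is set elsewhere)
my $responsee = $uu->get($url);
my $contentss = $responsee->decoded_content();

my $range = qr/([\x{0C00}-\x{0C7F}]+)/;    # match particular language

# set the output layer before anything is printed
binmode(STDOUT, ":utf8");

if ($contentss =~ m/$range/) {
  $extractorr->extract($url, $contentss);
  print "$url\n";
  print $extractorr->as_text;    # prints ALL extracted text, not only the matched part
}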

Solution

It would be better to match characters with a particular Unicode property, rather than trying to formulate an appropriate character class.

The code points in the range 0x0C00...0x0C7F correspond to characters in Telugu (one of the Indian languages) which you can match using the regex /\p{Telugu}/.
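To print only the matching text rather than the whole page, one option is to collect every run of Telugu characters with a global match in list context. The sketch below is a minimal illustration, not the asker's final code: the helper name `telugu_runs` and the sample string are hypothetical, and the sample uses `\x{...}` escapes so the file itself stays plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of Telugu characters found in $text.
# Call it in list context; with /g and no capture groups, the match
# returns the list of all matched substrings.
sub telugu_runs {
    my ($text) = @_;
    my @runs = $text =~ /\p{Telugu}+/g;
    return @runs;
}

# Encode wide characters as UTF-8 on output.
binmode(STDOUT, ':encoding(UTF-8)');

# Sample mixed-language string (assumed input, stands in for the page text).
my $sample = "Hello \x{0C24}\x{0C46}\x{0C32}\x{0C41}\x{0C17}\x{0C41} world "
           . "\x{0C2D}\x{0C3E}\x{0C37} 123";

# Print each Telugu run on its own line; the Latin text and digits are skipped.
print "$_\n" for telugu_runs($sample);
```

In the crawler, the string passed in would be the extracted text (for example the result of `$extractorr->as_text`) instead of the sample.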

The other properties you will probably need are /\p{Kannada}/, /\p{Malayalam}/, /\p{Devanagari}/, and /\p{Tamil}/.
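When a page mixes several of these scripts, the properties can be kept in a lookup table and applied one by one. This is only a sketch under assumptions: the table `%script_re` and the helper `runs_by_script` are hypothetical names, and the list of scripts is just the five mentioned in this answer.

```perl
use strict;
use warnings;

# Hypothetical lookup table: script name => precompiled pattern for a run
# of characters in that script.
my %script_re = (
    Telugu     => qr/\p{Telugu}+/,
    Kannada    => qr/\p{Kannada}+/,
    Malayalam  => qr/\p{Malayalam}+/,
    Devanagari => qr/\p{Devanagari}+/,
    Tamil      => qr/\p{Tamil}+/,
);

# Hypothetical helper: return a hash ref mapping each script that occurs in
# $text to an array ref of the runs found for it. Scripts with no match are
# left out of the result.
sub runs_by_script {
    my ($text) = @_;
    my %found;
    for my $name (sort keys %script_re) {
        my $re   = $script_re{$name};
        my @runs = $text =~ /$re/g;
        $found{$name} = \@runs if @runs;
    }
    return \%found;
}

binmode(STDOUT, ':encoding(UTF-8)');

# Assumed sample: one Telugu run and one Devanagari run between Latin words.
my $sample = "intro \x{0C24}\x{0C46} middle \x{0915}\x{093E} end";
my $found  = runs_by_script($sample);

for my $name (sort keys %$found) {
    print "$name: @{ $found->{$name} }\n";
}
```

Note that a few shared characters, such as some punctuation marks used across Indic scripts, may match more than one of these properties depending on the Perl version, so the per-script lists are not guaranteed to be disjoint.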
