How to count the Chinese words in a file using regex in Perl?


Question

I tried the following Perl code to count the Chinese words in a file. It seems to run, but it does not produce the right result. Any help is greatly appreciated.

The error message is:

Use of uninitialized value $valid in concatenation (.) or string at word_counting.pl line 21, <FILE> line 21.
Total things  = 125, valid words = 

It seems to me that the problem is the file format. The "Total things" value is 125, which is the number of strings (125 lines). The strangest part is that my console displayed all the individual Chinese words correctly without any problem. The utf8 pragma is enabled.

#!/usr/bin/perl -w
use strict;
use utf8;
use Encode qw(encode);
use Encode::HanExtra;

my $input_file = "sample_file.txt";
my ($total, $valid);
my %count;

open (FILE, "< $input_file") or die "Can't open $input_file: $!";

while (<FILE>) {
 foreach (split) { #break $_ into words, assign each to $_ in turn
 $total++;
 next if /\W|^\d+/;  #strange words skip the remainder of the loop
 $valid++;
 $count{$_}++;  # count each separate word stored in a hash
 ## next comes here ##
      }
   }

   print "Total things  = $total, valid words = $valid\n";
   foreach my $word (sort keys %count) {
      print "$word \t was seen \t $count{$word} \t times.\n";
   }

##---Data----
sample_file.txt

那天约二更时,只见封肃方回来,欢天喜地.众人忙问端的.他乃说道:"原来本府新升的太爷姓贾名化,本贯胡州人氏,曾与女婿旧日相交.方才在咱门前过去,因见娇杏那丫头买线, 所以他只当女婿移住于此.我一一将原故回明,那太爷倒伤感叹息了一回,又问外孙女儿,我说看灯丢了.太爷说:`不妨,我自使番役务必探访回来.'说了一回话, 临走倒送了我二两银子."甄家娘子听了,不免心中伤感.一宿无话.至次日, 早有雨村遣人送了两封银子,四匹锦缎,答谢甄家娘子,又寄一封密书与封肃,转托问甄家娘子要那娇杏作二房. 封肃喜的屁滚尿流,巴不得去奉承,便在女儿前一力撺掇成了,乘夜只用一乘小轿,便把娇杏送进去了.雨村欢喜,自不必说,乃封百金赠封肃, 外谢甄家娘子许多物事,令其好生养赡,以待寻访女儿下落.封肃回家无话.


Recommended answer

We set STDOUT to the :utf8 IO layer so that the say calls won't print malformed data, then open the file with the same layer so that the diamond operator won't read malformed data. Afterward, inside the while loop, rather than splitting each line, we use a regex with the East_Asian_Width=Wide Unicode property, \p{Ea=W}, which matches one wide character at a time.

The use utf8 line is only there for my personal sanity checking and can be removed.

use strict;
use warnings;
use 5.010;
use utf8;
use autodie;

binmode(STDOUT, ':utf8');

open my $fh, '<:utf8', 'sample_file.txt';

my ($total, $valid) = (0, 0);  # start at 0 so the totals print even when nothing matches
my %count;

while (<$fh>) {
    $total += length;
    for (/(\p{Ea=W})/g) {
        $valid++;
        $count{$_}++;
    }
}

say "Total things  = $total, valid words = $valid";
for my $word (sort keys %count) {
   say "$word \t was seen \t $count{$word} \t times.";
}
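
To see what the regex above actually matches, here is a minimal sketch that runs the same \p{Ea=W} match on an in-memory string instead of a file. The sample string and the helper name count_wide_chars are assumptions made for illustration. The \p{Han} match is an alternative that is not part of the original answer: it matches only Han ideographs, whereas the Wide property also covers some CJK punctuation.

```perl
use strict;
use warnings;
use utf8;    # needed here because this source contains UTF-8 literals

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: count each East Asian "Wide" character
# in an already-decoded string and return a hash reference.
sub count_wide_chars {
    my ($text) = @_;
    my %count;
    $count{$_}++ for $text =~ /(\p{Ea=W})/g;
    return \%count;
}

# Assumed sample text: Chinese characters mixed with ASCII letters and digits.
my $sample = "封肃回家, 封肃喜 abc 123";
my $count  = count_wide_chars($sample);

# Sum the per-character counts to get the total number of wide characters.
my $valid = 0;
$valid += $_ for values %$count;

# Alternative (not from the original answer): Han ideographs only.
my @han = $sample =~ /\p{Han}/g;

print "wide characters = $valid, Han characters = ", scalar(@han), "\n";
print "$_ \t was seen \t $count->{$_} \t times.\n" for sort keys %$count;
```

The ASCII letters, digits, comma and spaces are not counted, because their East_Asian_Width is not Wide.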

Edit: J-16 SDiZ and daxim pointed out that the chances of sample_file.txt being in UTF-8 are slim. Read their comments, then take a look at the Encode module in perldoc, specifically the "Encoding via PerlIO" section.
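
If the file is not UTF-8, the same loop should still work once the file is opened with the matching :encoding(...) layer. The sketch below assumes, purely for illustration, that the data is GBK (cp936); the question does not state the actual encoding of sample_file.txt. It builds the encoded bytes in memory so that it runs without the file.

```perl
use strict;
use warnings;
use utf8;    # needed here because this source contains UTF-8 literals
use Encode qw(encode);

binmode(STDOUT, ':encoding(UTF-8)');

# Assumption: the input is GBK-encoded. Simulate its raw bytes in memory.
my $bytes = encode('cp936', "封肃回家无话.\n");

# Open the bytes through an :encoding layer so that every read
# returns decoded characters rather than raw bytes.
open my $fh, '<:encoding(cp936)', \$bytes
    or die "Can't open in-memory file: $!";

my $valid = 0;
while (my $line = <$fh>) {
    # Count one for every East Asian "Wide" character on the line.
    $valid++ for $line =~ /\p{Ea=W}/g;
}
close $fh;

print "valid characters = $valid\n";
```

For a real file, the in-memory reference would be replaced by the file name, for example open my $fh, '<:encoding(cp936)', 'sample_file.txt', with the encoding name changed to whatever the file actually uses.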
