Case Sensitivity In Perl Script - How Do I Make It Insensitive?

Question

How would I change the following markov script to treat capitalized and lowercase words as the same?

The entire idea is to help increase the quality of output of my markov text generator.

As it stands, if you plug 99 lowercase sentences into it and 1 capitalized sentence - you almost always find a non-markovized version of the capitalized sentence in the output.

# Copyright (C) 1999 Lucent Technologies
# Excerpted from 'The Practice of Programming'
# by Brian W. Kernighan and Rob Pike

# markov.pl: markov chain algorithm for 2-word prefixes

$MAXGEN = 10000;
$NONWORD = "\n";
$w1 = $w2 = $NONWORD;                    # initial state
while (<>)
{                                        # read each line of input
    foreach (split)
    {
      push(@{$statetab{$w1}{$w2}}, $_);
      ($w1, $w2) = ($w2, $_);        # multiple assignment
    }
}

push(@{$statetab{$w1}{$w2}}, $NONWORD);  # add tail
$w1 = $w2 = $NONWORD;

for ($i = 0; $i < $MAXGEN; $i++) 
{
    $suf = $statetab{$w1}{$w2};      # array reference
    $r = int(rand @$suf);            # @$suf is number of elems
    exit if (($t = $suf->[$r]) eq $NONWORD);
    print "$t\n";
    ($w1, $w2) = ($w2, $t);          # advance chain
}

Recommended answer

Nathan Fellman and mobrule are both suggesting a common practice: Normalization.

It's often simpler to process data so that it conforms to expected norms of content and structure, before doing the actual computation that is the main goal of the program or subroutine.
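
Applied to the script in the question, normalization can be as small as lowercasing each word before it is stored in the state table. The following is a minimal sketch of that idea, not the original authors' code: build_table is a hypothetical helper name, and it takes its input as a list of lines instead of reading <> so that it is easy to try out.

```perl
use strict;
use warnings;

my $NONWORD = "\n";

# Build the 2-word-prefix state table from a list of lines.
# Each word is lowercased before it is stored; this is the
# normalization step. (build_table is a hypothetical name.)
sub build_table {
    my @lines = @_;
    my %statetab;
    my ($w1, $w2) = ($NONWORD, $NONWORD);      # initial state
    for my $line (@lines) {
        for my $word (map { lc $_ } split(' ', $line)) {
            push @{ $statetab{$w1}{$w2} }, $word;
            ($w1, $w2) = ($w2, $word);
        }
    }
    push @{ $statetab{$w1}{$w2} }, $NONWORD;   # add tail
    return \%statetab;
}

# "The cat" and "the cat" now share one prefix entry.
my $table = build_table("The cat sat", "the cat ran");
print join(",", @{ $table->{the}{cat} }), "\n";
```

Because the lowercasing happens on input, the generated text is all lowercase too. For non-ASCII text, Perl 5.16 and later also offer fc (enabled with use feature 'fc') for Unicode case folding.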

The Markov chain program was interesting, so I decided to play with it.

Here's a version that allows you to control the number of layers in the Markov chain. By changing $DEPTH you can adjust the order of the simulation.

I broke the code into reusable subroutines. You can modify the normalization rules by changing the normalization routines. You can also generate a chain based on a defined set of values.

The code to generate the multi-layer state table was the most interesting bit. I could have used Data::Diver, but I wanted to work it out myself.
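
To make the multi-layer table concrete, here is a small standalone sketch of the same technique: keep a reference to the current slot, descend one hash level per history word, and let autovivification create missing levels. The name add_suffix and the sample words are made up for illustration.

```perl
use strict;
use warnings;

# Walk one hash level per history word, creating levels as needed,
# then append the suffix to the array at the end of the path.
# (add_suffix is a hypothetical name used only in this sketch.)
sub add_suffix {
    my ($table, $history, $word) = @_;
    my $slot = \$table;                      # reference to the current slot
    $slot = \${$slot}->{$_} for @$history;   # descend, autovivifying hashes
    push @{ $$slot }, $word;                 # autovivifies the suffix array
    return;
}

my %table;
my @history = ('-', '-');                    # two entries: a depth-2 chain
for my $word (qw(a b a b c)) {
    add_suffix(\%table, \@history, $word);
    push @history, $word;
    shift @history;
}

# Suffixes seen after the prefix "a b":
print join(",", @{ $table{a}{b} }), "\n";
```

The depth of the table is simply the length of the history array, so a three-entry history would build a three-level table with no other changes.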

The word normalization code now lets the normalizer return a list of words to process, rather than just a single word (see the update after the script). Serializing your processed corpus would be good, and using Getopt::Long for command-line switches remains to be done. I only did the fun bits.

It was a bit of a challenge for me to write this without using objects--this really felt like a good place to make a Markov generator object. I like objects. But, I decided to keep the code procedural so it would retain the spirit of the original.

Have fun.

#!/usr/bin/perl
use strict;
use warnings;

use IO::Handle;

use constant NONWORD => "-";
my $MAXGEN = 10000;
my $DEPTH  = 2;

my %state_table;

process_corpus( \*ARGV, $DEPTH, \%state_table );
generate_markov_chain( \%state_table, $MAXGEN );


sub process_corpus {
    my $fh    = shift;
    my $depth = shift;
    my $state_table = shift || {};

    my @history = (NONWORD) x $depth;


    while( my $raw_line = $fh->getline ) {

        my $line = normalize_line($raw_line);
        next unless defined $line;

        my @words = map normalize_word($_), split /\s+/, $line;
        for my $word ( @words ) {

            next unless defined $word; 

            add_word_to_table( $state_table, \@history, $word );
            push  @history, $word;
            shift @history;
        }

    }

    add_word_to_table( $state_table, \@history, NONWORD );

    return $state_table;
}

# This was the trickiest to write.
# $node has to be a reference to the slot so that 
# autovivified items will be retained in the $table.
sub add_word_to_table {
    my $table   = shift;
    my $history = shift;
    my $word    = shift;

    my $node = \$table;

    for( @$history ) {
        $node = \${$node}->{$_};
    }

    push @$$node, $word;

    return 1;
}

# Replace this with anything.
# Return an empty list to skip a word
sub normalize_word {
    my $word = shift;
    $word =~ s/[^A-Z]//g;
    return length $word ? $word : ();
}

# Replace this with anything.
# Return undef to skip a line
sub normalize_line {
    return uc shift;
}


sub generate_markov_chain {
    my $table   = shift;
    my $length  = shift;
    my $history = shift || [];

    my $node = $table;

    unless( @$history ) {

        while( 
            ref $node eq ref {}
                and
            exists $node->{NONWORD()} 
        ) {
            $node = $node->{NONWORD()};
            push @$history, NONWORD;
        }

    }

    for (my $i = 0; $i < $length; $i++) {

        my $word = get_word( $table, $history );

        last if $word eq NONWORD;
        print "$word\n";

        push @$history, $word;
        shift @$history;
    }

    return $history;
}


sub get_word {
    my $table   = shift;
    my $history = shift;

    for my $step ( @$history ) {
        $table = $table->{$step};
    }

    my $word = $table->[ int rand @$table ];
    return $word;
}

Update: I fixed the above code to handle multiple words coming back from the normalize_word() routine.

To leave case intact and treat punctuation symbols as words, replace normalize_line() and normalize_word():

sub normalize_line {
    return shift;
}

sub normalize_word {
    my $word = shift;

    # Sanitize words to only include letters and ?,.! marks 
    $word =~ s/[^A-Z?.,!]//gi;

    # Break the word into multiple words as needed.
    my @words = split /([.?,!])/, $word;

    # return all non-zero length words. 
    return grep length, @words;
}
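
As a quick check of how this version splits its input, the sketch below repeats the normalize_word() routine shown above and runs it on a few sample tokens. The sample tokens are made up for illustration.

```perl
use strict;
use warnings;

# Same routine as above, repeated so this snippet runs on its own.
sub normalize_word {
    my $word = shift;
    $word =~ s/[^A-Z?.,!]//gi;               # keep letters and ? . , ! only
    my @words = split /([.?,!])/, $word;     # capture group keeps the marks
    return grep length, @words;              # drop empty fields
}

# Print each raw token next to the words it is broken into.
for my $raw ('Hello,', 'world!', "it's", '...') {
    print "$raw => ", join(' | ', normalize_word($raw)), "\n";
}
```

A token such as "Hello," comes back as two words, "Hello" and ",", so the comma takes part in the chain like any other word. Characters outside the allowed set, such as the apostrophe in "it's", are dropped.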

The other big lurking gotcha is that I used - as the NONWORD value. If you want to include a hyphen as a punctuation symbol, you will need to change the NONWORD constant definition (the use constant line near the top of the script). Just choose something that can never be a word.
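
One candidate, borrowed from the original markov.pl in the question, is "\n": words come from splitting on whitespace, so a newline can never appear as a word. The sketch below is only an illustration of that reasoning, and sentinel_clashes is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumption: "\n" as the sentinel, as in the original markov.pl.
use constant NONWORD => "\n";

# Count how many whitespace-separated tokens of a line equal the sentinel.
# (sentinel_clashes is a hypothetical name used only in this sketch.)
sub sentinel_clashes {
    my $line = shift;
    return scalar grep { $_ eq NONWORD } split /\s+/, $line;
}

# A bare hyphen is an ordinary token here and does not clash.
print sentinel_clashes("well-known - dash\n"), "\n";
```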
