Perl regular expression | how to exclude words from a file

Question

I am looking for Perl regular expression syntax for some requirements I have in a project. First, I want to exclude strings that contain words from a txt file (dictionary).

For example, if my file contains these strings:

path.../Document.txt |
  tree
  car
  ship

then, using a regular expression, I expect:

a1testtre  --  match
orangesh1  --  match
apleship3  --  not match  [contains word from file ]

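I also have one more requirement that I couldn't solve. I have to create a regex that rejects a string in which a character is doubled (two identical characters in a row) too many times; in the examples below, two doubled characters are fine and three are not.

For example: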

adminnisstrator21     -- match  (have 2 times a repetition of chars)
kkeeykloakk           -- not match have over 3 times repetition
stack22ooverflow      -- match  (have 2 times a repetition of chars)

For this I have tried

\b(?:([a-z])(?!\1))+\b

but it works only for the first repeated character. Any idea how to solve these two?

Solution

One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, whereby a match excludes the string.

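use warnings;
use strict;
use feature qw(say);

use Path::Tiny;

# The dictionary file with the words to exclude is given on the command line
my $file = shift // die "Usage: $0 file\n";

# Slurp the whole file and split it on whitespace into words
my @words = split ' ', path($file)->slurp;
die "No words found in $file\n" if not @words;

# Alternation of the words, with any regex metacharacters escaped
my $exclude = join '|', map { quotemeta } @words;

foreach my $string (qw(a1testtre orangesh1 apleship3)) 
{ 
    # A match means the string contains a listed word, so it is excluded
    if ($string !~ /$exclude/) { 
        say "OK: $string"; 
    }
}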

I use Path::Tiny to read the file into a string ("slurp"), which is then split by whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any occur in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
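As a minimal sketch of the qr remark, the pattern can be compiled once and kept in a variable. The helper name build_exclude_regex is hypothetical, and the inline word list stands in for the contents of the dictionary file so that the example needs only core Perl. It also returns nothing for an empty word list, on the assumption that an empty pattern would not be a useful filter.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical helper: build one compiled pattern from a list of words.
# Returns undef (in scalar context) when there are no words to exclude.
sub build_exclude_regex {
    my @words = grep { length } @_;
    return unless @words;

    # Escape regex metacharacters in each word, then join into an alternation
    my $alternation = join '|', map { quotemeta } @words;
    return qr/$alternation/;
}

# Inline list standing in for the words read from the file
my $exclude = build_exclude_regex(qw(tree car ship));

for my $string (qw(a1testtre orangesh1 apleship3)) {
    # With no pattern at all, nothing is excluded
    my $ok = !defined $exclude || $string !~ $exclude;
    say $ok ? "OK: $string" : "excluded: $string";
}
```

The compiled pattern can be passed around and matched directly with =~ or !~, or interpolated into a larger regex.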

This can be tweaked and improved depending on your use case, for one with regard to the order of patterns with common parts in the alternation.

The check that repeated successive characters occur fewer than three times in a string:

foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
    my @chars_that_repeat = $string =~ /(.)\1+/g;

    if (@chars_that_repeat < 3) { 
        say "OK: $string";
    }
}

A long string of repeated chars (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
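The difference between the two counting rules can be seen side by side in a small sketch. The helper names count_runs and count_pairs are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers. In list context a /g match with one capture group
# returns one element per match, so the list length is the number of matches.

# With +, a whole run of the same character is consumed as one match
sub count_runs {
    my @m = $_[0] =~ /(.)\1+/g;
    return scalar @m;
}

# Without +, each match consumes exactly two identical characters
sub count_pairs {
    my @m = $_[0] =~ /(.)\1/g;
    return scalar @m;
}

for my $string (qw(aaaa aaXaa kkeeykloakk)) {
    printf "%-12s runs=%d pairs=%d\n",
        $string, count_runs($string), count_pairs($string);
}
```

For aaaa this reports one run but two pairs, while aaXaa gives two under either rule.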

This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.


  Consider an example with exclusion-words: so, sole, and solely. If you only need to check whether any one of these matches then you'd want shorter ones first in the alternation

my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==>  so|sole|solely

for a quicker match (so matches wherever any of the three would). This appears to be the case here.

But, if you wanted to correctly identify which word matched then you must have longer words first,

solely|sole|so

so that a string solely is correctly matched by its word before it can be "stolen" by so. Then in this case you'd want it the other way round, sort { length $b <=> length $a }
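A short sketch of how the order changes which word is reported. The helper first_matched_word and its longest_first flag are hypothetical names introduced only for this example.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical helper: return the exclusion word that the alternation matches
# first, with the words sorted either longest-first or shortest-first.
sub first_matched_word {
    my ($string, $longest_first, @words) = @_;

    my @sorted;
    if ($longest_first) {
        @sorted = sort { length $b <=> length $a } @words;
    }
    else {
        @sorted = sort { length $a <=> length $b } @words;
    }

    my $alternation = join '|', map { quotemeta } @sorted;
    return $string =~ /($alternation)/ ? $1 : undef;
}

my @words = qw(sole so solely);

say "shortest first: ", first_matched_word('solely', 0, @words);
say "longest first:  ", first_matched_word('solely', 1, @words);
```

With the shortest word first the match on solely is reported as so, while with the longest first it is reported as solely; whether the string is excluded is the same either way.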
