Whole-word matching on a body of text, given a list of words

Problem description

    Note:

    Before I get down to business, I'd like to point out some other SO posts that didn't quite answer my question and are not duplicates of this one:

    Background:

    I have a list of words in a file called words.txt (one word per line). I would like to find all lines from a different, much larger file called file.txt that contain any of the words from words.txt. However, I only want whole-word matches. This means that a match should be made when a line from file.txt contains at least one instance where a word from words.txt is found "all by itself" (I know this is vague, so allow me to explain).

    In other words, a match should be made when:

    1. The word is all by itself on a line
    2. The word is surrounded by non-alphanumeric/non-hyphen characters
    3. The word is at the beginning of a line and followed by a non-alphanumeric/non-hyphen character
    4. The word is at the end of a line and preceded by a non-alphanumeric/non-hyphen character

    For example, if one of the words in words.txt is cat, I would like it to behave as follows:

    cat              #=> match
    cat cat cat      #=> match
    the cat is gray  #=> match
    mouse,cat,dog    #=> match
    caterpillar cat  #=> match
    caterpillar      #=> no match
    concatenate      #=> no match
    bobcat           #=> no match
    catcat           #=> no match
    cat100           #=> no match
    cat-in-law       #=> no match
    

    Previous research:

    There's a grep command that almost suits my needs. It is as follows:

    grep -wf words.txt file.txt

    where the options are:

    -w, --word-regexp
           Select only those lines containing matches that form whole words.
           The test is that the matching substring must either be at the beginning
           of the line, or preceded by a non-word constituent character.
           Similarly, it must be either at the end of the line or followed by a
           non-word constituent character. Word-constituent characters are
           letters, digits, and the underscore.
    -f FILE, --file=FILE
           Obtain patterns from FILE, one per line. The empty file contains
           zero patterns, and therefore matches nothing.
    

    The big problem I'm having with this is that it treats a hyphen (i.e. -) as a "non-word constituent character". Therefore (based on the example above) doing a whole-word search for cat will return cat-in-law, which is not what I want.

    I realize that the -w option probably achieves the desired effect for many people. However, in my particular case, if a word (e.g. cat) is followed/preceded by a hyphen, then I need to treat it as if it's part of a larger word (e.g. cat-in-law) and not an instance of the word by itself.
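To make the hyphen behavior concrete, here is a minimal sketch. The scratch directory and the three sample lines are assumptions chosen for illustration; only the `grep -wf` invocation comes from the post.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample data, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' > "$tmpdir/words.txt"
printf '%s\n' 'the cat is gray' 'cat-in-law' 'caterpillar' > "$tmpdir/file.txt"

# -w counts only letters, digits, and the underscore as word characters,
# so the hyphen in "cat-in-law" acts as a word boundary and the line matches.
w_matches=$(grep -wf "$tmpdir/words.txt" "$tmpdir/file.txt")
printf '%s\n' "$w_matches"
# prints:
#   the cat is gray
#   cat-in-law

rm -rf "$tmpdir"
```

The `caterpillar` line is correctly rejected; the unwanted result is `cat-in-law`.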

    Additionally, I know I could alter words.txt to contain regular expressions instead of fixed strings and then use:

    grep -Ef words.txt file.txt

    where

    -E, --extended-regexp
                  Interpret PATTERN as an extended regular expression
    

    However, I would like to avoid altering words.txt and keep it free of regex patterns.

    Question:

    Is there a simple bash command that will allow me to give it a list of words and perform whole-word matching on a body of text?

    Solution

    I finally came up with a solution:

    grep -Ef <(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' words.txt) file.txt
    

    Explanation:

    • words.txt is my list of words (one per line).
    • file.txt is the body of text that I would like to search.
    • The awk command will preprocess words.txt on-the-fly, wrapping each word in a special regular expression to define its official beginning and ending (based on the specifications posted in my question above).
    • The awk command is surrounded by <( and ) so that its output is used as the input for the -f option.
    • I'm using the -E option because I'm now inputting a list of regular expressions instead of fixed strings from words.txt.
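The steps above can be checked against the example lines from the question. The sketch below is a variant, not the exact command: it writes the generated patterns to a temporary file (a hypothetical `patterns.txt`) instead of using `<( )`, so it also runs in shells without process substitution. The scratch directory and file names are assumptions.

```shell
#!/bin/sh
# Hypothetical scratch directory; the sample lines are the question's examples.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' > "$tmpdir/words.txt"
printf '%s\n' 'cat' 'cat cat cat' 'the cat is gray' 'mouse,cat,dog' \
  'caterpillar cat' 'caterpillar' 'concatenate' 'bobcat' 'catcat' \
  'cat100' 'cat-in-law' > "$tmpdir/file.txt"

# Wrap each word so it must be bounded by line start/end or by a
# character that is neither alphanumeric nor a hyphen.
awk '{print "([^a-zA-Z0-9-]|^)" $0 "([^a-zA-Z0-9-]|$)"}' \
  "$tmpdir/words.txt" > "$tmpdir/patterns.txt"

patterns=$(cat "$tmpdir/patterns.txt")
printf '%s\n' "$patterns"
# prints: ([^a-zA-Z0-9-]|^)cat([^a-zA-Z0-9-]|$)

# Use the generated patterns as extended regular expressions.
matches=$(grep -Ef "$tmpdir/patterns.txt" "$tmpdir/file.txt")
printf '%s\n' "$matches"
# prints the first five sample lines only

rm -rf "$tmpdir"
```

Note that the underscore is not in the bracket expression, so a line such as `cat_food` would still count as a match under these rules.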

    The nice thing here is that words.txt can remain human-readable and doesn't have to contain a bunch of regex patterns.
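One caveat: because each wrapped line is handed to `grep -E`, a word that itself contains regex metacharacters (for example `c++`) would be interpreted as a pattern rather than as a fixed string. The following sketch adds an escaping step with `sed` before the `awk` wrapping. This step is an addition, not part of the original answer, and the sample data, scratch directory, and `patterns.txt` file are assumptions.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample data, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' 'c++' > "$tmpdir/words.txt"
printf '%s\n' 'I like c++ a lot' 'c++-style casts' 'ccc' > "$tmpdir/file.txt"

# Backslash-escape ERE metacharacters in each word, then wrap the word
# in the same boundary expressions used by the original command.
sed 's/[][\.|$(){}?+*^]/\\&/g' "$tmpdir/words.txt" |
  awk '{print "([^a-zA-Z0-9-]|^)" $0 "([^a-zA-Z0-9-]|$)"}' \
  > "$tmpdir/patterns.txt"

escaped_matches=$(grep -Ef "$tmpdir/patterns.txt" "$tmpdir/file.txt")
printf '%s\n' "$escaped_matches"
# prints: I like c++ a lot

rm -rf "$tmpdir"
```

With this step words.txt still holds plain words; the escaping happens only in the generated pattern list.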
