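Whole-word matching on a body of text, given a list of words

Problem description

Background:

I have a list of words in a file called words.txt (one word per line). I would like to find all lines from a different, much larger file called file.txt that contain any of the words from words.txt. However, I only want whole-word matches. This means that a match should be made when a line from file.txt contains at least one instance where a word from words.txt is found "all by itself" (I know this is vague, so allow me to explain).

In other words, a match should be made when:

- The word is all by itself on a line
- The word is surrounded by non-alphanumeric/non-hyphen characters
- The word is at the beginning of a line and followed by a non-alphanumeric/non-hyphen character
- The word is at the end of a line and preceded by a non-alphanumeric/non-hyphen character

For example, if one of the words in words.txt is cat, I would like it to behave as follows:

cat #=> match
cat cat cat #=> match
the cat is gray #=> match
mouse,cat,dog #=> match
caterpillar cat #=> match
caterpillar #=> no match
concatenate #=> no match
bobcat #=> no match
catcat #=> no match
cat100 #=> no match
cat-in-law #=> no match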
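
The four rules can be read as a single extended regular expression: the word must be preceded by the start of the line or by a character that is not a letter, digit, or hyphen, and followed by the end of the line or by such a character. The following is a minimal sketch for the single word cat; the helper name classify is hypothetical and only serves to check the examples one line at a time.

```shell
# Hypothetical helper: prints "match" or "no match" for one line of text,
# applying the four rules to the single word "cat".
# A boundary is the start/end of the line or any character that is
# not a letter, a digit, or a hyphen.
classify() {
    if printf '%s\n' "$1" | grep -Eq '(^|[^a-zA-Z0-9-])cat([^a-zA-Z0-9-]|$)'; then
        echo "match"
    else
        echo "no match"
    fi
}

classify "mouse,cat,dog"    # prints: match
classify "cat-in-law"       # prints: no match
```

Because the hyphen sits inside the excluded character class, cat-in-law is rejected while mouse,cat,dog is accepted.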

Previous research:

There's a grep command that almost suits my needs. It is as follows:

grep -wf words.txt file.txt

where the options are:

-w, --word-regexp
    Select only those lines containing matches that form whole words.
    The test is that the matching substring must either be at the beginning
    of the line, or preceded by a non-word constituent character.
    Similarly, it must be either at the end of the line or followed by a
    non-word constituent character. Word-constituent characters are
    letters, digits, and the underscore.
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line. The empty file contains
    zero patterns, and therefore matches nothing.

The big problem I'm having with this is that it treats a hyphen (i.e. -) as a "non-word constituent character". Therefore (based on the example above) doing a whole-word search for cat will return cat-in-law, which is not what I want.

I realize that the -w option probably achieves the desired effect for many people. However, in my particular case, if a word (e.g. cat) is followed/preceded by a hyphen, then I need to treat it as if it's part of a larger word (e.g. cat-in-law) and not an instance of the word by itself.
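
The hyphen behaviour of -w is easy to reproduce. The sketch below assumes a throwaway directory created with mktemp and a three-line sample taken from the examples; the variable names tmpdir and w_result are illustrative only.

```shell
# Build a tiny word list and text file in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' > "$tmpdir/words.txt"
printf '%s\n' 'the cat is gray' 'bobcat' 'cat-in-law' > "$tmpdir/file.txt"

# -w treats "-" as a non-word character, so "cat-in-law" is reported
# alongside the genuine whole-word hit; "bobcat" is correctly skipped.
w_result=$(grep -wf "$tmpdir/words.txt" "$tmpdir/file.txt")
printf '%s\n' "$w_result"

rm -rf "$tmpdir"
```

The output contains both "the cat is gray" and "cat-in-law", which is the unwanted result.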
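
Additionally, I know I could alter words.txt to contain regular expressions instead of fixed strings and then use:

grep -Ef words.txt file.txt

where

-E, --extended-regexp
    Interpret PATTERN as an extended regular expression

However, I would like to avoid altering words.txt and keep it free of regex patterns.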
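
For illustration, this is roughly what the word list would have to look like under that rejected approach. The pattern shown is an assumption about how the boundaries could be hand-written, and the temporary file layout is hypothetical.

```shell
tmpdir=$(mktemp -d)

# Hypothetical regex-laden word list: every entry carries its own boundaries,
# which is exactly what makes the file hard to read and maintain.
printf '%s\n' '(^|[^a-zA-Z0-9-])cat([^a-zA-Z0-9-]|$)' > "$tmpdir/words.txt"
printf '%s\n' 'mouse,cat,dog' 'cat-in-law' 'catcat' > "$tmpdir/file.txt"

# -E makes grep read each line of words.txt as an extended regular expression.
e_result=$(grep -Ef "$tmpdir/words.txt" "$tmpdir/file.txt")
printf '%s\n' "$e_result"   # prints: mouse,cat,dog

rm -rf "$tmpdir"
```

The matching is right, but each plain word has turned into a line of punctuation.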

Question:

Is there a simple bash command that will allow me to give it a list of words and perform whole-word matching on a body of text?

I finally came up with a solution:

grep -Ef <(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' words.txt) file.txt

Explanation:

- words.txt is my list of words (one per line).
- file.txt is the body of text that I would like to search.
- The awk command will preprocess words.txt on-the-fly, wrapping each word in a special regular expression to define its official beginning and ending (based on the specifications posted in my question above).
- The awk command is surrounded by <( and ) so that its output is used as the input for the -f option.
- I'm using the -E option because I'm now inputting a list of regular expressions instead of fixed strings from words.txt.

The nice thing here is that words.txt can remain human-readable and doesn't have to contain a bunch of regex patterns.
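
To see what the awk step produces, the same idea can be run in two explicit steps. This sketch swaps the process substitution for an intermediate file (the name patterns.txt is hypothetical), which may also be useful in shells that lack <( ); the sample words and lines are assumptions based on the examples in the question. Note that any regex metacharacters in words.txt (such as . or *) would still be interpreted as regex syntax.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' 'dog' > "$tmpdir/words.txt"
printf '%s\n' 'cat' 'mouse,cat,dog' 'caterpillar' 'cat-in-law' 'hot-dog' 'dog100' \
    > "$tmpdir/file.txt"

# Step 1: wrap every word in the boundary expression (same awk program as in the answer).
awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' "$tmpdir/words.txt" \
    > "$tmpdir/patterns.txt"
cat "$tmpdir/patterns.txt"

# Step 2: feed the generated patterns to grep as extended regular expressions.
matches=$(grep -Ef "$tmpdir/patterns.txt" "$tmpdir/file.txt")
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

With this sample, only the lines "cat" and "mouse,cat,dog" are printed by the second step; the hyphenated and digit-suffixed lines are filtered out.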