Whole-word matching on a body of text, given a list of words

Views: 109 Published: 2018/5/28 19:32:03 Tags: regex string bash shell grep


Problem description

Note: Before I get down to business, I'd like to point out some other SO posts that didn't quite answer my question and are not duplicates of this one:

- How to grep with a list of words
- How to make grep only match if the entire line matches?
- how to grep for the whole word
- Grep extract only whole word

Background:

I have a list of words in a file called words.txt (one word per line). I would like to find all lines from a different, much larger file called file.txt that contain any of the words from words.txt. However, I only want whole-word matches.

This means that a match should be made when a line from file.txt contains at least one instance where a word from words.txt is found "all by itself" (I know this is vague, so allow me to explain). In other words, a match should be made when:

- The word is all by itself on a line
- The word is surrounded by non-alphanumeric/non-hyphen characters
- The word is at the beginning of a line and followed by a non-alphanumeric/non-hyphen character
- The word is at the end of a line and preceded by a non-alphanumeric/non-hyphen character

For example, if one of the words in words.txt is cat, I would like it to behave as follows:

    cat              #=> match
    cat cat cat      #=> match
    the cat is gray  #=> match
    mouse,cat,dog    #=> match
    caterpillar cat  #=> match
    caterpillar      #=> no match
    concatenate      #=> no match
    bobcat           #=> no match
    catcat           #=> no match
    cat100           #=> no match
    cat-in-law       #=> no match

Previous research:

There's a grep command that almost suits my needs. It is as follows:

    grep -wf words.txt file.txt

where the options are:

    -w, --word-regexp
        Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.

    -f FILE, --file=FILE
        Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.

The big problem I'm having with this is that it treats a hyphen (i.e. -) as a "non-word constituent character". Therefore (based on the example above) doing a whole-word search for cat will return cat-in-law, which is not what I want.

I realize that the -w option probably achieves the desired effect for many people. However, in my particular case, if a word (e.g. cat) is followed or preceded by a hyphen, then I need to treat it as part of a larger word (e.g. cat-in-law) and not as an instance of the word by itself.

Additionally, I know I could alter words.txt to contain regular expressions instead of fixed strings and then use:

    grep -Ef words.txt file.txt

where

    -E, --extended-regexp
        Interpret PATTERN as an extended regular expression

However, I would like to avoid altering words.txt and keep it free of regex patterns.

Question:

Is there a simple bash command that will allow me to give it a list of words and perform whole-word matching on a body of text?

Solution

I finally came up with a solution:

    grep -Ef <(awk '{print "([^a-zA-Z0-9-]|^)"$0"([^a-zA-Z0-9-]|$)"}' words.txt) file.txt

Explanation:

- words.txt is my list of words (one per line).
- file.txt is the body of text that I would like to search.
- The awk command preprocesses words.txt on the fly, wrapping each word in a regular expression that defines where the word officially begins and ends (based on the specifications in my question above).
- The awk command is surrounded by <( and ) so that its output is used as the input for the -f option.
- I'm using the -E option because I'm now passing a list of regular expressions instead of the fixed strings from words.txt.

The nice thing here is that words.txt can remain human-readable and doesn't have to contain a bunch of regex patterns.
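As a quick check of the command above, the following sketch recreates the example lines from the question in temporary files and runs the same awk-generated patterns through grep. It assumes a POSIX shell with mktemp and a grep that accepts `-f -` to read patterns from standard input (GNU grep does). The pipe is used in place of the `<( )` process substitution only so that the sketch also runs outside bash. The temporary directory and the `result` variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical scratch directory holding the two sample files.
tmpdir=$(mktemp -d)

# words.txt: one word per line (only "cat" here, as in the question).
printf '%s\n' 'cat' > "$tmpdir/words.txt"

# file.txt: the example lines from the question.
printf '%s\n' \
  'cat' \
  'cat cat cat' \
  'the cat is gray' \
  'mouse,cat,dog' \
  'caterpillar cat' \
  'caterpillar' \
  'concatenate' \
  'bobcat' \
  'catcat' \
  'cat100' \
  'cat-in-law' > "$tmpdir/file.txt"

# Wrap each word in the boundary pattern from the answer, then let grep
# read the resulting extended regexes from standard input (-f -).
result=$(awk '{print "([^a-zA-Z0-9-]|^)" $0 "([^a-zA-Z0-9-]|$)"}' "$tmpdir/words.txt" |
  grep -Ef - "$tmpdir/file.txt")

# Show the matching lines and clean up.
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With this sample data, only the five lines marked as matches in the question should be printed. Each line of words.txt is inserted into the pattern verbatim, so a word containing regex metacharacters (such as . or +) would be interpreted as a regular expression rather than as a literal string.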
