Use Python to search one .txt file for a list of words or phrases (and show the context)


Question

Basically as the title states. I am fairly new to Python and like to learn by seeing and doing.

I would like to create a script that searches through a text document (say, text copied and pasted from a news article) for certain words or phrases. Ideally, the list of words and phrases would be stored in a separate file.

It would be great to get the context of each result as well. So maybe the script could print the 50 characters before and after each search term it finds in the text file. It'd be cool if it also showed which line the search term was found on.

Any pointers on how to code this, or even code examples, would be much appreciated.

Recommended answer

Despite the antipathy toward regular expressions that many in the Python community frequently express, they are a precious tool for the appropriate use cases, which definitely include identifying words and phrases, thanks to the \b "word boundary" element in regular expression patterns. Alternatives based on plain string processing are much more of a problem: for example, .split() uses whitespace as the separator and thus annoyingly leaves punctuation attached to the adjacent words.

If REs are OK, I would recommend something like:

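import re
import sys

def main():
  # Expect exactly two arguments: the phrase list and the text to search.
  if len(sys.argv) != 3:
    print("Usage: %s fileofstufftofind filetofinditin" % sys.argv[0])
    sys.exit(1)

  # Build one whole-word pattern per non-blank line of the phrase file.
  with open(sys.argv[1]) as f:
    patterns = [r'\b%s\b' % re.escape(s.strip()) for s in f if s.strip()]
  if not patterns:
    sys.exit("No words or phrases to search for")
  there = re.compile('|'.join(patterns))

  # Report every line that contains at least one phrase (numbered from 1).
  with open(sys.argv[2]) as f:
    for i, s in enumerate(f, 1):
      if there.search(s):
        print("Line %s: %r" % (i, s))

if __name__ == "__main__":
  main()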

The first argument is the path of a text file with the words or phrases to find, one per line, and the second argument is the path of the text file in which to find them. If desired, it's easy to make the search case-insensitive (perhaps optionally, based on a command-line switch), and so on.
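The question also asks for about 50 characters of context around each hit, and case-insensitive matching is mentioned above as an easy extension. The following is only a rough sketch of how both might be done, not part of the original answer: the names build_pattern, find_with_context, width and ignore_case are made up for illustration, and the sketch assumes the whole text fits in memory as one string.

```python
import re

def build_pattern(phrases, ignore_case=False):
    """Compile one alternation of whole-word patterns (hypothetical helper)."""
    cleaned = [p.strip() for p in phrases if p.strip()]
    if not cleaned:
        raise ValueError("no words or phrases to search for")
    alternation = "|".join(r"\b%s\b" % re.escape(p) for p in cleaned)
    return re.compile(alternation, re.IGNORECASE if ignore_case else 0)

def find_with_context(text, pattern, width=50):
    """Yield (line_number, matched_text, context) for every match in text."""
    for m in pattern.finditer(text):
        # Line number = newlines before the match, counted from 1.
        line_number = text.count("\n", 0, m.start()) + 1
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        # Flatten newlines so each context prints on a single line.
        context = text[start:end].replace("\n", " ")
        yield line_number, m.group(0), context

# Small in-memory sample; a real run would read both files instead.
sample = "The cat sat.\nAn underdog story.\nMy Dog barked at the catalog."
pattern = build_pattern(["cat", "dog"], ignore_case=True)
for line_number, word, context in find_with_context(sample, pattern, width=10):
    print("Line %d: %r in %r" % (line_number, word, context))
```

To use it on real files, pass the lines of the phrase file as phrases and the result of read() on the text file as text. Because the context is cut from the whole text rather than from a single line, it can span line breaks.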

Some explanation for readers who are not familiar with REs:

The \b item in each entry of patterns ensures that there will be no accidental matches: if you're searching for "cat" or "dog", you won't see an accidental hit on "catalog" or "underdog". You also won't miss a hit in "The cat, smiling, ran away" because some splitting step thinks that the word there is "cat," including the comma ;-).
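A small illustration of that difference, using a made-up sentence (the variable names are just for this sketch):

```python
import re

sentence = "The cat, smiling, ran away from the catalog."

# Whitespace splitting keeps punctuation glued to the word: the token is "cat,".
tokens = sentence.split()
print("cat" in tokens)   # False

# The word-boundary pattern finds the stand-alone word and skips "catalog".
matches = re.findall(r"\bcat\b", sentence)
print(matches)           # ['cat']
```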

The | item means "or", so, for example, from a text file with these contents (two lines):

cat
dog

this will form the pattern '\bcat\b|\bdog\b', which will locate either "cat" or "dog" as stand-alone words, ignoring punctuation but rejecting hits within longer words.
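As a quick check of that behaviour, the same pattern can be built from an in-memory list instead of a file. The candidate strings below are invented for illustration:

```python
import re

words = ["cat", "dog"]
combined = "|".join(r"\b%s\b" % re.escape(w) for w in words)
print(combined)  # \bcat\b|\bdog\b

finder = re.compile(combined)
# Stand-alone words match even next to punctuation; longer words do not.
for candidate in ["a dog's life", "cat!", "the underdog", "catalog"]:
    print("%-14s %s" % (candidate, bool(finder.search(candidate))))
```

The first two candidates should report True and the last two False.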

re.escape escapes punctuation so that it is matched literally, rather than with the special meaning it would normally have in an RE pattern.
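To see what escaping buys you, here is a short sketch with a phrase that contains dots. The phrase and test strings are made up for illustration:

```python
import re

phrase = "U.S. dollar"

# Unescaped, each "." matches any character, so unrelated text can match.
unescaped = re.compile(phrase)
print(bool(unescaped.search("UxSx dollar")))        # True

# Escaped, the dots are literal and only the real phrase matches.
escaped = re.compile(re.escape(phrase))
print(bool(escaped.search("UxSx dollar")))          # False
print(bool(escaped.search("the U.S. dollar fell"))) # True

# Caveat: \b needs a word character on one side, so a phrase that ends in
# punctuation does not match when followed by a space.
bounded = re.compile(r"\b%s\b" % re.escape("U.S."))
print(bool(bounded.search("the U.S. dollar")))      # False
```

The last case suggests that search phrases which begin or end with punctuation may need different handling than the \b wrapping used in the script.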
