Use Python to search lines of file for list entries

Question

I have a text file with tens of thousands of lines of ASCII text. I have a list of a few hundred keywords that I want to search for, considering each line individually. Initially, I want to return the line (print it to the screen or to a file) if there are any matches, but eventually I'd like to rank or order the returned lines by how many matches each one contains.

So, my list is something like this:

keywords = ['one', 'two', 'three']

And my thinking is something like this:

myfile = open('file.txt')
for line in myfile:
    if keywords in line:
        print line

But taking this from pseudocode to working code is not happening.

I've also thought of using RegEx:

print re.findall(keywords, myfile.read())

But that leads me down a path of different errors and problems.

If anyone can offer some guidance, syntax or code snippets, I would be grateful.

Recommended answer

You can't test whether a list is in a string. What you can do is test whether one string is in another string.

lines = ['this is a line without any keywords', 
         'this is a line with one', 
         'this is a line with one and two',
         'this is a line with three']
keywords = ['one', 'two', 'three']

for line in lines:
    for word in keywords:
        if word in line:
            print(line)
            break

The break is necessary to exit the inner loop over keywords as soon as the first word matches. Otherwise the line would be printed once for every keyword it contains.
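
As an alternative to the explicit break, the same "stop at the first match" behaviour can be sketched with the built-in any(), which short-circuits on the first true value. The name matching_lines is introduced here only for illustration; the sample data is the same as above.

```python
lines = ['this is a line without any keywords',
         'this is a line with one',
         'this is a line with one and two',
         'this is a line with three']
keywords = ['one', 'two', 'three']

# any() stops checking keywords as soon as one is found in the line,
# so each matching line is collected exactly once.
matching_lines = [line for line in lines
                  if any(word in line for word in keywords)]

for line in matching_lines:
    print(line)
```

This prints the three lines that contain at least one keyword and skips the first line.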

The regex solution has the same problem: re.findall expects a single pattern string, not a list. You can either use the same approach as above and add a loop over the words, or you can construct a regex that matches any of the words. See the Python regex syntax documentation.

import re

for line in lines:
    # '|' means "or": the pattern matches any one of the three words
    matches = re.findall('one|two|three', line)
    if matches:
        print(line, len(matches))

Note that re.findall returns an empty list if there are no matches and a list of all the matches otherwise. So we can test the result directly in the if condition, because an empty list evaluates to False.

You can also easily generate the regex pattern for these simple cases:

pattern = '|'.join(keywords)
print(pattern)
# one|two|three
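
A plain join works for simple keywords, but it also matches inside longer words and would misbehave if a keyword contained regex metacharacters. The sketch below is one possible hardening, assuming whole-word matches are what is wanted (the question does not say): re.escape quotes each keyword and \b anchors the match at word boundaries.

```python
import re

keywords = ['one', 'two', 'three']

# re.escape guards against keywords that contain regex metacharacters.
# \b restricts matches to whole words (an assumption, not a stated requirement).
pattern = r'\b(?:' + '|'.join(re.escape(word) for word in keywords) + r')\b'

# 'someone' contains 'one' but is not matched as a whole word.
print(re.findall(pattern, 'someone had one or two ideas'))
# ['one', 'two']
```

With the plain 'one|two|three' pattern, the same sentence would also report the 'one' inside 'someone'.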


To sort them, you can simply put them in a list of tuples and use the key argument of sorted.

results = []
for line in lines:
    matches = re.findall('one|two|three', line)
    if matches:
        results.append((line, len(matches)))

results = sorted(results, key=lambda x: x[1], reverse=True)

You can read the documentation for sorted, but in short the key argument supplies a function that computes the value to sort by. In this case, we extract the second element of each tuple, which is where we stored the number of matches for that line, and sort the list by it. Passing reverse=True puts the lines with the most matches first.
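
To make the effect of key and reverse concrete, here is a small sketch with hand-written tuples (hypothetical data, not the output of a real search; ranked is a name chosen for illustration). Python's sort is stable, so lines with the same count keep their original relative order.

```python
# (line, number_of_matches) pairs, written by hand for illustration
results = [('this is a line with one', 1),
           ('this is a line with one and two', 2),
           ('this is a line with three', 1)]

# Sort by the second tuple element, highest count first.
ranked = sorted(results, key=lambda x: x[1], reverse=True)

for line, num_matches in ranked:
    print(num_matches, line)
```

The line with two matches comes first, followed by the two single-match lines in the order they originally appeared.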

This is how you might apply this to an actual file and save the results.

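import re

keywords = ['one', 'two', 'three']
pattern = '|'.join(keywords)

# Collect (line, match count) for every line with at least one match
results = []
with open('myfile.txt', 'r') as f:
    for line in f:
        matches = re.findall(pattern, line)
        if matches:
            # Strip the trailing newline so it is not written twice below
            results.append((line.rstrip('\n'), len(matches)))

# Lines with the most matches come first
results = sorted(results, key=lambda x: x[1], reverse=True)

with open('results.txt', 'w') as f:
    for line, num_matches in results:
        f.write('{}  {}\n'.format(num_matches, line))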

You can read up on the with statement and context managers, but in this situation it essentially ensures that the file is closed once you're done with it.
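
The closing behaviour can be checked with a minimal sketch. The scratch file path below is hypothetical and is created in a temporary directory only for this demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('this is a line with one\n')
    print(f.closed)  # False: the file is still open inside the block

print(f.closed)  # True: the with statement closed it on exit
```

The file is closed when the block exits, including when an exception is raised inside it, which is why with is usually preferred over calling close() by hand.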
