Python Regex - Extract text between (multiple) expressions in a textfile

Views: 59 | Published: 2021/7/6 20:47:38 | Tags: python regex text-mining text-extraction


Question

I am a Python beginner and would be very thankful if you could help me with my text extraction problem.

I want to extract all the text that lies between two expressions in a text file (the beginning and the end of a letter). Both the beginning and the end can be marked by several possible expressions, defined in the lists "letter_begin" and "letter_end" (e.g. "Dear", "to our", etc.). I want to run this over a batch of files; an example of such a text file is shown below. There, I want to extract all text from "Dear" through "Douglas". If no letter_end expression is found, the output should start at the letter_begin match and run to the very end of the text file.

The extracted text has to end after the "letter_end" match and before the first line with 20 characters or more (here, "Random text here as well", len=24).

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""

This is my code so far, but it is not flexible enough to capture the text between the expressions. Anything (lines, text, numbers, symbols, etc.) can appear before the "letter_begin" and after the "letter_end".

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


filename = "letter.txt"  # hypothetical path; point this at your own file
with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()  # read() already returns a str
    # record all text between the beginning and end expressions
    output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE)
    print(output)

Thank you for any help!

Recommended answer

You can use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This pattern produces a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}
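A small sketch of why the template doubles the braces: str.format treats {} as a placeholder, so a literal regex quantifier has to be written as {{0,2}} to come out as {0,2}. The shortened opening and closing lists here are illustrative only.

```python
# Shortened, illustrative alternations (not the full lists from the question)
openings = "dear|to our"
closings = "sincerely|yours"

# {} is filled in by format(); {{0,2}} is emitted as the literal quantifier {0,2}
template = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}"
regex = template.format(openings, closings)

print(regex)
```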

See the regex demo. Note that you should not use re.DOTALL with this pattern: with it, .* would also match newlines and run to the end of the text. The re.MULTILINE option is redundant as well, since the pattern contains no ^ or $ anchors.
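To see the effect of the flags, here is a minimal sketch on a made-up sample string. The text and the shortened pattern are assumptions for illustration, not taken from the question.

```python
import re

# Made-up sample: a short letter followed by unrelated footer lines
text = (
    "Dear all\nbody line\nBest regards\nDouglas\n\n"
    "Unrelated footer text that should be left out\nmore footer"
)
pattern = r"(?:dear)[\s\S]*?(?:best regards).*(?:\n.*){0,2}"

# Without DOTALL, ".*" stops at the end of each line
without_dotall = re.findall(pattern, text, re.IGNORECASE)
# With DOTALL, ".*" also crosses newlines and swallows the rest of the text
with_dotall = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)
# MULTILINE changes nothing because the pattern has no ^ or $
with_multiline = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)

print(without_dotall)
print(with_dotall)
print(with_multiline == without_dotall)
```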

Details

(?:dear|to our|estimated) - any of the three opening expressions
[\s\S]*? - any 0+ chars (including newlines), as few as possible
(?:sincerely|yours|best regards) - any of the three closing expressions
.* - any 0+ chars other than newline (the rest of the closing line)
(?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0+ chars other than newline (up to two further lines, e.g. the signature).
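One possible hardening of this pattern, not part of the original answer: escape each phrase with re.escape and wrap the alternations in \b word boundaries, so that a closing such as "yours" does not fire inside "yourself". The helper name build_letter_regex is hypothetical, and the sketch assumes every phrase starts and ends with a word character (otherwise \b would not behave as intended).

```python
import re

def build_letter_regex(begin_phrases, end_phrases):
    """Hypothetical helper: build the letter pattern with escaped phrases and word boundaries."""
    openings = "|".join(re.escape(p) for p in begin_phrases)
    closings = "|".join(re.escape(p) for p in end_phrases)
    return re.compile(
        r"\b(?:{})\b[\s\S]*?\b(?:{})\b.*(?:\n.*){{0,2}}".format(openings, closings),
        re.IGNORECASE,
    )

# Made-up sample in which "yourself" contains the closing "yours"
sample = "Dear Sir\nplease look after yourself.\nSincerely\nBob"
regex = build_letter_regex(["dear"], ["sincerely", "yours"])
match = regex.search(sample)
print(match.group(0))
```

Without the boundaries, the lazy [\s\S]*? would stop at "yours" inside "yourself" and the letter would be cut short.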

Python demo code:

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
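The regex above takes at most two lines after the closing and returns nothing when no closing is present. The question also asks for two things it does not cover: falling back to the end of the file when no letter_end matches, and stopping before the first line with 20 or more characters. A minimal sketch of one way to handle both follows; the function name extract_letter, its long_line parameter, and the shortened sample text are all assumptions for illustration.

```python
import re

def extract_letter(text, begin_phrases, end_phrases, long_line=20):
    """Hypothetical helper: return the letter text, or None if no opening is found."""
    begin = re.search("|".join(begin_phrases), text, re.IGNORECASE)
    if begin is None:
        return None
    # First closing after the opening; ".*" consumes the rest of that line
    end_re = re.compile("(?:{}).*".format("|".join(end_phrases)), re.IGNORECASE)
    end = end_re.search(text, begin.end())
    if end is None:
        # No closing expression: keep everything up to the end of the text
        return text[begin.start():]
    # Keep the following lines (e.g. a signature) until one is long_line chars or longer
    kept = []
    for line in text[end.end():].split("\n")[1:]:
        if len(line) >= long_line:
            break
        kept.append(line)
    return text[begin.start():end.end()] + "".join("\n" + line for line in kept)

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
sample = (
    "Some random text here\n\nDear Shareholders We\n"
    "are pleased to provide you with this report.\n"
    "Best regards \nDouglas\n\nRandom text here as well"
)
print(repr(extract_letter(sample, letter_begin, letter_end)))
```

To process a batch of files, the same function could be called on the contents of each file in turn.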
