Regular expression to extract chunks of text from a text file?

Problem description

I need to extract headings and the chunk of text beneath them from a text file in Python using regular expression but I'm finding it difficult.

I converted this PDF to text so that it now looks like this:

So far I have been able to get all the numerical headers (12.4.5.4, 12.4.5.6, 13, 13.1, 13.1.1, 13.1.12) using the following regex:

import re

with open('data/single.txt', encoding='UTF-8') as file:
    for line in file:
        headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
        print(headings)

I just don't know how to get the worded part of those headings or the paragraph of text beneath them.

EDIT - Here is the text:

I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014

60601-1 © IEC:2005 60601-1 © IEC:2005

– 337 – – 169 –

12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

12.4.6 Diagnostic or therapeutic acoustic pressure When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with diagnostic or therapeutic acoustic pressure.

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

13 * HAZARDOUS SITUATIONS and fault conditions

13.1 Specific HAZARDOUS SITUATIONS

  • General

13.1.1 When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the ME EQUIPMENT.

The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is described in 4.7.

  • Emissions, deformation of ENCLOSURE or exceeding maximum temperature

13.1.2 The following HAZARDOUS SITUATIONS shall not occur: – emission of flames, molten metal, poisonous or ignitable substance in hazardous

quantities;

– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; –

temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when measured as described in 11.1.3; temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be touched, exceeding the allowable values in Table 23 when measured and adjusted as described in 11.1.3;

– exceeding the allowable values for "other components and materials" identified in Table 22 times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. In all other cases, the allowable values of Table 22 apply.

Temperatures shall be measured using the method described in 11.1.3.

The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of flames, molten metal or ignitable substances, shall not be applied to parts and components where: – The construction or the supply circuit limits the power dissipation in SINGLE FAULT

CONDITION to less than 15 W or the energy dissipation to less than 900 J.

Solution

You could use your pattern and match a space after it followed by the rest of the line.

Then repeat matching all following lines that do not start with a heading.

^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*

  • ^\d+(?:\.\d+)* Your pattern to match a heading, followed by a space
  • .* Match any char except a newline 0+ times
  • (?: Non capturing group
    • \r?\n Match a newline
    • (?! Negative lookahead, assert what is directly to the right is not
      • \d+(?:\.\d+)* The heading pattern, followed by a space
    • ) Close lookahead
    • .* Match any char except a newline 0+ times
  • )* Close the non capturing group and repeat 0+ times to match all the lines
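
In Python, the pattern needs the re.MULTILINE flag so that ^ matches at the start of every line, and it has to run over the whole file contents rather than line by line. The following is a minimal sketch under those assumptions. The two capture groups, the name extract_sections, and the shortened sample string are additions for illustration and are not part of the original answer.

```python
import re

# The pattern from the answer, with two capture groups added for illustration:
# group 1 is the heading number, group 2 is the heading text plus the lines below it.
HEADING_BLOCK = re.compile(
    r'^(\d+(?:\.\d+)*) (.*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*)',
    re.MULTILINE,  # let ^ match at the start of each line, not only at the start of the string
)


def extract_sections(text):
    """Return a list of (heading_number, block_text) tuples found in text."""
    return [
        (match.group(1), match.group(2).strip())
        for match in HEADING_BLOCK.finditer(text)
    ]


# Shortened excerpt of the text from the question.
sample = (
    "60601-1 © IEC:2005\n"
    "\n"
    "12.4.6 Diagnostic or therapeutic acoustic pressure When applicable.\n"
    "\n"
    "Compliance is checked by inspection of the RISK MANAGEMENT FILE.\n"
    "\n"
    "13 * HAZARDOUS SITUATIONS and fault conditions\n"
    "\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
)

sections = extract_sections(sample)
for number, block in sections:
    print(number, '->', block)
```

For the file in the question, the same function could be called on file.read() instead of looping over the lines. Note that any body line that happens to start with a number followed by a space would be treated as a new heading, so the output is worth checking against the source document.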
