Extracting parts of text between specific delimiters from a large text file with custom delimiters and writing it to another file using Python

Problem description

I'm working on a project that involves creating a database of the US federal code in a certain format. I've obtained the whole code from the official source, which is not structured well. I have managed to scrape the US Code into text files in the format below, using some code on GitHub.

-CITE-
    13 USC Sec. 1                                               1/15/2013

-EXPCITE-
    TITLE 13 - CENSUS
    CHAPTER 1 - ADMINISTRATION
    SUBCHAPTER I - GENERAL PROVISIONS

-HEAD-
    Sec. 1. Definitions

-STATUTE-
      As used in this title, unless the context requires another
    meaning or unless it is otherwise provided - 
        (1) "Bureau" means the Bureau of the Census;
        (2) "Secretary" means the Secretary of Commerce; and
        (3) "respondent" includes a corporation, company, association,
      firm, partnership, proprietorship, society, joint stock company,
      individual, or other organization or entity which reported
      information, or on behalf of which information was reported, in
      response to a questionnaire, inquiry, or other request of the
      Bureau.

-SOURCE-
    (Aug. 31, 1954, ch. 1158, 68 Stat. 1012; Pub. L. 94-521, Sec. 1,
    Oct. 17, 1976, 90 Stat. 2459.)


-MISC1-
                      <some text>

-End-


-CITE-
    13 USC Sec. 2                                               1/15/2013

-EXPCITE-
    TITLE 13 - CENSUS
    CHAPTER 1 - ADMINISTRATION
    SUBCHAPTER I - GENERAL PROVISIONS

-HEAD-
    Sec. 2. Bureau of the Census

-STATUTE-
      The Bureau is continued as an agency within, and under the
    jurisdiction of, the Department of Commerce.

-SOURCE-
    (Aug. 31, 1954, ch. 1158, 68 Stat. 1012.)


-MISC1-
                      <some text>

-End-

Each text file contains thousands of such blocks, each starting with a -CITE- tag and ending with an -End- tag.

Apart from these, there are certain blocks that represent the start of a chapter or subchapter, and these do not contain a -STATUTE- tag.

For example:

-CITE-
    13 USC CHAPTER 3 - COLLECTION AND PUBLICATION OF
           STATISTICS                                      1/15/2013

-EXPCITE-
    TITLE 13 - CENSUS
    CHAPTER 3 - COLLECTION AND PUBLICATION OF STATISTICS

-HEAD-
           CHAPTER 3 - COLLECTION AND PUBLICATION OF STATISTICS       


-MISC1-
                           SUBCHAPTER I - COTTON                       
    Sec.                                                     
    41.         Collection and publication.                           
    42.         Contents of reports; number of bales of linter;
                 distribution; publication by Department of
                 Agriculture.                                         
    43.         Records and reports of cotton ginners.                

       SUBCHAPTER II - OILSEEDS, NUTS, AND KERNELS; FATS, OILS, AND
                                  GREASES
    61.         Collection and publication.                           
    62.         Additional statistics.                                
    63.         Duplicate collection of statistics prohibited; access
                 to available statistics.                             

                   SUBCHAPTER III - APPAREL AND TEXTILES               
    81.         Statistics on apparel and textile industries.         

              SUBCHAPTER IV - QUARTERLY FINANCIAL STATISTICS          
    91.         Collection and publication.                           

                       SUBCHAPTER V - MISCELLANEOUS                   
    101.        Defective, dependent, and delinquent classes; crime.  
    102.        Religion.                                             
    103.        Designation of reports.                               

                                AMENDMENTS                            
      <some text>

-End-

I am interested only in those blocks that have a -STATUTE- tag.

Is there a way to extract only the blocks of text that have the -STATUTE- tag and write them to another text file?

I'm new to Python, but I'm told this can be done easily in Python.

I'd appreciate it if someone could guide me.

Recommended answer

I'd read the text line by line and parse it myself. This way you can handle large input as a stream. There are nicer-looking solutions using multiline regexps, but those generally need the input loaded as one string, so they cannot handle it as a stream.

#!/usr/bin/env python3

import re
import sys

# states for our state machine:
OUTSIDE = 0               # not inside any -CITE- ... -End- block
INSIDE = 1                # inside a block, no -STATUTE- seen yet
INSIDE_AFTER_STATUTE = 2  # inside a block that contains -STATUTE-

def eachCite(stream):
  """Yield (withStatute, blockText) for every -CITE- ... -End- block."""
  state = OUTSIDE
  capture = ''
  # count lines from 1 so error messages match what an editor shows
  for lineNumber, line in enumerate(stream, 1):
    # collect every line of the current block, including the -End- line
    if state in (INSIDE, INSIDE_AFTER_STATUTE):
      capture += line
    if re.match('^-CITE-', line):
      if state == OUTSIDE:
        state = INSIDE
        capture = line  # start a new block with the -CITE- line
      elif state in (INSIDE, INSIDE_AFTER_STATUTE):
        raise Exception("-CITE- in -CITE-??", lineNumber)
      else:
        raise NotImplementedError(state)
    elif re.match('^-End-', line):
      if state == OUTSIDE:
        raise Exception("-End- without -CITE-??", lineNumber)
      elif state == INSIDE:
        yield False, capture  # block without -STATUTE-
        state = OUTSIDE
      elif state == INSIDE_AFTER_STATUTE:
        yield True, capture   # block with -STATUTE-
        state = OUTSIDE
      else:
        raise NotImplementedError(state)
    elif re.match('^-STATUTE-', line):
      if state == OUTSIDE:
        raise Exception("-STATUTE- without -CITE-??", lineNumber)
      elif state == INSIDE:
        state = INSIDE_AFTER_STATUTE
      elif state == INSIDE_AFTER_STATUTE:
        raise Exception("-STATUTE- after -STATUTE-??", lineNumber)
      else:
        raise NotImplementedError(state)
  if state != OUTSIDE:
    raise Exception("EOF in -CITE-??")

# print only the blocks that contain a -STATUTE- tag
for withStatute, cite in eachCite(sys.stdin):
  if withStatute:
    print("found cite with statute:")
    print(cite)
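
If the error checks are not needed, the same streaming idea can be written more compactly. The following is only a sketch, not part of the original answer: the name iter_statute_blocks and the in-memory sample are made up for illustration, and it assumes every marker starts at the beginning of a line, as in the samples above. It buffers lines in a list and joins them once per block, which avoids repeated string concatenation on long blocks.

```python
import io

def iter_statute_blocks(stream):
    """Yield each -CITE- ... -End- block that contains a -STATUTE- line."""
    block = None         # list of lines while inside a block, otherwise None
    has_statute = False
    for line in stream:
        if line.startswith('-CITE-'):
            # start a new block (any unfinished block is silently dropped)
            block = [line]
            has_statute = False
        elif block is not None:
            block.append(line)
            if line.startswith('-STATUTE-'):
                has_statute = True
            elif line.startswith('-End-'):
                if has_statute:
                    yield ''.join(block)
                block = None

# Hypothetical two-block sample: only the first block has a -STATUTE- tag.
sample = io.StringIO(
    "-CITE-\n    13 USC Sec. 1\n-STATUTE-\n    As used in this title\n-End-\n"
    "\n"
    "-CITE-\n    13 USC CHAPTER 3\n-HEAD-\n    CHAPTER 3\n-End-\n"
)
blocks = list(iter_statute_blocks(sample))
print(len(blocks))
```

Unlike eachCite, this variant does not report malformed input such as a missing -End-, so it is best suited to files that are already known to be well formed.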

To process a file instead of sys.stdin, you can do it like this:

with open('myInputFileName') as myInputFile, \
     open('myOutputFileName', 'w') as myOutputFile:
  for withStatute, cite in eachCite(myInputFile):
    if withStatute:
      myOutputFile.write("found cite with statute:\n")
      myOutputFile.write(cite)
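
For comparison, the multiline-regexp approach mentioned at the start of the answer might look like the sketch below. It is an illustration under assumptions rather than a recommended replacement: the names BLOCK_RE, STATUTE_RE, and statute_blocks are invented here, the sample text is made up, and the whole input has to be held in memory as a single string, which is exactly the drawback noted above.

```python
import re

# One block: from a line starting with -CITE- up to and including the
# next line starting with -End-. DOTALL lets '.' cross line breaks,
# MULTILINE lets '^' match at the start of every line.
BLOCK_RE = re.compile(r'^-CITE-.*?^-End-[^\n]*\n?', re.MULTILINE | re.DOTALL)
STATUTE_RE = re.compile(r'^-STATUTE-', re.MULTILINE)

def statute_blocks(text):
    """Return all -CITE- ... -End- blocks in text that contain -STATUTE-."""
    return [m.group(0)
            for m in BLOCK_RE.finditer(text)
            if STATUTE_RE.search(m.group(0))]

# Hypothetical sample: the second block has no -STATUTE- tag.
sample_text = (
    "-CITE-\n    13 USC Sec. 2\n-STATUTE-\n    The Bureau is continued\n-End-\n"
    "\n"
    "-CITE-\n    13 USC CHAPTER 3\n-HEAD-\n    CHAPTER 3\n-End-\n"
)
found = statute_blocks(sample_text)
print(len(found))
```

For files with thousands of blocks, the line-by-line parser keeps memory use bounded by the largest single block, while this version trades that for brevity.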
