Extract text between two delimiters from a text file

Question

I'm currently writing my master's thesis about CEO narcissism. To measure it, I have to run a text analysis of earnings calls. Following the answers available in this link, I wrote Python code that extracts the Question and Answers section from an earnings call transcript. The file looks like this (it's called 'testoestratto.txt'):

..............................
Delimiter [1]
..............................
A text that I don't need
..............................
Delimiter CEO [2]
..............................
I need this text
..............................
Delimiter [3]
..............................

[...]

..............................
Delimiter CEO [n-1]
..............................
I also need this text
..............................
Delimiter [n]
..............................

I also have another text file ('lista.txt') containing all the delimiters that I extracted from the transcript:

Delimiter [1]
Delimiter CEO [2]
Delimiter [3]
[...]
Delimiter CEO [n-1]
Delimiter [n]

What I'd like to do is extract the text from 'testoestratto.txt' between Delimiter CEO [2] and Delimiter [3], ..., and between Delimiter CEO [n-1] and Delimiter [n]. The extracted text has to be written to 'test.txt'. So, if a delimiter from 'lista.txt' contains the word 'CEO', I need the text from 'testoestratto.txt' that lies between that delimiter and the next delimiter from 'lista.txt' that doesn't contain the word 'CEO'. To do so, I wrote the following code:

with open('testoestratto.txt','r', encoding='UTF-8') as infile, open('test.txt','a', encoding='UTF-8') as outfile, open('lista.txt', 'r', encoding='UTF-8') as mylist:
   text= mylist.readlines()
   text= [frase.strip('\n') for frase in text]
   bucket=[] 
   copy = False
   for i in range(len(text)):
      for line in infile:                         
          if line.strip()==text[i] and text[i].count('CEO')!=0 and text[i].count('CEO')!= -1:                                                          
              copy=True                          
          elif line.strip()== text[i+1] and text[i+1].count('CEO')==0 or text[i+1].count('CEO')==-1:
              for strings in bucket:
                  outfile.write(strings + '\n')
          elif copy:
              bucket.append(line.strip())

However, the 'test.txt' file is empty. Could you help me?

P.S.: I'm a beginner in Python, so I apologize if the code is messy.

Recommended answer

There are a few things that you need to change in your code.

Firstly, the key here is to reset the file position back to the start of the file after each full pass over it. A file object is exhausted once it has been iterated to the end, so without the reset the nested for loop reads no lines at all after the first iteration of the outer loop. You can do the reset with infile.seek(0).
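
To see why the rewind matters, here is a minimal sketch. It uses an in-memory io.StringIO object as a stand-in for the opened transcript (an assumption made for illustration, with made-up content); a text file returned by open() behaves the same way when iterated.

```python
import io

# In-memory stand-in for the transcript file (hypothetical content).
infile = io.StringIO("Delimiter [1]\nsome text\nDelimiter CEO [2]\n")

first_pass = [line.strip() for line in infile]
second_pass = [line.strip() for line in infile]  # nothing left to read

infile.seek(0)  # rewind to the beginning of the stream
third_pass = [line.strip() for line in infile]

print(len(first_pass), len(second_pass), len(third_pass))  # prints: 3 0 3
```

The second pass is empty because the read position is already at the end; only after seek(0) does iteration start again from the first line.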

Secondly, you need to reset your flag copy to False once you are done writing to the file, so that text you don't need is not collected afterwards. You also need to empty your bucket to avoid writing the same lines multiple times in your output.
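
The flag-and-bucket pattern can be sketched in isolation as follows. The function name collect_sections and the markers "A" and "B" are hypothetical and chosen only for this illustration; the sketch works on a plain list of strings rather than on the files from the question.

```python
def collect_sections(lines, start, end):
    """Return the groups of lines found between `start` and `end` markers."""
    sections = []
    bucket = []
    copy = False
    for line in lines:
        if line == start:
            copy = True
        elif copy and line == end:
            sections.append(bucket)
            copy = False  # stop copying until the next start marker
            bucket = []   # begin the next section with an empty bucket
        elif copy:
            bucket.append(line)
    return sections


lines = ["A", "keep 1", "B", "skip", "A", "keep 2", "B"]
print(collect_sections(lines, "A", "B"))  # prints: [['keep 1'], ['keep 2']]
```

Without the copy reset, "skip" would be collected as well; without the bucket reset, "keep 1" would appear again in the second section.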

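Thirdly, the conditions in your if and elif statements contain several string checks that are not necessary. str.count() never returns -1 (that is the convention of str.find()), so the comparisons with -1 have no effect.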
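
As a quick illustration of why those checks can go, the snippet below compares str.count(), str.find() and the in operator on two delimiters taken from the example in the question. Using in as the test is a suggestion of this sketch, not something the original code does.

```python
delimiter = "Delimiter CEO [2]"
other = "Delimiter [3]"

ceo_counts = (delimiter.count("CEO"), other.count("CEO"))
print(ceo_counts)          # prints: (1, 0), count() is never negative

print(other.find("CEO"))   # prints: -1, find() is the method that returns -1

has_ceo = ("CEO" in delimiter, "CEO" in other)
print(has_ceo)             # prints: (True, False)
```

Note also that `a and b or c` is parsed as `(a and b) or c`, so mixing and with or in one condition usually calls for explicit parentheses.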

I have made the changes in the code below:

with open('testoestratto.txt', 'r', encoding='UTF-8') as infile, \
        open('test.txt', 'a', encoding='UTF-8') as outfile, \
        open('lista.txt', 'r', encoding='UTF-8') as mylist:
    text = mylist.readlines()
    text = [frase.strip('\n') for frase in text]
    bucket = []
    copy = False
    # Stop one short of the end so that text[i+1] always exists.
    for i in range(len(text) - 1):
        for line in infile:
            if line.strip('\n') == text[i] and text[i].count('CEO') > 0:
                copy = True      # start collecting after a CEO delimiter
            elif copy and line.strip('\n') == text[i+1]:
                for strings in bucket:
                    outfile.write(strings + '\n')
                copy = False     # stop collecting at the next delimiter
                bucket = list()  # empty the bucket for the next section
            elif copy:
                bucket.append(line.strip())
        infile.seek(0)           # rewind the transcript for the next delimiter

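With that being said, you can also optimize your code. With its three nested loops it is O(n^3) in the worst case, and in practice the main cost is that the entire transcript is reread once for every delimiter, whereas a single pass over it would be enough.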
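
One possible single-pass variant is sketched below. It is not the code from the answer above: the function name extract_ceo_sections is hypothetical, it works on lists of strings instead of the three files, and it assumes that every delimiter line in the transcript matches an entry of 'lista.txt' exactly after stripping whitespace.

```python
def extract_ceo_sections(transcript_lines, delimiters):
    """Collect the lines that follow a 'CEO' delimiter, up to the next delimiter."""
    delimiter_set = set(delimiters)  # constant-time membership tests
    extracted = []
    copy = False
    for raw in transcript_lines:
        line = raw.strip()
        if line in delimiter_set:
            # Any delimiter ends the current section; only a CEO one opens a new one.
            copy = "CEO" in line
        elif copy:
            extracted.append(line)
    return extracted


# Simplified sample data (the dotted separator lines are left out).
delimiters = ["Delimiter [1]", "Delimiter CEO [2]", "Delimiter [3]",
              "Delimiter CEO [4]", "Delimiter [5]"]
transcript = ["Delimiter [1]", "A text that I don't need",
              "Delimiter CEO [2]", "I need this text",
              "Delimiter [3]", "Delimiter CEO [4]",
              "I also need this text", "Delimiter [5]"]

result = extract_ceo_sections(transcript, delimiters)
print(result)  # prints: ['I need this text', 'I also need this text']
```

With real files, the two arguments could be the opened transcript and the stripped lines of 'lista.txt', and the returned lines could be written to 'test.txt'. Unlike the nested-loop version, this sketch also keeps text that follows a final CEO delimiter with no closing delimiter after it.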
