Find and delete lines in a file (Python 3)


Question

I use Python 3.

Okay, I have a file that looks like this:

id:1
1
34
22
52
id:2
1
23
22
31
id:3
2
12
3
31
id:4
1
21
22
11

How can I find and delete only this part of the file?

id:2
1
23
22
31

I have tried many times to do this but can't get it to work.

Recommended answer

Is the id used to decide which sequence to delete, or is the list of values used for the decision?

You can build a dictionary where the id number is the key (converted to int because of the later sorting) and the lines that follow it become the list of strings stored as the value for that key. Then you can delete the item with the key 2, traverse the items sorted by key, and output each id:key line followed by the formatted list of strings.
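A minimal sketch of the dictionary idea, under some assumptions: the helper names parse_sections and format_sections are hypothetical, the input is already available as a list of text lines, and each id occurs only once (a repeated id would overwrite the earlier section).

```python
import re

# Same id-line pattern as used later in this answer.
ID_LINE = re.compile(r'^id:(\d+)\s*$')

def parse_sections(lines):
    """Return {id: [value strings]} built from id-lines and the lines after them."""
    sections = {}
    current = None
    for line in lines:
        m = ID_LINE.match(line)
        if m:
            current = int(m.group(1))      # int key, so sorting is numeric
            sections[current] = []
        elif current is not None:          # lines before the first id are ignored
            sections[current].append(line.rstrip('\n'))
    return sections

def format_sections(sections):
    """Turn the dictionary back into text lines, sorted by id."""
    out = []
    for key in sorted(sections):
        out.append('id:%d' % key)
        out.extend(sections[key])
    return out

lines = ['id:1', '1', '34', 'id:2', '1', '23', 'id:3', '2', '12']
sections = parse_sections(lines)
del sections[2]                            # drop the unwanted section
result = format_sections(sections)
```

In this sketch the surviving sections keep their original numbers; renumbering would mean writing a fresh counter instead of the key.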

Or you can build a list of lists, which preserves the order. If the original id numbers are to be preserved (i.e. not renumbered), you can also keep the id:n line in the inner list.
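The list-of-lists variant could look roughly like this. It is only a sketch: parse_blocks and drop_block are hypothetical names, and the id line is stored as the first element of each inner list so that the original numbering and order survive.

```python
def parse_blocks(lines):
    """Group lines into [[id_line, value, value, ...], ...] in file order."""
    blocks = []
    for line in lines:
        text = line.rstrip('\n')
        if text.startswith('id:'):
            blocks.append([text])          # start a new block with its id line
        elif blocks:                       # lines before the first id are ignored
            blocks[-1].append(text)
    return blocks

def drop_block(blocks, id_line):
    """Return the blocks whose first element differs from id_line."""
    return [block for block in blocks if block[0] != id_line]

lines = ['id:1', '1', '34', 'id:2', '1', '23', 'id:3', '2', '12']
blocks = drop_block(parse_blocks(lines), 'id:2')
flat = [text for block in blocks for text in block]   # lines ready to write out
```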

This can be done for a reasonably sized file. If the file is huge, you should copy the source to the destination and skip the unwanted sequence on the fly. That last approach is also fairly easy for a small file.
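Copying while skipping on the fly might be sketched as follows. The function name copy_without_section is an assumption, and io.StringIO objects stand in for real files opened with open(); only one line is held in memory at a time.

```python
import io

def copy_without_section(fin, fout, unwanted_id_line):
    """Copy fin to fout line by line, leaving out one id section."""
    skipping = False
    for line in fin:
        if line.startswith('id:'):
            # Every id line decides whether the section that follows is skipped.
            skipping = (line.strip() == unwanted_id_line)
        if not skipping:
            fout.write(line)

# In-memory stand-ins for the source and destination files.
src = io.StringIO('id:1\n1\n34\nid:2\n1\n23\nid:3\n2\n12\n')
dst = io.StringIO()
copy_without_section(src, dst, 'id:2')
```

Note that in this sketch any lines before the first id line are copied unchanged.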

[Added after clarification]

I recommend learning the following approach, which is useful in many such cases. It uses a so-called finite automaton that implements actions bound to transitions from one state to another (see Mealy machine).

The text line is the input element here. The nodes that represent the context status are numbered. (In my experience it is not worth giving them names; keep them as plain numbers.) Only two states are used here, so status could easily be replaced by a boolean variable. However, if the case becomes more complicated, that leads to introducing another boolean variable, and the code becomes more error-prone.

The code may look complicated at first, but it is fairly easy to understand once you know that you can think about each if status == number branch separately. The status is the context mentioned above, which captures the previous processing. Do not try to optimize; leave the code that way. It can actually be decoded by a human later, and you can draw a picture similar to the Mealy machine example. If you do, it becomes much more understandable.

The wanted functionality is slightly generalized, so that a set of sections to delete can be passed as the first argument:

import re

def filterSections(del_set, fname_in, fname_out):
    '''Filtering out the del_set sections from fname_in. Result in fname_out.'''

    # The regular expression was chosen for detecting and parsing the id-line.
    # It can be done differently, but I consider it just fine and efficient.
    rex_id = re.compile(r'^id:(\d+)\s*$')

    # Let's open the input and output file. The files will be closed
    # automatically.
    with open(fname_in) as fin, open(fname_out, 'w') as fout:
        status = 1                 # initial status -- expecting the id line
        for line in fin:
            m = rex_id.match(line) # get the match object if it is the id-line

            if status == 1:      # skipping the non-id lines
                if m:              # you can also write "if m is not None:"
                    num_id = int(m.group(1))  # get the numeric value of the id
                    if num_id in del_set:     # if this id should be deleted
                        status = 1            # or pass (to stay in this status)
                    else:
                        fout.write(line)      # copy this id-line
                        status = 2            # to copy the following non-id lines
                #else ignore this line (no code needed to ignore it :)

            elif status == 2:      # copy the non-id lines
                if m:                         # the id-line found
                    num_id = int(m.group(1))  # get the numeric value of the id
                    if num_id in del_set:     # if this id should be deleted
                        status = 1            # skip the following non-id lines
                    else:
                        fout.write(line)      # copy this id-line
                        status = 2            # or pass (to stay in this status)
                else:
                    fout.write(line)          # copy this non-id line


if __name__ == '__main__':
    filterSections( {1, 3}, 'data.txt', 'output.txt')
    # or you can write the older set([1, 3]) for the first argument.

Here the output id-lines were given their original numbers. If you want to renumber the sections, it can be done via a simple modification. Try the code and ask for details.
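One possible shape of that modification is sketched below. It is not the author's code: filter_and_renumber is a hypothetical name, it works on any iterable of lines instead of file names, and it uses a boolean flag in place of the numeric status for brevity. A counter supplies the new id each time a section is kept.

```python
import re

def filter_and_renumber(del_set, lines):
    """Yield lines without the del_set sections, renumbering the kept ones."""
    rex_id = re.compile(r'^id:(\d+)\s*$')
    copying = False        # False until the first kept id-line is seen
    new_id = 0             # counter for the renumbered sections
    for line in lines:
        m = rex_id.match(line)
        if m:
            copying = int(m.group(1)) not in del_set
            if copying:
                new_id += 1
                yield 'id:%d\n' % new_id   # write the new number, not the old one
        elif copying:
            yield line                     # copy the non-id line of a kept section

sample = ['id:1\n', '1\n', 'id:2\n', '23\n', 'id:3\n', '12\n']
renumbered = list(filter_and_renumber({2}, sample))
```

With real files, the result could be written with fout.writelines(filter_and_renumber(del_set, fin)).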

Beware, finite automata have limited power. They cannot be used to parse the usual programming languages, because they are not able to capture nested paired structures (like parentheses).
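To illustrate the limitation: checking that parentheses are balanced needs a counter that can grow without bound, which a fixed, small set of numbered states cannot represent. A minimal sketch (is_balanced is a hypothetical helper, not part of the solution above):

```python
def is_balanced(text):
    """Return True if every '(' in text has a matching ')' in the right order."""
    depth = 0                  # unbounded counter; a finite state set cannot hold it
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a ')' appeared before its '('
                return False
    return depth == 0
```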

P.S. A file of 7000 lines is actually tiny from a computer's perspective ;)
