Processing lines of text file between two marker lines



My code processes lines read from a text file (see "Text Processing Details" at end). I need to amend my code so that it carries out the same task, but only with words in between certain points.

Code should not bother about this text. Skip it.

*****This is the marker to say where to start working with text. Don't do anything until after these last three asterisks.>***

Work with all of the code in this section

*****Stop working with the text when the first three asterisks are seen*****

Code should not bother about this text. Skip it.

The markers for all situations are three asterisks. Markers only count when they appear at the beginning and the end of the line.

What should I use to make my code only work in between the second and third set of asterisks?

Text Processing Details

My code reads a text file, makes all the words lowercase, and splits the words, putting them into a list:

infile = open(filename, 'r', encoding="utf-8")
text = infile.read().lower().split()

It then strips that list of all grammatical symbols in the words:

list_of_words = [word.strip('\n"-:\';,.') for word in text]

Finally, for each word in that list, if it only contains alphabetic symbols, it gets appended to a new list. That list is then returned:

list_2 = []
for word in list_of_words:
    if word.isalpha():
        list_2.append(word)
return list_2
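
Put together, the steps just described can be sketched as one function. The name words_from_text and the wrapping into a single function are assumptions for illustration; the question shows only fragments:

```python
def words_from_text(text):
    # Consolidated sketch of the steps above; the function name is
    # hypothetical -- the question shows only fragments.
    list_2 = []
    for word in text.lower().split():
        word = word.strip('\n"-:\';,.')   # strip grammatical symbols
        if word.isalpha():                # keep purely alphabetic words
            list_2.append(word)
    return list_2
```

For example, words_from_text('The "cat", sat; on 2 mats.') drops the digit and the punctuation, leaving only the lowercased alphabetic words.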

Solution

What appears to be one task, "count the words between two marker lines", is actually several. Separate the different tasks and decisions into separate functions and generators, and it will be vastly easier.

Step 1: Separate the file I/O from the word counting. Why should the word-counting code care where the words came from?

Step 2: Separate selecting the lines to process from the file handling and the word counting. Why should the word-counting code be given words it's not supposed to count? This is still far too big a job for one function, so it will be broken down further. (This is the part you're asking about.)

Step 3: Process the text. You've already done that, more or less. (I'll assume your text-processing code ends up in a function called words).

1. Separate file I/O

Reading text from a file is really two steps: first, open and read the file, then strip the newline off each line. These are two jobs.

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line

def lines_from_file(fname):
    with open(fname, 'rt', encoding='utf8') as flines:
        for line in stripped_lines(flines):
            yield line

Not a hint of your text processing here. The lines_from_file generator just yields whatever strings were found in the file... after stripping their trailing newline. (Note that a plain strip() would also remove leading and trailing whitespace, which you have to preserve to identify marker lines.)
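
The difference is easy to check directly. A quick sketch of why rstrip('\n') rather than strip() matters here (the sample lines are made up):

```python
line = '*** marker ***\n'
# rstrip('\n') removes only the trailing newline; the asterisks survive.
assert line.rstrip('\n') == '*** marker ***'

indented = '   *** not a marker ***   \n'
# A bare strip() also eats the surrounding spaces, which would make this
# non-marker line look exactly like a marker line.
assert indented.strip() == '*** not a marker ***'
# rstrip('\n') leaves the spaces alone, so marker detection can reject it.
assert indented.rstrip('\n') == '   *** not a marker ***   '
```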

2. Select only the lines between markers.

This is really more than one step. First, you have to know what is and isn't a marker line. That's just one function.

Then, you have to advance past the first marker (while throwing away any lines encountered), and finally advance to the second marker (while keeping any lines encountered). Anything after that second marker won't even be read, let alone processed.

Python's generators can almost solve the rest of Step 2 for you. The only sticking point is that closing marker... details below.

2a. What is and is not a marker line?

Identifying a marker line is a yes-or-no question, obviously the job of a Boolean function:

def is_marker_line(line, start='***', end='***'):
    '''
    Marker lines start and end with the given strings, which may not
    overlap.  (A line containing just '***' is not a valid marker line.)
    '''
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)

Note that a marker line need not (from my reading of your requirements) contain any text between the start and end markers --- six asterisks ('******') is a valid marker line.
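Condensed into a single expression, the same predicate can be sanity-checked against the edge cases just discussed (this is a restatement of is_marker_line above, not a change to it):

```python
def is_marker_line(line, start='***', end='***'):
    # Condensed restatement of the function above; the length check is
    # what rejects a bare '***' (start and end may not overlap).
    min_len = len(start) + len(end)
    return len(line) >= min_len and line.startswith(start) and line.endswith(end)

assert is_marker_line('******')                # six asterisks: valid
assert is_marker_line('*** some text ***')     # text in between: valid
assert not is_marker_line('***')               # too short for both markers
assert not is_marker_line('  *** padded ***')  # leading spaces: not a marker
```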

2b. Advance past the first marker line.

This step is now easy: just throw away every line until we find a marker line (and junk it, too). This function doesn't need to worry about the second marker line, or whether there are any marker lines at all, or anything else.

def advance_past_next_marker(lines):
    '''
    Advances the given iterator through the first encountered marker
    line, if any.
    '''
    for line in lines:
        if is_marker_line(line):
            break

2c. Advance past the second marker line, saving content lines.

A generator could easily yield every line after the "start" marker, but if it discovers there is no "end" marker, there's no way to go back and un-yield those lines. So, now that you've finally encountered lines you (might) actually care about, you'll have to save them all in a list until you know whether they're valid or not.

def lines_before_next_marker(lines):
    '''
    Yields all lines up to but not including the next marker line.  If
    no marker line is found, yields no lines.
    '''
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(line)
    else:
        # `for` loop did not break, meaning there was no marker line.
        valid_lines = []
    for content_line in valid_lines:
        yield content_line
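
The for...else is the load-bearing detail here: the else clause runs only when the loop finishes without hitting break. A minimal, self-contained illustration of the same shape (the name items_before_sentinel is made up for the demo):

```python
def items_before_sentinel(items, sentinel):
    # Same shape as lines_before_next_marker: collect until the sentinel,
    # and discard everything if the sentinel never appears.
    kept = []
    for item in items:
        if item == sentinel:
            break
        kept.append(item)
    else:
        # The loop exhausted the input without `break`: no sentinel found.
        kept = []
    return kept

assert items_before_sentinel(['a', 'b', 'STOP', 'c'], 'STOP') == ['a', 'b']
assert items_before_sentinel(['a', 'b', 'c'], 'STOP') == []  # no sentinel: keep nothing
```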

2d. Gluing Step 2 together.

Advance past the first marker, then yield everything until the second marker.

def lines_between_markers(lines):
    '''
    Yields the lines between the first two marker lines.
    '''
    # Must use the iterator --- if it's merely an iterable (like a list
    # of strings), the call to lines_before_next_marker will restart
    # from the beginning.
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

Testing functions like this with a bunch of input files is annoying. Testing them with lists of strings is easy, but lists are not generators or iterators, they're iterables. The one extra it = iter(...) line is worth it.
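
A test with a plain list of strings might look like this; the functions are repeated so the snippet runs on its own, and the sample lines are made up:

```python
def is_marker_line(line, start='***', end='***'):
    min_len = len(start) + len(end)
    return len(line) >= min_len and line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    for line in lines:
        if is_marker_line(line):
            break

def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(line)
    else:
        valid_lines = []
    for content_line in valid_lines:
        yield content_line

def lines_between_markers(lines):
    it = iter(lines)  # a plain list would restart on the second call
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

doc = [
    'skip this',
    '*** start ***',
    'keep one',
    'keep two',
    '*** stop ***',
    'skip this too',
]
assert list(lines_between_markers(doc)) == ['keep one', 'keep two']
# With no closing marker, nothing is yielded at all.
assert list(lines_between_markers(['x', '*** start ***', 'orphan'])) == []
```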

3. Process the selected lines.

Again, I'm assuming your text processing code is safely wrapped up in a function called words. The only change is that, instead of opening a file and reading it to produce a list of lines, you're given the lines:

def words(lines):
    text = '\n'.join(lines).lower().split()
    # Same as before...

...except that words should probably be a generator, too.
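Under that assumption, a generator version of words might look like the following sketch; the stripping rules are copied from the question:

```python
def words(lines):
    # Join the selected lines back into one text, then process as before:
    # lowercase, split, strip punctuation, keep only alphabetic words.
    text = '\n'.join(lines).lower().split()
    for word in text:
        word = word.strip('\n"-:\';,.')
        if word.isalpha():
            yield word

assert list(words(['The "cat" sat;', 'on 2 mats.'])) == ['the', 'cat', 'sat', 'on', 'mats']
```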

Now, calling words is easy:

def words_from_file(fname):
    for word in words(lines_between_markers(lines_from_file(fname))):
        yield word

To get words_from_file(fname), you yield the words found in the lines_between_markers, selected from the lines_from_file... Not quite English, but close.

4. Call words_from_file from your program.

Wherever you already have filename defined --- presumably inside main somewhere --- call words_from_file to get one word at a time:

filename = ...  # However you defined it before.
for word in words_from_file(filename):
    print(word)

Or, if you really need those words in a list:

filename = ...
word_list = list(words_from_file(filename))

Conclusion

This would have been much harder to squeeze into one or two functions. It wasn't just one task or decision, but many. The key was breaking the problem into tiny jobs, each of which was easy to understand and test.

The generators got rid of a lot of boilerplate code. Without generators, almost every function would have required a for loop just to some_list.append(next_item), like in lines_before_next_marker.

If you have Python 3.3+, the yield from ... construct erases even more boilerplate. Every generator containing a loop like this:

for line in stripped_lines(flines):
    yield line

can be rewritten as:

yield from stripped_lines(flines)

I counted four of them.
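
A tiny self-contained illustration of the equivalence:

```python
def yielded_with_loop():
    for n in range(3):
        yield n

def yielded_with_yield_from():
    yield from range(3)  # same behavior, one line instead of two

assert list(yielded_with_loop()) == list(yielded_with_yield_from()) == [0, 1, 2]
```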

For more on the subject of iterables, generators, and functions that use them, see Ned Batchelder's "Loop Like a Native", available as a 30-minute video from PyCon US 2013.
