How to split a text file into smaller files based on regex pattern?
Problem description
I have a file like the following:
SCN DD1251
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 C DD1271 R
DD1351 D DD1351 B
E
SCN DD1271
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
SCN DD1301
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1271 A DD1271 T
B
C
D
SCN DD1351
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A DD1251 D
DD1251 B
C
SCN DD1451
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A
B
C
SCN DD1601
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
A
B
C
D
SCN GA0101
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
B GC4251 D
GC420A C GA127A S
GA127A T
SCN GA0151
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
C GA0401 R G
GA0201 D GC0051 E H
GA0401 B GA0201 W
GC0051 A
The gap between records consists of a newline character followed by 81 spaces.
Using regex101.com, I created the following regular expression, which seems to match the gaps between records:
\s{81}\n
I combined it with the short loop below, which opens the file and is meant to write each section to a new file:
import re

delimiter_pattern = re.compile(r"\s{81}\n")
with open("Junctions.txt", "r") as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) == False:
            output = open('%d.txt' % i,'w')
            output.write(line)
        else:
            i+=1
However, I expected 2.txt, for example, to contain the following:
SCN DD1271
UPSTREAM DOWNSTREAM FILTER
NODE LINK NODE LINK LINK
DD1301 T DD1301 A
DD1251 R DD1251 C
Instead, it seems to output nothing at all. I tried modifying the code like so:
with open("Clean-Junction-Links1.txt", "r") as f:
    i = 1
    output = open('%d.txt' % i,'w')
    for line in f:
        if delimiter_pattern.match(line) == False:
            output.write(line)
        else:
            i+=1
But this instead produces several hundred blank text files.
What is wrong with my code, and how can I modify it to make it work? Failing that, is there a simpler way to split the file on the blank lines without using a regex?
Recommended answer
You don't need a regex for this, because you can easily detect the gap between blocks with the string strip() method.
input_file = 'Clean-Junction-Links1.txt'

with open(input_file, 'r') as file:
    i = 0
    output = None

    for line in file:
        if not line.strip():  # Blank line?
            if output:
                output.close()
            output = None
        else:
            if output is None:
                i += 1
                print(f'Creating file "{i}.txt"')
                output = open(f'{i}.txt','w')
            output.write(line)

    if output:
        output.close()

print('-fini-')
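As for why the original loop wrote nothing: one cause visible in the posted code is the test `delimiter_pattern.match(line) == False`. A compiled pattern's `match()` returns either a match object or `None`, never `False`, so that comparison is false for every line and the `else` branch always runs. The posted first attempt also appears to reopen the output file in `'w'` mode for each line, which would truncate it every time. The sketch below demonstrates the comparison problem and shows one way the regex approach could be repaired. The function name `split_records` and the sample lines are illustrative assumptions, not part of the original post.

```python
import re

delimiter_pattern = re.compile(r"\s{81}\n")

blank = " " * 81 + "\n"   # assumed shape of a gap line: 81 spaces, then newline
data = "SCN DD1271\n"

# match() gives a match object or None, so "== False" is never true.
print(delimiter_pattern.match(data) == False)    # False
print(delimiter_pattern.match(blank) == False)   # False
print(delimiter_pattern.match(data) is None)     # True


def split_records(lines, pattern=delimiter_pattern):
    """Group lines into records, closing a record at each delimiter line.

    Hypothetical helper: it only groups lines in memory; writing each
    group to its own file is left to the caller.
    """
    records, current = [], []
    for line in lines:
        if pattern.match(line) is None:   # ordinary data line
            current.append(line)
        elif current:                     # delimiter line ends the record
            records.append(current)
            current = []
    if current:                           # last record has no trailing gap
        records.append(current)
    return records


sample = [data, "DD1301 T DD1301 A\n", blank, "SCN DD1301\n"]
for number, record in enumerate(split_records(sample), start=1):
    print(number, record)
```

Each group could then be written out with `open(f'{number}.txt', 'w')`, opening every file once per record rather than once per line.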
Another, cleaner and more modular way to implement it is to divide the processing into two independent tasks that logically have very little to do with each other:
- Reading the file and grouping the lines of each record together.
- Writing each group of lines to a separate file.
The first can be implemented as a generator function that iteratively collects and yields the groups of lines making up each record. It's the one named extract_records() below.
input_file = 'Clean-Junction-Links1.txt'

def extract_records(filename):
    with open(filename, 'r') as file:
        lines = []
        for line in file:
            if line.strip():  # Not blank?
                lines.append(line)
            elif lines:  # Blank line ends a record; skip repeated blanks.
                yield lines
                lines = []
        if lines:
            yield lines

for i, record in enumerate(extract_records(input_file), start=1):
    print(f'Creating file {i}.txt')
    with open(f'{i}.txt', 'w') as output:
        output.write(''.join(record))

print('-fini-')
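The grouping task can also be sketched with the standard library's `itertools.groupby`, which batches consecutive items that share a key. This is an alternative to the generator shown above, not the answer's own method. The function name `group_records` and the sample lines are assumptions made for illustration.

```python
from itertools import groupby


def group_records(lines):
    """Yield lists of consecutive non-blank lines.

    The key is True for blank (whitespace-only) lines, so groupby
    alternates between runs of blank lines and runs of record lines.
    """
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield list(group)


sample = [
    "SCN DD1251\n",
    "DD1271 C DD1271 R\n",
    " " * 81 + "\n",          # assumed gap line between records
    "SCN DD1271\n",
    "DD1301 T DD1301 A\n",
]

for i, record in enumerate(group_records(sample), start=1):
    print(i, "".join(record), sep="\n")
```

Because `group_records` accepts any iterable of lines, an open file object can be passed to it directly, and runs of several blank lines collapse into a single separator.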