How to extract line numbers that match a regular expression in a text file
Question
I'm doing a project on statistical machine translation in which I need to extract the numbers of the lines in a POS-tagged text file that match a regular expression (any non-separated phrasal verb with the particle 'out'), and write those line numbers to a file (in Python).
I have this regular expression: '\w*_VB.?\sout_RP' and my POS-tagged text file: 'Corpus.txt'. I would like to get an output file containing the numbers of the lines that match the above-mentioned regular expression, with just one line number per line (no empty lines), e.g.:
2
5
44
So far all I have in my script is the following:
OutputLineNumbers = open('OutputLineNumbers', 'w')
with open('Corpus.txt', 'r') as textfile:
    phrase = r'\w*_VB.?\sout_RP'
    for phrase in textfile:
        pass  # matching logic not written yet
OutputLineNumbers.close()
Any idea how to solve this problem?
Thanks in advance for your help!
Recommended answer
This should solve your problem, presuming you have the correct regex in the variable 'phrase':
import re

# the regex from the question (raw string, so the backslashes stay intact)
phrase = r'\w*_VB.?\sout_RP'
# compile regex
regex = re.compile(phrase)
# open the files
with open('Corpus.txt', 'r') as inputFile:
    with open('OutputLineNumbers', 'w') as outputLineNumbers:
        # loop through each line in corpus, counting from 1
        for line_i, line in enumerate(inputFile, 1):
            # check if we have a regex match
            if regex.search(line):
                # if so, write the line number to the output file
                outputLineNumbers.write("%d\n" % line_i)
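To check the pattern before running it over the whole corpus, the same matching step can be tried on a few in-memory lines. The sketch below is only an illustration: the helper name `matching_line_numbers` and the sample sentences are made up, and the pattern assumes the question's regex was meant to contain `\w` and `\s`.

```python
import re

# Assumed pattern: the question's regex with the backslashes restored.
PHRASAL_OUT = re.compile(r'\w*_VB.?\sout_RP')


def matching_line_numbers(lines, regex=PHRASAL_OUT):
    """Return the 1-based numbers of the lines that contain a match."""
    return [n for n, line in enumerate(lines, 1) if regex.search(line)]


# Invented sample lines in word_TAG format, for illustration only.
sample = [
    "He_PRP went_VBD home_NN ._.",
    "She_PRP found_VBD out_RP the_DT truth_NN ._.",
    "They_PRP ran_VBD out_IN of_IN time_NN ._.",
    "We_PRP will_MD figure_VB out_RP a_DT plan_NN ._.",
]

numbers = matching_line_numbers(sample)
# One line number per line, with no blank lines in between.
print("\n".join(map(str, numbers)))
```

Because an open file object yields its lines one at a time, the same helper can be passed the file opened from 'Corpus.txt' instead of the sample list. Lines 1 and 3 are skipped here because they contain no word tagged out_RP directly after a verb tag.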