Remove specific lines from a large text file in Python

Problem description

I have several large text files that all have the same structure. I want to delete the first 3 lines and then remove illegal characters from the 4th line. I don't want to read the entire dataset into memory and then modify it, as each file is over 100MB with over 4 million records.

Range   150.0dB -64.9dBm
Mobile unit 1   Base    -17.19968    145.40369  999.8
Fixed unit  2   Mobile  -17.20180    145.29514  533.0
Latitude    Longitude   Rx(dB)  Best unit
-17.06694    145.23158  -050.5  2
-17.06695    145.23297  -044.1  2

So lines 1, 2 and 3 should be deleted, and in line 4 "Rx(dB)" should become just "Rx" and "Best unit" should be changed to "Best_Unit". Then I can use my other scripts to geocode the data.

I can't use command-line programs like grep (as in this question) because the first 3 lines are not the same in every file: the numbers (such as 150.0dB, -64*) change from file to file. Lines 1-3 therefore have to be deleted whole, and then grep or similar can do the search-and-replace on line 4.

Thanks guys,

=== EDIT: new pythonic way to handle larger files, from @heltonbiker. It raises the error shown below the code.

import os, re
##infile = arcpy.GetParameter(0)
##chunk_size = arcpy.GetParameter(1) # number of records in each dataset

infile='trc_emerald.txt'
fc= open(infile)
Name = infile[:infile.rfind('.')]
outfile = Name+'_db.txt'

line4 = fc.readlines(100)[3]
line4 = re.sub('\([^\)].*?\)', '', line4)
line4 = re.sub('Best(\s.*?)', 'Best_', line4)
newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
fc.close()
newfile = open(outfile, 'w')
newfile.write(newfilestring)
newfile.close()

del lines
del outfile
del Name
#return chunk_size, fl
#arcpy.SetParameterAsText(2, fl)
print "Completed"

Traceback (most recent call last):
  File "P:\2012\Job_044_DM_Radio_Propogation\Working\FinalPropogation\TRC_Emerald\working\clean_file_1c.py", line 13, in <module>
    newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
TypeError: 'builtin_function_or_method' object is unsubscriptable

Solution

As wim said in the comments, sed is the right tool for this. The following command should do what you want:

sed -i -e '4 s/(dB)//' -e '4 s/Best Unit/Best_Unit/' -e '1,3 d' yourfile.whatever

To explain the command a little:

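-i executes the command in place, that is, it writes the output back into the input file (this is the GNU sed form; BSD/macOS sed expects a backup suffix after -i, which may be empty: -i '')

-e adds a command to execute; it can be repeated, as it is here

'4 s/(dB)//' on line 4, substitute '' for '(dB)', i.e. delete it (in sed's default basic regular expressions the parentheses match literally)

'4 s/Best Unit/Best_Unit/' same as above, except with different find and replace strings. The match is case-sensitive, and the sample header above reads 'Best unit', so the pattern has to use the capitalization actually present in the file (for example, 'Best [Uu]nit' matches both)

'1,3 d' from line 1 to line 3 (inclusive), delete the entire line. Line numbers refer to the input, so the header is still line 4 even though this command comes last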
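
If sed is not available, roughly the same three steps can be written in Python without loading the file into memory. The sketch below is not part of the original answer. It assumes Python 3, the default text encoding, and the header layout shown in the question; the names clean_lines and clean_file are made up for illustration. It also avoids the TypeError from the question: readlines is a method that has to be called, and calling it reads every line at once, whereas iterating over the file object yields one line at a time.

```python
import re


def clean_lines(lines):
    """Yield cleaned lines: drop lines 1-3, fix the header on line 4, pass the rest through."""
    for number, line in enumerate(lines, start=1):
        if number <= 3:
            continue  # same effect as sed's '1,3 d'
        if number == 4:
            line = line.replace("(dB)", "")  # same effect as '4 s/(dB)//'
            # Case-insensitive, so both "Best unit" and "Best Unit" are matched.
            line = re.sub(r"Best unit", "Best_Unit", line, flags=re.IGNORECASE)
        yield line


def clean_file(src_path, dst_path):
    """Stream src_path into dst_path one line at a time (hypothetical helper)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(clean_lines(src))
```

With the file names from the question, this would be called as clean_file('trc_emerald.txt', 'trc_emerald_db.txt'). Unlike sed -i, it writes a new file and leaves the input untouched.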

sed is a really powerful tool that can do much more than just this, and it is well worth learning.
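
For comparison, the in-place behaviour of -i can be approximated in Python by streaming into a temporary file and then swapping it over the original. This is only a sketch under stated assumptions: Python 3.3 or later (for os.replace), text files in the default encoding, and the hypothetical names rewrite_in_place and drop_first_three, neither of which comes from the original answer. The replacement file gets the temporary file's permissions, so ownership and mode of the original are not preserved.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Replace the file at `path` with transform(lines), streaming through a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as dst, open(path) as src:
            dst.writelines(transform(src))
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        os.remove(tmp_path)  # clean up the partial temp file, then re-raise
        raise


def drop_first_three(lines):
    """Hypothetical example transform: skip lines 1-3 and keep everything else."""
    for number, line in enumerate(lines, start=1):
        if number > 3:
            yield line
```

Any generator that takes an iterable of lines and yields lines can be passed as transform, so the header fix for line 4 can be combined with the deletion in a single pass.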
