Remove specific lines from a large text file in Python

Question

I have several large text files that all have the same structure. I want to delete the first 3 lines and then remove illegal characters from the 4th line. I don't want to read the entire dataset and then modify it, as each file is over 100MB with over 4 million records.

Range   150.0dB -64.9dBm
Mobile unit 1   Base    -17.19968    145.40369  999.8
Fixed unit  2   Mobile  -17.20180    145.29514  533.0
Latitude    Longitude   Rx(dB)  Best unit
-17.06694    145.23158  -050.5  2
-17.06695    145.23297  -044.1  2

So lines 1, 2 and 3 should be deleted, and in line 4 "Rx(dB)" should become just "Rx" and "Best unit" should be changed to "Best_Unit". Then I can use my other scripts to geocode the data.

I can't use command-line programs like grep (as in this question) because the first 3 lines are not the same in every file: the numbers (such as 150.0dB, -64*) change from file to file, so lines 1-3 have to be deleted as a whole, after which grep or similar can do the search-and-replace on line 4.

Thanks guys,

=== EDIT: new Pythonic way to handle larger files, from @heltonbiker. It raises an error.

import os, re
##infile = arcpy.GetParameter(0)
##chunk_size = arcpy.GetParameter(1) # number of records in each dataset

infile='trc_emerald.txt'
fc= open(infile)
Name = infile[:infile.rfind('.')]
outfile = Name+'_db.txt'

line4 = fc.readlines(100)[3]
line4 = re.sub('\([^)].*?\)', '', line4)
line4 = re.sub('Best(\s.*?)', 'Best_', line4)
newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
fc.close()
newfile = open(outfile, 'w')
newfile.write(newfilestring)
newfile.close()

del lines
del outfile
del Name
#return chunk_size, fl
#arcpy.SetParameterAsText(2, fl)
print "Completed"

Traceback (most recent call last):
  File "P:2012Job_044_DM_Radio_PropogationWorkingFinalPropogationTRC_Emeraldworkingclean_file_1c.py", line 13, in <module>
    newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
TypeError: 'builtin_function_or_method' object is unsubscriptable

Solution

As wim said in the comments, sed is the right tool for this. The following command should do what you want:

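sed -i -e '4 s/(dB)//' -e '4 s/Best [Uu]nit/Best_Unit/' -e '1,3 d' yourfile.whatever

To explain the command a little:

-i edits the file in place, that is, it writes the output back into the input file (this form is GNU sed; BSD/macOS sed needs an explicit backup suffix argument, as in -i '')

-e adds a command to execute; it can be given several times

'4 s/(dB)//' on line 4, substitute '' for '(dB)'

'4 s/Best [Uu]nit/Best_Unit/' same as above, with a different find pattern and replacement; [Uu] matches both 'Best unit', as in the sample header, and 'Best Unit'

'1,3 d' from line 1 to line 3 (inclusive), delete the entire line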
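For readers who want to stay in Python, as the question's title suggests, the same three steps can be sketched as a line-by-line stream. This is only a minimal sketch, not the method from the answer: the names clean_stream and clean_file are hypothetical, and it assumes the header is always on line 4. The TypeError in the question comes from fc.readlines[4:], which subscripts the method object instead of calling it. Calling readlines() would also load the whole file into memory, which iterating over the file object avoids.

```python
import io
import re


def clean_stream(src, dst):
    """Copy src to dst one line at a time, dropping lines 1-3 and fixing line 4."""
    for lineno, line in enumerate(src, start=1):
        if lineno <= 3:
            continue  # counterpart of sed '1,3 d'
        if lineno == 4:
            # counterpart of '4 s/(dB)//' (first occurrence only, like sed without g)
            line = line.replace("(dB)", "", 1)
            # accepts "Best unit" (sample header) and "Best Unit" (question text)
            line = re.sub(r"Best [Uu]nit", "Best_Unit", line, count=1)
        dst.write(line)


def clean_file(in_path, out_path):
    """Hypothetical wrapper: stream in_path into a separate out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        clean_stream(src, dst)


if __name__ == "__main__":
    # Small in-memory demo using made-up lines shaped like the sample in the question.
    sample = io.StringIO(
        "Range\t150.0dB\t-64.9dBm\n"
        "Mobile unit 1\tBase\n"
        "Fixed unit 2\tMobile\n"
        "Latitude\tLongitude\tRx(dB)\tBest unit\n"
        "-17.06694\t145.23158\t-050.5\t2\n"
    )
    out = io.StringIO()
    clean_stream(sample, out)
    print(out.getvalue(), end="")
```

Only one line is held in memory at a time, so the file size does not matter much. The result goes to a second file, so the original stays untouched.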

sed is a really powerful tool that can do much more than this, and it is well worth learning.
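If sed is not available, the closest analogue to sed -i in the Python standard library is probably the fileinput module with inplace=True. The sketch below is an illustration under assumptions, not part of the original answer: the function name clean_in_place and the ".bak" backup suffix are hypothetical choices, and the header is assumed to sit on line 4 of every file.

```python
import fileinput
import re


def clean_in_place(paths, backup=".bak"):
    """Rewrite each file in place: drop lines 1-3 and fix the header on line 4.

    A copy of every original file is kept next to it with the backup suffix.
    """
    if not paths:
        return  # an empty list would make fileinput fall back to stdin
    with fileinput.input(files=paths, inplace=True, backup=backup) as stream:
        for line in stream:
            n = stream.filelineno()  # line number within the current file
            if n <= 3:
                continue  # skipped lines are simply not written back
            if n == 4:
                line = line.replace("(dB)", "", 1)
                line = re.sub(r"Best [Uu]nit", "Best_Unit", line, count=1)
            # with inplace=True, standard output is redirected into the file
            print(line, end="")
```

A call such as clean_in_place(["trc_emerald.txt"]) would process the file named in the question. Because filelineno() restarts for each file, several files can be passed in one call.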
