Extract data at specific columns in a line if there is any data at them
Question
I have a file with lines of data like the ones below. I need to pull out the characters at columns 74-79 and 122-124. Some lines do not have any characters at 74-79, and I want to skip those lines.
import re

def main():
    file = open("CCDATA.TXT", "r")
    lines = file.readlines()
    file.close()
    for line in lines:
        lines = re.sub(r" +", " ", line)
        print(lines)

main()
CF214L214L1671310491084111159 Customer Name 46081 171638440 0000320800000000HCCCIUAW 0612170609170609170300000000003135
CF214L214L1671310491107111509 Customer Name 46144 171639547 0000421200000000DRNRIUAW 0612170613170613170300000000003135
CF214L214L1671380999999900002000007420
CF214L214L1671310491084111159 Customer Name 46081 171638440 0000320800000000DRCSIU 0612170609170609170300000000003135
CF214L214L1671380999999900001000003208
CF214L214L1671510446646410055 Customer Name 46436 171677320 0000027200000272AA 0616170623170623170300000050003001
CF214L214L1671510126566110169 Customer Name 46450 171677321 0000117900001179AA 0616170623170623170300000250003001
CF214L214L1671510063942910172 Customer Name 46413 171677322 0000159300001593AA 0616170623170623170300000150003001
CF214L214L1671510808861010253 Customer Name 46448 171677323 0000298600002986AA 0616170623170623170300000350003001
CF214L214L1671510077309510502 Customer Name 46434 171677324 0000294300002943AA 0616170622170622170300000150003001
CF214L214L1671580999999900029000077728
CF214L214L1671610049631611165 Customer Name 46221 171677648 0000178700000000 0616170619170619170300000000003000
CF214L214L1671610895609911978 Customer Name 46433 171677348 0000011800000118AC 0616170622170622170300000150003041
CF214L214L1671680999999900002000001905
Recommended answer
Short answer:
Just take line[74:79] and the like, as Roelant suggested. Since the lines in your input are always 230 characters long, there will never be an IndexError (slicing a string past its end does not raise one in any case; it just returns a shorter or empty string), so you need to check whether the result is all whitespace with isspace() instead:
field = line[74:79]
<...>
if field.isspace(): continue
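As a minimal, self-contained sketch of the short answer: the function name `extract_fields` is made up for illustration, and the slice bounds `[74:79]` and `[122:125]` assume the 0-based offsets used in this answer, which may need shifting by one if the format spec counts columns from 1.

```python
def extract_fields(lines):
    """Yield (first, second) for every line whose slice [74:79] is not blank."""
    for line in lines:
        first = line[74:79]
        # Slicing never raises IndexError: a short line just gives '' here,
        # so strip() covers both "all spaces" and "line too short".
        if not first.strip():
            continue
        yield first, line[122:125]
```

Using `not first.strip()` rather than `first.isspace()` is a deliberate choice here, because `''.isspace()` is False and a line shorter than 75 characters would otherwise slip through.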
A more robust approach, which would also validate the input (check whether you are required to do so), is to parse the entire line and use specific elements of the result.
One way is a regex, as in the questions "Parse a text file and extract a specific column" and "Tips for reading in a complex file - Python", with an example at "get the path in a file inside {} by python".
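A regex for a fixed-width layout could look roughly like the sketch below. The pattern and the names `PATTERN` and `num2_or_none` are assumptions for illustration, not taken from the linked questions; the pattern only encodes "74 arbitrary characters followed by a 5-character field".

```python
import re

# Hypothetical layout: skip 74 characters, then capture the next 5.
PATTERN = re.compile(r"^.{74}(?P<num2>.{5})")

def num2_or_none(line):
    """Return the stripped 5-character field, or None if absent or blank."""
    match = PATTERN.match(line)
    if match is None:
        # The line has fewer than 79 characters before its newline.
        return None
    value = match.group("num2").strip()
    return value or None
```

Because `.` does not match a newline by default, a short line that ends early simply fails to match instead of capturing part of the line terminator.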
But your specific format appears to be an archaic, punchcard-derived one, in which the column number defines the meaning of each datum. Such a format can probably be expressed more conveniently as a sequence of column ranges associated with field names (you never told us what the fields mean, so I'm using generic names):
fields = [
    ("id1", (0, 39)),
    ("cname_text", (40, 73)),
    ("num2", (74, 79)),
    ("num3", (96, 105)),
    # Whether to introduce a separate field at [122:125]
    # or parse "id4" further after getting it is up to you.
    # I'd suggest you follow the official format spec.
    ("id4", (106, 130)),
    ("num5", (134, 168)),
]
line_end = 230
And parse it like this:
def parse_line(line, fields, end):
    result = {}
    # for whitespace validation
    # prev_ecol = 0
    for fname, (scol, ecol) in fields:
        # optionally validate delimiting whitespace
        # assert prev_ecol == scol or line[prev_ecol:scol].isspace()
        # lines in the input are always `end` symbols wide, so IndexError will never happen for a valid input
        field = line[scol:ecol]
        # optionally do conversion and such, this is completely up to you
        field = field.rstrip(' ')
        if not field: field = None
        result[fname] = field
        # for whitespace validation
        # prev_ecol = ecol
    # optionally validate line end
    # assert ecol == end or line[ecol:end].isspace()
    return result
All that is left is to skip the lines where the field is empty:
for line in lines:
    data = parse_line(line, fields, line_end)
    if any(data[fname] is None for fname in ('num2', 'id4')): continue
    # handle the data
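Put together, a compact runnable variant might look like the sketch below. It swaps the `(start, end)` tuples for `slice` objects, which is a substitution of mine rather than part of the answer, and keeps only two of the generically named fields; the column ranges are still the unconfirmed ones used above, and `parsed_records` is a hypothetical helper name.

```python
# Field name -> column range (0-based, end-exclusive). Assumed layout.
FIELDS = {
    "num2": slice(74, 79),
    "id4": slice(106, 130),
}

def parse_line(line, fields=FIELDS):
    """Return {name: stripped text, or None if the columns are blank}."""
    result = {}
    for name, cols in fields.items():
        text = line[cols].strip()
        result[name] = text or None
    return result

def parsed_records(lines, required=("num2", "id4")):
    """Yield parsed dicts, skipping lines where a required field is blank."""
    for line in lines:
        data = parse_line(line)
        if any(data[name] is None for name in required):
            continue
        yield data
```

Keeping the skip logic in a generator means the calling loop only ever sees complete records, so the data-handling code needs no `continue` of its own.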