Split file to chunk


Question




I am trying to split a file formatted as:

@some 
@garbage
@lines
@target G0.S0
@type xy
 -0.108847E+02  0.489034E-04
 -0.108711E+02  0.491023E-04
 -0.108574E+02  0.493062E-04
 -0.108438E+02  0.495075E-04
 -0.108302E+02  0.497094E-04
 ....unknown number of lines...
&
@target G0.S1
@type xy
 -0.108847E+02  0.315559E-04
 -0.108711E+02  0.316844E-04
 -0.108574E+02  0.318134E-04
 ....unknown number of lines...
&
@target G1.S0
@type xy
 -0.108847E+02  0.350450E-04
 -0.108711E+02  0.351669E-04
 -0.108574E+02  0.352908E-04
&
@target G1.S1
@type xy
 -0.108847E+02  0.216396E-04
 -0.108711E+02  0.217122E-04
 -0.108574E+02  0.217843E-04
 -0.108438E+02  0.218622E-04

The @target Gx.Sy combination is unique, and each set of data is always terminated by &.

I have managed to split the file into chunks as follows:

#!/usr/bin/env python3
import os
import sys
import itertools as it
import numpy as np
import matplotlib.pyplot as plt

try:
  filename = sys.argv[1]
  print(filename)
except IndexError:
  print("ERROR: Required filename not provided")

with open(filename, "r") as f:
  for line in f:
    if line.startswith("@target"):
      print(line.split()[-1].split("."))

x=[];y=[]
with open(filename, "r") as f:
  for key,group in it.groupby(f,lambda line: line.startswith('@target')):
    print(key)
    if not key:
        group = list(group)
        group.pop(0)
        # group.pop(-1)
        print(group)
        for i in range(len(group)):
          x.append(group[i].split()[0])
          y.append(group[i].split()[1])
        nx=np.array(x)
        ny=np.array(y)

I have two problems:

1) The preamble lines before the real data are also grouped, so the script does not work if there is any preamble. It is impossible to predict how many such lines there will be, but I am trying to group after the @target, and

2) I want to name the arrays as G0[S0,S0] and G1[S1,S2], but I can't do this.

Kindly help.

UPDATE: I am trying to store this data in a nested NumPy array of G0[S0,S1,...], G1[S0,S1,...] so that I can use it in matplotlib.

Solution

The functions below get the job done:

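import numpy as np
from collections import defaultdict

def read_without_preamble(filename):
    """Return the lines of the file, starting at the first '@target' line."""
    with open(filename, 'r') as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith('@target'):
            return lines[i:]
    return []  # no '@target' line found

def split_into_chunks(lines):
    """Build a nested dict so that chunks[G][S] holds the (x, y) rows."""
    chunks = defaultdict(dict)
    for line in lines:
        if line.startswith('@target'):
            # '@target G0.S1' -> ['G0', 'S1'] -> G = 0, S = 1
            GS_str = line.strip().split()[-1].split('.')
            G, S = map(lambda x: int(x[1:]), GS_str)
            chunks[G][S] = []
        elif line.startswith('@type xy'):
            pass
        elif line.startswith('&'):
            # End of the chunk: convert the collected rows to a NumPy array
            chunks[G][S] = np.asarray(chunks[G][S])
        else:
            xy_str = line.strip().split()
            # list(...) is required in Python 3, where map returns an iterator
            chunks[G][S].append(list(map(float, xy_str)))
    return chunks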

To split your file into chunks you just need to run this code:

import sys

try:
  filename = sys.argv[1]
  print(filename)
except IndexError:
  print("ERROR: Required filename not provided")
  sys.exit(1)  # stop here: filename is undefined without an argument

data = read_without_preamble(filename)
chunks = split_into_chunks(data)
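
The sample in the question ends its last block without a closing &, and real files may also contain blank lines. The following is only a sketch of a single-pass variant that tolerates both; parse_chunks and sample are hypothetical names introduced here for illustration, not part of the answer's code, and the sketch assumes every data line holds whitespace-separated numbers.

```python
import numpy as np
from collections import defaultdict


def parse_chunks(lines):
    """Parse '@target Gx.Sy' blocks into chunks[G][S] -> array of shape (rows, 2)."""
    rows = defaultdict(dict)   # rows[G][S] is a plain list while parsing
    current = None             # list being filled, or None outside a block
    for line in lines:
        text = line.strip()
        if text.startswith('@target'):
            # '@target G0.S1' -> G = 0, S = 1
            g_str, s_str = text.split()[-1].split('.')
            current = rows[int(g_str[1:])].setdefault(int(s_str[1:]), [])
        elif text.startswith('&'):
            current = None     # explicit end of the block
        elif current is None or not text or text.startswith('@'):
            continue           # preamble, blank lines, '@type xy', etc.
        else:
            current.append([float(v) for v in text.split()])
    # Convert at the very end, so an unterminated last block is kept too.
    return {g: {s: np.asarray(r) for s, r in by_s.items()}
            for g, by_s in rows.items()}


# Hypothetical miniature input: preamble, one closed block, one unclosed block.
sample = """@some
@garbage
@target G0.S0
@type xy
 -0.108847E+02  0.489034E-04
 -0.108711E+02  0.491023E-04
&
@target G1.S1
@type xy
 -0.108847E+02  0.216396E-04
""".splitlines()

chunks = parse_chunks(sample)
print(chunks[0][0].shape)
print(chunks[1][1])
```

Deferring the conversion to NumPy arrays until the whole input has been read is a design choice of this sketch: it avoids relying on & to trigger the conversion.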

Stepwise demo

chunks is a dictionary (a defaultdict, which is a subclass of dict) in which the key is G (either 0 or 1):

In [415]: type(chunks)
Out[415]: collections.defaultdict

In [416]: for k in chunks.keys(): print(k)
0
1

Each value of the dictionary chunks is another dictionary in which the key is S (0, 1, or 2 in this example) and the value is a NumPy array containing the numeric data for Gi.Sn. You can access this chunk of data like this: chunks[i][n], where the indices i and n are the values of G and S, respectively.

In [417]: type(chunks[0])
Out[417]: dict

In [418]: for k in chunks[0].keys(): print(k)
0
1
2

In [419]: type(chunks[1][2])
Out[419]: numpy.ndarray

In [420]: chunks[1][2]
Out[420]: 
array([[ -1.08851000e+01,   2.53058000e-05],
       [ -1.08715000e+01,   2.55353000e-05],
       [ -1.08579000e+01,   2.57745000e-05],
       [ -1.08443000e+01,   2.60225000e-05],
       [ -1.08306000e+01,   2.62617000e-05],
       [ -1.08170000e+01,   2.65097000e-05],
       [ -1.08034000e+01,   2.67666000e-05]])
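
To visit every chunk rather than one at a time, the nested dictionary can be walked with two loops. This is a minimal sketch: the chunks literal below is a hypothetical stand-in with the same layout as the parsed result, and labels is a name introduced only for this example.

```python
import numpy as np

# Hypothetical stand-in for the parsed result: chunks[G][S] -> (rows, 2) array.
chunks = {
    0: {0: np.array([[-10.8851, 1.27435e-04], [-10.8715, 1.27829e-04]]),
        1: np.array([[-10.8851, 4.72694e-05]])},
    1: {0: np.array([[-10.8851, 1.08786e-04]])},
}

labels = []
for g in sorted(chunks):            # outer keys are the G indices
    for s in sorted(chunks[g]):     # inner keys are the S indices
        data = chunks[g][s]
        labels.append(f"G{g}.S{s}")
        print(f"G{g}.S{s}: {len(data)} rows")
```

Sorting the keys makes the traversal order independent of the order in which the blocks appeared in the file.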

chunks[i][n].shape[1] is 2 for any i and n, but chunks[i][n].shape[0] can take any value, i.e. the number of rows of numeric data may vary from one chunk to another.
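
As a quick check of the array layout, assuming one (x, y) pair per row as in the Out[420] display above, the sketch below unpacks the shape and separates the two columns. The array chunk is a hypothetical three-row example, not data read from the file.

```python
import numpy as np

# Hypothetical chunk with the same layout as chunks[i][n]: one (x, y) pair per row.
chunk = np.array([[-10.8851, 2.53058e-05],
                  [-10.8715, 2.55353e-05],
                  [-10.8579, 2.57745e-05]])

n_rows, n_cols = chunk.shape   # axis 0 counts rows, axis 1 counts columns
x = chunk[:, 0]                # first column
y = chunk[:, 1]                # second column
print(n_rows, n_cols)
print(x)
```

The two columns can then be passed to matplotlib, for instance plt.plot(x, y, label="G1.S2") after import matplotlib.pyplot as plt, which is the use the question mentions.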

formatted_file.txt

This is the file I used in the sample run. It consists of six chunks, namely G0.S0, G0.S1, G0.S2, G1.S0, G1.S1, and G1.S2.

@some 
@garbage
@lines
@target G0.S0
@type xy
 -0.108851E+02  0.127435E-03
 -0.108715E+02  0.127829E-03
 -0.108579E+02  0.128191E-03
 -0.108443E+02  0.128502E-03
 -0.108306E+02  0.128726E-03
 -0.108170E+02  0.128838E-03
 -0.108034E+02  0.128751E-03
&
@target G0.S1
@type xy
 -0.108851E+02  0.472694E-04
 -0.108715E+02  0.474233E-04
 -0.108579E+02  0.475837E-04
 -0.108443E+02  0.477448E-04
 -0.108306E+02  0.479052E-04
 -0.108170E+02  0.480669E-04
 -0.108034E+02  0.482279E-04
&
@target G0.S2
@type xy
 -0.108851E+02  0.253654E-04
 -0.108715E+02  0.255956E-04
 -0.108579E+02  0.258346E-04
 -0.108443E+02  0.260825E-04
 -0.108306E+02  0.263303E-04
 -0.108170E+02  0.265781E-04
 -0.108034E+02  0.268349E-04
&
@target G1.S0
@type xy
 -0.108851E+02  0.108786E-03
 -0.108715E+02  0.109216E-03
 -0.108579E+02  0.109651E-03
 -0.108443E+02  0.110116E-03
 -0.108306E+02  0.110552E-03
 -0.108170E+02  0.111011E-03
 -0.108034E+02  0.111489E-03
&
@target G1.S1
@type xy
 -0.108851E+02  0.278045E-04
 -0.108715E+02  0.278711E-04
 -0.108579E+02  0.279384E-04
 -0.108443E+02  0.280050E-04
 -0.108306E+02  0.280723E-04
 -0.108170E+02  0.281395E-04
 -0.108034E+02  0.282074E-04
&
@target G1.S2
@type xy
 -0.108851E+02  0.253058E-04
 -0.108715E+02  0.255353E-04
 -0.108579E+02  0.257745E-04
 -0.108443E+02  0.260225E-04
 -0.108306E+02  0.262617E-04
 -0.108170E+02  0.265097E-04
 -0.108034E+02  0.267666E-04
&
