Chunk a text database into N equal blocks and retain header


Problem description


I have several large (30+ million line) text databases which I am cleaning up with the following code. I need to split each file into chunks of 1 million lines or fewer while retaining the header line. I have looked at chunk and itertools but can't find a clear solution. It is for use in an ArcGIS model.

== updated code as per response from icyrock.com

import arcpy, os
#fc = arcpy.GetParameter(0)
#chunk_size = arcpy.GetParameter(1) # number of records in each dataset

fc='input.txt'
Name = fc[:fc.rfind('.')]
fl = Name+'_db.txt'

with open(fc) as f:
  lines = f.readlines()
lines[:] = lines[3:]  # drop the first 3 junk lines
lines[0] = lines[0].replace('Rx(db)', 'Rx_'+Name)
lines[0] = lines[0].replace('Best Unit', 'Best_Unit')
records = len(lines)
with open(fl, 'w') as f:
  f.write(''.join(lines))  # lines already end with '\n'; '\n'.join would double-space the file

with open(fl) as file:
  lines = file.readlines()

headers = lines[0:1]
rest = lines[1:]
chunk_size = 1000000

def chunks(lst, chunk_size):
  for i in xrange(0, len(lst), chunk_size):
    yield lst[i:i + chunk_size]

def write_rows(rows, file):
  for row in rows:
    file.write('%s' % row)

part = 1
for chunk in chunks(rest, chunk_size):
  with open(Name+'_%d' % part+'.txt', 'w') as file:
    write_rows(headers, file)
    write_rows(chunk, file)
  part += 1

See Remove specific lines from a large text file in python and split a large text (xyz) database into x equal parts for background. I no longer want a cygwin based solution, as it overcomplicates the model; I need a Pythonic way. We can use "records" to iterate through, but what is not clear is how to specify lines 1 to 999,999 in db #1, lines 1,000,000 to 1,999,999 in db #2, etc. It's fine if the last dataset has fewer than 1 million records.
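Concretely, the row-range bookkeeping is just integer arithmetic; here is a minimal sketch (the chunk_bounds helper is illustrative only, not part of the model):

chunk_size = 1000000

def chunk_bounds(records, chunk_size):
  # Yield (part, start, end): data rows start..end-1 go into part N;
  # the last part may hold fewer than chunk_size rows.
  for part, start in enumerate(xrange(0, records, chunk_size), 1):
    yield part, start, min(start + chunk_size, records)

# e.g. records=2500000 gives (1, 0, 1000000), (2, 1000000, 2000000),
# (3, 2000000, 2500000)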

Error with a 500 MB file (I have 16 GB RAM).

Traceback (most recent call last):
  File "P:\2012\Job_044_DM_Radio_Propogation\Working\test\clean_file.py", line 10, in <module>
    lines = f.readlines()
MemoryError

records 2249878

The record count above is not the total record count; it's just where it ran out of memory (I think).
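For reference, the cleanup step itself can be done streaming, so nothing close to the file size is ever held in memory; a minimal sketch under the same assumptions as the code above (3 junk lines, then the header):

fc = 'input.txt'
Name = fc[:fc.rfind('.')]
fl = Name + '_db.txt'

with open(fc) as fin:
  with open(fl, 'w') as fout:
    for _ in xrange(3):  # skip the 3 junk lines
      next(fin)
    header = next(fin)
    header = header.replace('Rx(db)', 'Rx_' + Name)
    header = header.replace('Best Unit', 'Best_Unit')
    fout.write(header)
    for line in fin:  # copy the remaining lines one at a time
      fout.write(line)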

=== With the new code from Icyrock.

The chunking seems to work OK, but it gives errors when used in ArcGIS.

Start Time: Fri Mar 09 17:20:04 2012
WARNING 000594: Input feature 1945882430: falls outside of output geometry domains.
WARNING 000595: d:\Temp\cb_vhn007_1.txt_Features1.fid contains the full list of features not able to be copied.

I know it is an issue with the chunking, as the "Make Event Layer" process works fine with the full pre-chunked dataset.
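For what it's worth, a rough integrity check (a sketch only; it assumes the single header line and the Name_N.txt naming from the code above) can confirm that the chunk files reassemble to the cleaned original:

import glob

Name = 'input'  # as in the cleanup code above

def data_lines(path, header_lines=1):
  # Count data lines by streaming, so large files are fine.
  with open(path) as f:
    return sum(1 for _ in f) - header_lines

parts = [p for p in glob.glob(Name + '_*.txt') if p != Name + '_db.txt']
print(sum(data_lines(p) for p in parts))  # total data lines across chunks...
print(data_lines(Name + '_db.txt'))       # ...should equal the original's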

Any ideas?

Solution

You can do something like this:

with open('file') as file:
  lines = file.readlines()

headers = lines[0:1]  # the header line(s) to repeat in every part
rest = lines[1:]      # the data rows to be chunked
chunk_size = 4

def chunks(lst, chunk_size):
  # Yield successive chunk_size-sized slices of lst.
  for i in xrange(0, len(lst), chunk_size):
    yield lst[i:i + chunk_size]

def write_rows(rows, file):
  for row in rows:
    file.write('%s' % row)

part = 1
for chunk in chunks(rest, chunk_size):
  with open('part%d' % part, 'w') as file:
    write_rows(headers, file)  # header goes first in every part
    write_rows(chunk, file)
  part += 1

Here's a test run:

$ cat file && python mkt.py && for p in part*; do echo ---- $p; cat $p; done
header
1
2
3
4
5
6
7
8
9
10
11
12
13
14
---- part1
header
1
2
3
4
---- part2
header
5
6
7
8
---- part3
header
9
10
11
12
---- part4
header
13
14

Obviously, change the value of chunk_size, and adjust how you fetch the headers depending on how many header lines there are.
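For example, with a multi-line header the slicing generalizes directly (headers_count here stands in for however many header lines your files carry):

headers_count = 5
headers = lines[0:headers_count]
rest = lines[headers_count:]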


Edit - to do this line-by-line to avoid memory issues, you can do something like this:

from itertools import islice

headers_count = 5
chunk_size = 250000

with open('file') as fin:
  headers = list(islice(fin, headers_count))  # read the header block once

  part = 1
  while True:
    line_iter = islice(fin, chunk_size)  # the next chunk_size lines of fin
    try:
      first_line = line_iter.next()      # peek one line; stop cleanly at EOF
    except StopIteration:
      break
    with open('part%d' % part, 'w') as fout:
      for line in headers:
        fout.write(line)
      fout.write(first_line)
      for line in line_iter:
        fout.write(line)
    part += 1
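Note that each islice(fin, chunk_size) draws from the same underlying file object, so every pass resumes exactly where the previous chunk stopped, and pulling first_line eagerly is what detects end-of-file and prevents an empty trailing part from being written.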


Test case (put the above in a file called mkt2.py):

Make a file containing a 5-line header and 1234567 data lines:

with open('file', 'w') as fout:
  for i in range(5):
    fout.write(10 * ('header %d ' % i) + '\n')
  for i in range(1234567):
    fout.write(10 * ('line %d ' % i) + '\n')

Shell script to test (put in a file called rt.sh):

rm part*
echo ---- file
head -n7 file
tail -n2 file

python mkt2.py

for i in part*; do
  echo ---- $i
  head -n7 $i
  tail -n2 $i
done

Sample output:

$ sh rt.sh 
---- file
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 
line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 
line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 
line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 
---- part1
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 
line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 
line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 
line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 
---- part2
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 
line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 
line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 
line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 
---- part3
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 
line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 
line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 
line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 
---- part4
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 
line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 
line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 
line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 
---- part5
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 
line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 
line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 
line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 

Timing of the above:

real    0m0.935s
user    0m0.708s
sys     0m0.200s

Hope this helps.
