Split a large text (xyz) database into x equal parts


Problem description


I want to split a large text database (~10 million lines). I can use commands like:

$ sed -i -e '4 s/(dB)//' -e '4 s/Best\ unit/Best_Unit/' -e '1,3 d' '/cygdrive/c/Radio Mobile/Output/TRC_TestProcess/trc_longlands.txt'

$ split -l 1000000  /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands/trc_longlands.txt 1

The first command cleans the database and the second splits it, but the output files then lack the field names. How can I include the field names in each part, and write out a list containing the original file name, the new file name, and the line numbers (from the original file)? The list is needed so that the ArcGIS model can re-join the final simplified polygon datasets.

Alternatively, and more usefully: since this needs to go into an ArcGIS model, a Python-based solution would be best. More details are in http://gis.stackexchange.com/questions/21420/large-point-to-polygon-by-buffer-join-buffer-dissolve-issues#comment29062_21420 and in the question "Remove specific lines from a large text file in python".

So, going with a Cygwin-based Python solution, as per the answer by icyrock.com.

We have process_text.sh:

cd  /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
mkdir processing
cp trc_longlands.txt processing/trc_longlands.txt
cd txt_processing
sed -i -e '4 s/(dB)//' -e '4 s/Best\ unit/Best_Unit/' -e '1,3 d' 'trc_longlands.txt'
split -l 1000000  trc_longlands.txt trc_longlands_
cat > a
h
1
2
3
4
5
6
7
8
9
^D
split -l 3
split -l 3 a 1
mv 1aa 21aa
for i in 1*; do head -n1 21aa|cat - $i > 2$i; done
for i in 21*; do echo ---- $i; cat $i; done

How can "TRC_Longlands" and the path be replaced with the input file name? In Python we have %path%/%name for this. Also, is "do echo" in the last line necessary?

This is called from Python using:

import os
os.system("process_text.bat")

where process_text.bat is basically

bash process_text.sh

I get the following error when running it from DOS:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\georgec>bash P:\2012\Job_044_DM_Radio_Propogation\Working\FinalPropogation\TRC_Longlands\process_text.sh
'bash' is not recognized as an internal or external command,
operable program or batch file.

Also, when I run the bash command from Cygwin, I get:

georgec@ATGIS25 /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
$ bash process_text.sh
: No such file or directory: /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
cp: cannot create regular file `processing/trc_longlands.txt\r': No such file or directory
: No such file or directory: txt_processing
: No such file or directoryds.txt

But the files are created in the root directory.

Why is there a "." after the directory name? How can they be given a .txt extension?

Solution

If you want to just prepend the first line of the original file to all but the first of the splits, you can do something like:

$ cat > a
h
1
2
3
4
5
6
7
^D
$ split -l 3 a 1
$ ls
1aa 1ab 1ac a
$ mv 1aa 21aa
$ for i in 1*; do head -n1 21aa|cat - $i > 2$i; done
$ for i in 21*; do echo ---- $i; cat $i; done
---- 21aa
h
1
2
---- 21ab
h
3
4
5
---- 21ac
h
6
7

Obviously, the first file will have one data line fewer than the middle parts (its first line is the header itself), and the last part might be shorter too, but if that's not a problem, this should work just fine. Of course, if your header has more lines, just change head -n1 to head -nX, X being the number of header lines.
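Since the question asks for a Python-based route, the same idea can be sketched in Python. This is a minimal sketch rather than a drop-in tool: the function names `split_with_header` and `write_manifest`, the output naming scheme (`<name>_001.txt`, `<name>_002.txt`, ...), and the manifest layout are all assumptions made for illustration. Unlike the shell transcript above, it gives every part the same number of data lines (except possibly the last), and it records which lines of the input file each part holds.

```python
import csv
import os


def split_with_header(src_path, out_dir, lines_per_part, header_lines=1):
    """Split src_path into parts of at most lines_per_part data lines.

    Each part starts with a copy of the header. Returns a manifest: a list
    of (source, part_path, first_line, last_line) tuples, where the line
    numbers are 1-based positions in the input file (header included).
    """
    os.makedirs(out_dir, exist_ok=True)
    stem, ext = os.path.splitext(os.path.basename(src_path))
    manifest = []
    out = None
    count = 0  # data lines written to the current part
    try:
        # newline="" keeps the line endings exactly as they are in the input
        with open(src_path, "r", newline="") as src:
            header = [src.readline() for _ in range(header_lines)]
            for line_no, line in enumerate(src, start=header_lines + 1):
                if out is None or count == lines_per_part:
                    if out is not None:
                        out.close()
                    part_name = "%s_%03d%s" % (stem, len(manifest) + 1, ext)
                    part_path = os.path.join(out_dir, part_name)
                    out = open(part_path, "w", newline="")
                    out.writelines(header)
                    manifest.append([src_path, part_path, line_no, line_no])
                    count = 0
                out.write(line)
                manifest[-1][3] = line_no  # last input line in this part
                count += 1
    finally:
        if out is not None:
            out.close()
    return [tuple(row) for row in manifest]


def write_manifest(manifest, manifest_path):
    """Write the manifest as CSV so it can be used to re-join results later."""
    with open(manifest_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "part", "first_line", "last_line"])
        writer.writerows(manifest)
```

For example, `split_with_header("trc_longlands.txt", "processing", 1000000)` would write parts of up to one million data lines each into a `processing` directory. The file is read one line at a time, so a 10-million-line input should not need to fit in memory.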

Hope this helps.
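On the Cygwin errors in the question: the `\r` at the end of the file name in the `cp` message suggests that process_text.sh was saved with Windows (CRLF) line endings, so bash treats the carriage return as part of each command argument. That would also explain the stray character after the directory names. A minimal sketch of converting the script to Unix line endings from Python follows; the function name `to_unix_line_endings` is an assumption for illustration, and it assumes the script is small enough to read into memory.

```python
def to_unix_line_endings(path):
    """Rewrite the file at path in place, converting CRLF line endings to LF.

    Returns the number of CRLF pairs that were replaced.
    """
    # Binary mode, so Python does not translate the line endings itself
    with open(path, "rb") as f:
        data = f.read()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed != data:
        with open(path, "wb") as f:
            f.write(fixed)
    return data.count(b"\r\n")
```

Calling `to_unix_line_endings("process_text.sh")` once before running the script should be enough. If the `dos2unix` utility is installed in Cygwin, it does the same job from the shell.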
