Split a large text (xyz) database into x equal parts


Question

I want to split a large text database (~10 million lines). I can use a command like

$ sed -i -e '4 s/(dB)//' -e '4 s/Best\ unit/Best_Unit/' -e '1,3 d' '/cygdrive/c/Radio Mobile/Output/TRC_TestProcess/trc_longlands.txt'

$ split -l 1000000  /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands/trc_longlands.txt 1

The first command cleans the database and the second splits it, but the output files then lack the field names. How can I include the field names in each dataset and also output a list containing the original file, the new file name and the line numbers (from the original file)? This is so that the list can be used in the ArcGIS model to re-join the final simplified polygon datasets.

Alternatively, and more usefully: since this needs to go into an ArcGIS model, a Python-based solution is best. More details are in https://gis.stackexchange.com/questions/21420/large-point-to-polygon-by-buffer-join-buffer-dissolve-issues#comment29062_21420 and in "Remove specific lines from a large text file in python".

So I am going with a Cygwin-based solution called from Python, as per the answer by icyrock.com.

We have process_text.sh:

cd  /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
mkdir processing
cp trc_longlands.txt processing/trc_longlands.txt
cd txt_processing
sed -i -e '4 s/(dB)//' -e '4 s/Best\ unit/Best_Unit/' -e '1,3 d' 'trc_longlands.txt'
split -l 1000000  trc_longlands.txt trc_longlands_
cat > a
h
1
2
3
4
5
6
7
8
9
^D
split -l 3
split -l 3 a 1
mv 1aa 21aa
for i in 1*; do head -n1 21aa|cat - $i > 2$i; done
for i in 21*; do echo ---- $i; cat $i; done

How can "TRC_Longlands" and the path be replaced with the input filename? In Python we have %path%/%name for this. Is "do echo" in the last line necessary?

This is called from Python using:

import os
os.system("process_text.bat")

where process_text.bat is basically:

bash process_text.sh

I get the following error when it is run from DOS:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\georgec>bash P:\2012\Job_044_DM_Radio_Propogation\Working\FinalPropogation\TRC_Longlands\process_text.sh
'bash' is not recognized as an internal or external command, operable program or batch file.
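The "'bash' is not recognized" message usually means the Cygwin bin directory is not on the Windows PATH. One way around it, sketched below, is to call bash by its full path from Python instead of going through a .bat file. The install location C:\cygwin\bin\bash.exe and the helper names build_bash_command and run_script are assumptions for illustration; the --login flag is used on the assumption that a login shell is needed for sed and split to be found.

```python
import subprocess

# Assumed default Cygwin install location; adjust to the actual installation.
CYGWIN_BASH = r"C:\cygwin\bin\bash.exe"


def build_bash_command(script_path, bash_exe=CYGWIN_BASH):
    """Return the argument list that runs script_path in a Cygwin login shell."""
    return [bash_exe, "--login", script_path]


def run_script(script_path, bash_exe=CYGWIN_BASH):
    """Run the script with Cygwin bash and return its exit code."""
    return subprocess.call(build_bash_command(script_path, bash_exe))
```

A call might look like run_script("/cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands/process_text.sh"), passing the Cygwin-style path so that bash can resolve it.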

Also, when I run the bash command from Cygwin, I get:

georgec@ATGIS25 /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
$ bash process_text.sh
: No such file or directory: /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands
cp: cannot create regular file `processing/trc_longlands.txt\r': No such file or directory
: No such file or directory: txt_processing
: No such file or directoryds.txt

But the files are created in the root directory.

Why is there a "." after the directory name? How can the output files be given a .txt extension?
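The `\r` in the cp error suggests that process_text.sh was saved with Windows (CRLF) line endings, so bash treats the carriage return as part of each argument; that would also explain the stray character after the directory name and the overwritten error text. A minimal Python sketch for normalising the script is shown here; the function name to_unix_line_endings is hypothetical, and a tool such as dos2unix does the same job.

```python
def to_unix_line_endings(path):
    """Rewrite the file in place with LF line endings.

    Returns the number of CRLF pairs that were replaced.
    """
    with open(path, "rb") as f:
        data = f.read()
    fixed = data.replace(b"\r\n", b"\n")
    with open(path, "wb") as f:
        f.write(fixed)
    return data.count(b"\r\n")
```

Running it once on process_text.sh before calling bash should remove the `\r` from the paths, assuming line endings are indeed the cause.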

Solution

If you want to just prepend the first line of the original file to all but the first of the splits, you can do something like:

$ cat > a
h
1
2
3
4
5
6
7
^D
$ split -l 3 a 1
$ ls
1aa 1ab 1ac a
$ mv 1aa 21aa
$ for i in 1*; do head -n1 21aa|cat - $i > 2$i; done
$ for i in 21*; do echo ---- $i; cat $i; done
---- 21aa
h
1
2
---- 21ab
h
3
4
5
---- 21ac
h
6
7

Obviously, the first file will have one data line fewer than the middle parts, because its three lines include the header, and the last part might be shorter too. If that's not a problem, this should work just fine. Of course, if your header has more lines, just change head -n1 to head -nX, X being the number of header lines.
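Since the question prefers Python for the ArcGIS model, the same idea can be sketched there. This is a minimal sketch, not the method used above: the function name split_with_header, the part-naming scheme (name_001.txt, which keeps the source extension) and the manifest layout are assumptions for illustration. Unlike the shell loop, it writes the header into every part, so all parts hold the same number of data lines except possibly the last.

```python
import os


def split_with_header(src_path, lines_per_part=1000000, header_lines=1):
    """Split src_path into parts, repeating the header in every part.

    Returns manifest rows (source, part_file, first_line, last_line),
    where line numbers are 1-based positions in the source file.
    """
    base, ext = os.path.splitext(src_path)
    manifest = []
    with open(src_path, "r", newline="") as src:
        header = [src.readline() for _ in range(header_lines)]
        line_no = header_lines  # lines consumed from the source so far
        part_no = 0
        out = None
        count = 0
        for line in src:
            line_no += 1
            if out is None:
                # Start a new part and copy the header into it.
                part_no += 1
                part_path = "%s_%03d%s" % (base, part_no, ext)
                out = open(part_path, "w", newline="")
                out.writelines(header)
                first = line_no
                count = 0
            out.write(line)
            count += 1
            if count == lines_per_part:
                out.close()
                manifest.append((src_path, part_path, first, line_no))
                out = None
        if out is not None:
            # Close the final, possibly shorter, part.
            out.close()
            manifest.append((src_path, part_path, first, line_no))
    return manifest
```

Because the path is a parameter, nothing is hard-coded to TRC_Longlands, and the returned rows (original file, new file name, source line range) can be written out with csv.writer for the later re-join.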

Hope this helps.
