Split file by XML tag
Question
I have a very large XML file (1.25 GB) that I need to split into smaller files to be able to process them. The file contains linguistic data that is headed and footed by the tags:
<text id="www.example.com">
and
</text>
I would like to split the larger file by these tags. So that, for example,
<text id="www.example.com">
Hello
</text>
<text id="www.example.com">
This is
</text>
<text id="www.example.com">
an example
</text>
Would essentially be three different files: with the beginning and end marked by the "text" tags. For example:
File 1
<text id="www.example.com">
Hello
</text>
File 2
<text id="www.example.com">
This is
</text>
File 3
<text id="www.example.com">
an example
</text>
I suppose this could be done by scripting in Perl, for instance, but I'm wondering if there's any kind of "one stop shop" way to split this file using unix.
I know that the split command can break a large file into smaller files by line count or file size. However, is there a similar command that permits splitting by XML tag?
Thanks in advance for any help!
Recommended answer
The following Perl program, found here: Split one file into multiple files based on delimiter
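#!/usr/bin/perl
use strict;
use warnings;

# Split file.txt into res.0.txt, res.1.txt, ... after each closing </text> line.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
my $cur = 0;
my $out;    # opened lazily so no empty trailing file is created
while (my $line = <$in>) {
    if (!defined $out) {
        open($out, '>', "res.$cur.txt") or die "Cannot open res.$cur.txt: $!";
    }
    print {$out} $line;
    if ($line =~ m{^</text>}) {    # a closing tag ends the current chunk
        close($out);
        undef $out;
        $cur++;
    }
}
close($out) if defined $out;
close($in);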
Also seems to do the trick, with no maximum cap.
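For readers who would rather not use Perl, the same streaming idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name split_by_text_tag, the out_dir and prefix parameters, and the UTF-8 encoding are all assumptions, and like the Perl script it assumes each closing </text> tag sits at the start of its own line.

```python
import re
from pathlib import Path

# Assumption: the closing tag starts its own line (optionally indented).
CLOSE_TAG = re.compile(r"^\s*</text>")


def split_by_text_tag(src, out_dir, prefix="res"):
    """Write each <text>...</text> block of src to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    # Assumption: the input is UTF-8; adjust the encoding for other data.
    with open(src, encoding="utf-8") as infile:
        for line in infile:  # read line by line, so the 1.25 GB file is never loaded whole
            if out is None:
                if not line.strip():
                    continue  # skip blank lines between blocks
                path = out_dir / f"{prefix}.{len(paths)}.txt"
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
            out.write(line)
            if CLOSE_TAG.match(line):
                out.close()  # a closing tag ends the current chunk
                out = None
    if out is not None:
        out.close()  # input ended without a final closing tag
    return paths
```

Calling split_by_text_tag("file.txt", "out") would write out/res.0.txt, out/res.1.txt and so on, one file per block. Output files are opened only when a block actually starts, so no empty trailing file is left behind.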
Cheers.