Split file by XML tag

Question

I have a very large XML file (1.25 GB) that I need to split into smaller files so that I can process them. The file contains linguistic data, with each text headed and footed by the tags:


<text id="www.example.com">

</text>

I would like to split the large file at these tags, so that, for example,


<text id="www.example.com">
Hello
</text>

<text id="www.example.com">
This is
</text>

<text id="www.example.com">
An example
</text>

would essentially become three separate files, each beginning and ending with the "text" tags. For example:

File 1

<text id="www.example.com">
Hello
</text>

File 2

<text id="www.example.com">
This is
</text>

File 3

<text id="www.example.com">
An example
</text>

I suppose this could be done with a script in Perl, for instance, but I'm wondering whether there is any kind of "one stop shop" way to split this file using Unix tools.

I know that the split command can split a large file into smaller files by number of lines or by file size. However, is there a similar command that permits splitting by XML tag?

Thanks in advance for any help!

Recommended answer

The following Perl program was found under "Split one file into multiple files based on delimiter":

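#!/usr/bin/perl
use strict;
use warnings;

# Split file.txt into res.0.txt, res.1.txt, ... after each closing </text> line.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
my $cur = 0;
my $out;
while (my $line = <$in>)
{
    # Open the next output file only when there is a line to write,
    # so no empty file is left behind after the last </text>.
    if (!defined $out)
    {
        open($out, '>', "res.$cur.txt") or die "Cannot open res.$cur.txt: $!";
    }
    print $out $line;
    if ($line =~ /^<\/text>/) # Added \
    {
        close($out);
        undef $out;
        $cur++;
    }
}
close($out) if defined $out;
close($in);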

It also seems to do the trick, with no maximum cap.
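For readers who would rather stay on the command line, the same line-by-line idea can be sketched with awk. This is only an illustration and not part of the original answer: the function name split_by_text_tag, its two arguments, and the output naming scheme are assumptions made for the example. Like the Perl program, it assumes that every closing </text> tag starts its own line.

```shell
# Hypothetical helper: split_by_text_tag INPUT PREFIX
# Writes PREFIX0.txt, PREFIX1.txt, ... and starts a new file after
# every line that begins with </text>.
split_by_text_tag() {
    awk -v prefix="$2" '
        BEGIN { n = 0; out = prefix n ".txt" }
        # Copy the current line into the current output file.
        { print > out }
        # A closing tag ends the chunk: close the file and move to the next name.
        /^<\/text>/ { close(out); n++; out = prefix n ".txt" }
    ' "$1"
}

# Example call (file names are placeholders):
# split_by_text_tag file.txt res.
```

Closing each output file as soon as its chunk ends means only one output file is open at a time, which matters when a 1.25 GB input yields a very large number of chunks. An output file is created only when a line is written to it, so nothing empty is left behind after the last closing tag.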

Cheers.
