How to parse an extremely large xml file in either java or php and insert into a mysql DB

Problem description

I'm trying to parse a massive XML file into my MySQL database. The file is 4.7 GB. I know, it's insane.

The data comes from here: http://www.discogs.com/data/ (the newest album xml is 700mb zipped and 4.7gb unzipped)

I can use either Java or PHP to parse the file and update the database. I assume that Java is the smarter idea.

I need to find a way to parse the xml without filling my 4gb of ram, and load it into the db.

What is the smartest way of doing this? I've heard of SAX parsers; am I thinking in the right direction?

For now, I don't care about downloading the images from those URLs, I just want the data in my database. I have not designed the tables yet, but I'm more interested in the XML side right now.

I used PHP's fread() to read the first 1000 bytes of the file, so at least I can see what it looks like. Here's a sample of the structure of the first album in the file:

<releases>
<release id="1" status="Accepted">
    <images>
        <image height="600" type="primary" uri="http://s.dsimg.com/image/R-1-1193812031.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812031.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812053.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812053.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812072.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812072.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812091.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812091.jpeg" width="600" />
    </images>
    <artists>
        <artist>
            <name>Persuader, The</name>
        </artist>
    </artists>
    <title>Stockholm</title>
    <labels>
        <label catno="SK032" name="Svek" />
    </labels>
    <formats>
        <format name="Vinyl" qty="2">
            <descriptions>
                <description>12"</description>
            </descriptions>
        </format>
    </formats>
    <genres>
        <genre>Electronic</genre>
    </genres>
    <styles>
        <style>Deep House</style>
    </styles>
    <country>Sweden</country>
    <released>1999-03-00</released>
    <notes>Recorded at the Globe studio in Stockholm. The titles are the names of Stockholm's districts.</notes>
    <master_id>5427</master_id>
    <tracklist>
        <track>
            <position>A</position>
            <title>Östermalm</title>
            <duration>4:45</duration>
        </track>
        <track>
            <position>B1</position>
            <title>Vasastaden</title>
            <duration>6:11</duration>
        </track>
        <track>
            <position>B2</position>
            <title>Kungsholmen</title>
            <duration>2:49</duration>
        </track>
        <track>
            <position>C1</position>
            <title>Södermalm</title>
            <duration>5:38</duration>
        </track>
        <track>
            <position>C2</position>
            <title>Norrmalm</title>
            <duration>4:52</duration>
        </track>
        <track>
            <position>D</position>
            <title>Gamla Stan</title>
            <duration>5:16</duration>
        </track>
    </tracklist>
</release>

Thanks.

Solution

I faced a similar problem some time ago. Here is part of a script which imported a file of around 28MB without reading the whole thing into memory. It should probably work :). It reads the file one XML node at a time, so only a small part of the XML stays in memory. The script will need a few modifications to fit your needs.

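// Defined elsewhere in the full script (not shown in this excerpt):
// $columns (DB column name => XML child element name), $DateColumns,
// $SplitDateInsert, $tableName and $db.
$reader = new XMLReader();
$reader->open('<path_to_large_xml_file>');
while ($reader->read()) {
    switch ($reader->nodeType) {
        case (XMLReader::ELEMENT):
        if ($reader->localName == "Table") {

            // Expand only the current <Table> element into a small DOM tree,
            // then wrap it in SimpleXML for easy child access.
            $node = $reader->expand();
            $dom = new DOMDocument();
            $n = $dom->importNode($node,true);
            $dom->appendChild($n);
            $sxe = simplexml_import_dom($n);

            $Data = array();
            $DataColumns = array();

            foreach ($columns as $key => $column)
            {

                if (in_array($key,$DateColumns))
                {
                    // Source dates are expected as dd/mm/yyyy; mktime() takes month, day, year.
                    $DateArray = explode('/',substr(trim($sxe->$column),0,10));
                    $ValueColumn = date('Y-m-d H:i:s',mktime(0,0,0,$DateArray[1],$DateArray[0],$DateArray[2]));
                    $Data[] = '\''.$ValueColumn.'\'';
                    $DataColumns[] = $key;

                    if ($SplitDateInsert == 'enabled')
                    {
                        // Optionally store year, month and day in separate columns as well.
                        $Data[] = '\''.$DateArray[2].'\'';
                        $Data[] = '\''.$DateArray[1].'\'';
                        $Data[] = '\''.$DateArray[0].'\'';

                        $DataColumns[] = $key.'_year';
                        $DataColumns[] = $key.'_month';
                        $DataColumns[] = $key.'_day';
                    }

                } else {
                    // Note: addslashes() is not a safe SQL escape; prefer the
                    // database driver's own escaping or prepared statements.
                    $ValueColumn = addslashes(trim($sxe->$column));
                    $Data[] = '\''.$ValueColumn.'\'';
                    $DataColumns[] = $key;
                }

            }
            // One INSERT per <Table> element.
            $SQL = "INSERT INTO {$tableName} (".implode(',',$DataColumns).") VALUES (".implode(',',$Data).")";
            $db->query($SQL);

        } // END IF table
        break;
    }
}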
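
Since the question also asks about Java and SAX: the same one-node-at-a-time idea can be sketched in Java with the JDK's built-in StAX pull parser (javax.xml.stream), a close relative of SAX. This is only a minimal sketch, not a tested importer for the Discogs dump. The Release class, the streamReleases method and the choice of fields are made up for illustration, and the element names are taken from the sample in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ReleaseStreamDemo {

    /** Hypothetical holder for a few fields of one <release>; not a real schema. */
    static final class Release {
        final String id, title, country, released;
        Release(String id, String title, String country, String released) {
            this.id = id; this.title = title; this.country = country; this.released = released;
        }
    }

    /**
     * Pulls events from the stream and hands each <release> to the sink when its
     * end tag is reached, so only one release is held in memory at a time.
     * Returns the number of releases seen.
     */
    static int streamReleases(InputStream in, Consumer<Release> sink) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The sample needs no DTDs or external entities, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);

        StringBuilder text = new StringBuilder();
        String id = null, title = null, country = null, released = null;
        int depth = 0, releaseDepth = -1, count = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    text.setLength(0);
                    if (releaseDepth < 0 && "release".equals(reader.getLocalName())) {
                        releaseDepth = depth;
                        id = reader.getAttributeValue(null, "id");
                        title = country = released = null;
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (releaseDepth > 0 && depth == releaseDepth + 1) {
                        // Direct children of <release> only; <title> inside <track> is deeper.
                        String value = text.toString().trim();
                        switch (reader.getLocalName()) {
                            case "title": title = value; break;
                            case "country": country = value; break;
                            case "released": released = value; break;
                            default: break;
                        }
                    } else if (depth == releaseDepth) {
                        sink.accept(new Release(id, title, country, released));
                        count++;
                        releaseDepth = -1;
                    }
                    depth--;
                }
            }
        } finally {
            reader.close(); // does not close the underlying InputStream
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-in for the real file, shaped like the sample above.
        String xml = "<releases><release id=\"1\" status=\"Accepted\">"
                + "<title>Stockholm</title><country>Sweden</country>"
                + "<released>1999-03-00</released>"
                + "<tracklist><track><title>Norrmalm</title></track></tracklist>"
                + "</release></releases>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        List<Release> seen = new ArrayList<>();
        int n = streamReleases(in, seen::add);
        Release first = seen.get(0);
        System.out.println("Parsed " + n + " release(s); first: " + first.title
                + " (" + first.country + ", " + first.released + ")");
    }
}
```

The sink is where the database work would go, for example a JDBC PreparedStatement filled per release with addBatch() and flushed every few hundred rows with executeBatch(), so rows are sent in groups instead of one query per release. For the real file, pass a BufferedInputStream over a FileInputStream instead of the in-memory string.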
