Parsing/scanning through a 17 GB XML file

Problem description

I am trying to parse the Stack Overflow dump file (Posts.xml, 17 GB). It is of the form:

<posts>
<row Id="15228715" PostTypeId="1" />
.
<row Id="15228716" PostTypeId="2" ParentId="1600647" LastActivityDate="2013-03-05T16:13:24.897"/>
</posts>

I have to group each question with its answers: find a question (PostTypeId=1), find its answers via the ParentId attribute of other rows, and store them in a DB.
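
For illustration, one possible target layout is two tables linked by a foreign key from the answers to their questions. This is a minimal sketch assuming MySQL and PDO; the table and column names are hypothetical, not part of the dump:

// Hypothetical schema: questions and answers linked via parent_id.
$pdo = new PDO('mysql:host=localhost;dbname=so_dump', 'user', 'pass');
$pdo->exec('CREATE TABLE questions (
    id   INT PRIMARY KEY,
    body TEXT
)');
$pdo->exec('CREATE TABLE answers (
    id        INT PRIMARY KEY,
    parent_id INT,
    body      TEXT,
    FOREIGN KEY (parent_id) REFERENCES questions (id)
)');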

I tried doing this using QueryPath (DOM), but it kept exiting with status 139 (a segmentation fault). My guess is that because of the large size of the file my PC couldn't handle it, even with a huge swap.

I considered XMLReader, but as I see it, with XMLReader the program would have to read through the file many times (find a question, look for its answers, repeat), and hence it is not viable. Am I wrong?

Is there any other method/way?

Help!

This is a one-time parse.

Answer

"I considered XMLReader, but as I see it, with XMLReader the program would have to read through the file many times (find a question, look for its answers, repeat), and hence it is not viable. Am I wrong?"

Yes, you are wrong. With XMLReader you decide yourself how often you traverse the file (normally you do it once). In your case I see no reason why you should not be able to insert 1:1 on each <row> element: you can decide from its attributes which database table you would like to insert into.

I normally suggest a set of iterators that make traversing with XMLReader easier. It's called XMLReaderIterator and allows you to foreach over the XMLReader, so that the code is often easier to read and write:

require 'xmlreader-iterators.php'; // XMLReaderIterator library (adjust path to your install)

$xmlFile = 'Posts.xml';

$reader = new XMLReader();
$reader->open($xmlFile);

/* @var $posts XMLReaderNode[] - iterate over all <posts><row> elements */
$posts = new XMLElementIterator($reader, 'row');
foreach ($posts as $post)
{
    // Answers carry a ParentId attribute, questions do not.
    $isAnswerInsteadOfQuestion = (bool)$post->getAttribute('ParentId');

    // $importerQuestions / $importerAnswers: see the sketch below.
    $importer = $isAnswerInsteadOfQuestion
                ? $importerAnswers
                : $importerQuestions;

    $importer->importRowNode($post);
}
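
The $importerQuestions and $importerAnswers objects are not defined in the snippet above; a minimal sketch of what such an importer could look like, assuming PDO and the hypothetical questions/answers tables from earlier (the Body attribute is taken from the dump format):

// Hypothetical importer: one prepared INSERT per target table.
class RowImporter
{
    private $insert;

    public function __construct(PDO $pdo, $table)
    {
        // Prepare the statement once and re-use it for every row.
        $this->insert = $pdo->prepare(
            "INSERT INTO {$table} (id, parent_id, body) VALUES (?, ?, ?)"
        );
    }

    public function importRowNode($row)
    {
        $this->insert->execute(array(
            $row->getAttribute('Id'),
            $row->getAttribute('ParentId'), // NULL for questions
            $row->getAttribute('Body'),
        ));
    }
}

$pdo = new PDO('mysql:host=localhost;dbname=so_dump', 'user', 'pass');
$importerQuestions = new RowImporter($pdo, 'questions');
$importerAnswers   = new RowImporter($pdo, 'answers');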

If you are concerned about ordering (e.g. you fear that some answers' parent questions are not available yet at the time the answers arrive), I would take care of that inside the importer layer, not inside the traversal.

Depending on whether that happens often, very often, or (almost) never, I would use a different strategy. E.g. if it never happens, I would insert directly into database tables with foreign key constraints activated. If it happens often, I would wrap the whole import in a single transaction in which the key constraints are lifted and re-activated at the end.
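
A sketch of that second strategy, assuming PDO and MySQL (SET FOREIGN_KEY_CHECKS is MySQL-specific, and runImport() is a hypothetical stand-in for the XMLReader loop shown above):

$pdo = new PDO('mysql:host=localhost;dbname=so_dump', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Lift the key checks for the duration of the bulk import (MySQL-specific).
$pdo->exec('SET FOREIGN_KEY_CHECKS = 0');
$pdo->beginTransaction();

try {
    runImport($pdo); // hypothetical: the XMLReader loop shown above
    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
} finally {
    // Re-activate the constraints at the end.
    $pdo->exec('SET FOREIGN_KEY_CHECKS = 1');
}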
