如何改进XML导入到mongodb? [英] How to improve a XML import into mongodb?

查看:123
本文介绍了如何改进XML导入到mongodb?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些大型XML文件(每个5GB),我正在导入mongodb数据库。我正在使用Expat来解析文档,进行一些数据操作(删除一些字段,单位转换等)然后插入数据库。我的脚本基于以下内容: https://github.com/bgianfo / stackoverflow-mongodb / blob / master / so-import

I have some large XML files (5GB ~ each) that I'm importing to a mongodb database. I'm using Expat to parse the documents, doing some data manipulation (deleting some fields, unit conversion, etc) and then inserting into the database. My script is based on this one: https://github.com/bgianfo/stackoverflow-mongodb/blob/master/so-import

我的问题是:有没有办法通过批量插入来改善这一点?在插入之前将这些文档存储在数组上是个好主意?插入前我应该存储多少个文件呢?将jsons写入文件然后使用mongoimport会更快吗?

My question is: is there a way to improve this with a batch insert ? Storing these documents on an array before inserting would be a good idea ? How many documents should I store before inserting, then ? Writing the jsons into a file and then using mongoimport would be faster ?

我感谢任何建议。

推荐答案


在插入之前将这些文档存储在数组上是个好主意?

Storing these documents on an array before inserting would be a good idea?

是的,这很有可能。它减少了到数据库的往返次数。您应该监视您的系统,由于IO等待(即开销和线程同步比实际数据传输花费的时间多得多),插入时可能会空闲很多。

Yes, that's very likely. It reduces the number of round-trips to the database. You should monitor your system, it's probably idling a lot when inserting because of IO wait (that is, the overhead and thread synchronization is taking a lot more time than the actual data transfer).


插入之前我应该​​存储多少个文件呢?

How many documents should I store before inserting, then?

这很难说,因为它取决于很多因素。经验法则:1,000 - 10,000。你将不得不尝试一点。在旧版本的mongodb中,整个批次不得大于16MB的文档大小限制。

That's hard to say, because it depends on so many factors. Rule of thumb: 1,000 - 10,000. You will have to experiment a little. In older versions of mongodb, the entire batch must not be larger than the document size limit of 16MB.


将jsons写入文件中那么使用mongoimport会更快吗?

Writing the jsons into a file and then using mongoimport would be faster?

不,除非你的代码有缺陷。这意味着你必须复制数据两次,整个操作应该是IO绑定。

No, unless your code has a flaw. That would mean you have to copy the data twice and the entire operation should be IO bound.

另外,最好先添加所有文件,然后添加任何索引,而不是相反(因为每次插入都必须修复索引)

Also, it's a good idea to add all documents first, then add any indexes, not the other way around (because then the index will have to be repaired with every insert)

这篇关于如何改进XML导入到mongodb?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆