Converting large XML file to relational database

Question

I'm trying to figure out the best way to accomplish the following:


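  1. Download a large (1GB) XML file on a daily basis from a third-party website
  2. Convert that XML file to a relational database on my server
  3. Add functionality to search the database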

For the first part, is this something that would need to be done manually, or could it be accomplished with a cron?

Most of the questions and answers related to XML and relational databases refer to Python or PHP. Could this be done with javascript/nodejs as well?

If this question is better suited for a different StackExchange forum, please let me know and I will move it there instead.

Below is a sample of the xml code:

<case-file>
  <serial-number>123456789</serial-number>
  <transaction-date>20150101</transaction-date>
  <case-file-header>
    <filing-date>20140101</filing-date>
  </case-file-header>
  <case-file-statements>
    <case-file-statement>
      <code>AQ123</code>
      <text>Case file statement text</text>
    </case-file-statement>
    <case-file-statement>
      <code>BC345</code>
      <text>Case file statement text</text>
    </case-file-statement>
  </case-file-statements>
  <classifications>
    <classification>
      <international-code-total-no>1</international-code-total-no>
      <primary-code>025</primary-code>
    </classification>
  </classifications>
</case-file>

Here is some more information about how these files will be used:

All XML files will be in the same format. There are probably a few dozen elements within each record. The files are updated by a third party on a daily basis (and are available as zipped files on the third-party website). Each day's file represents new case files as well as updated case files.

The goal is to allow a user to search for information and organize those search results on the page (or in a generated pdf/excel file). For example, a user might want to see all case files that include a particular word within the <text> element. Or a user might want to see all case files that include primary code 025 (<primary-code> element) and that were filed after a particular date (<filing-date> element).
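
To make the two example searches concrete, here is a rough sketch of how they might be turned into one parameterized SQL query. Everything about the schema is an assumption made for illustration: the tables `case_file`, `case_file_statement` and `classification`, their columns, and the helper name `buildCaseFileSearch` do not come from the data provider. The `$1`, `$2` placeholders and `ILIKE` are PostgreSQL syntax.

```javascript
// Hypothetical helper: build a parameterized query for the searches described
// above. Table and column names are assumptions, not part of the source data.
function buildCaseFileSearch(filters) {
  const where = [];
  const values = [];
  // Store a value and return its PostgreSQL-style placeholder ($1, $2, ...).
  const param = (value) => {
    values.push(value);
    return '$' + values.length;
  };

  if (filters.primaryCode) {
    where.push(
      'EXISTS (SELECT 1 FROM classification c ' +
      'WHERE c.serial_number = cf.serial_number ' +
      'AND c.primary_code = ' + param(filters.primaryCode) + ')'
    );
  }
  if (filters.filedAfter) {
    where.push('cf.filing_date > ' + param(filters.filedAfter));
  }
  if (filters.textContains) {
    // Case-insensitive substring match on the statement text.
    where.push(
      'EXISTS (SELECT 1 FROM case_file_statement s ' +
      'WHERE s.serial_number = cf.serial_number ' +
      'AND s.statement_text ILIKE ' + param('%' + filters.textContains + '%') + ')'
    );
  }

  const text =
    'SELECT cf.serial_number, cf.filing_date FROM case_file cf' +
    (where.length ? ' WHERE ' + where.join(' AND ') : '') +
    ' ORDER BY cf.filing_date';
  return { text, values };
}

// Case files with primary code 025 that were filed after 1 January 2014.
const query = buildCaseFileSearch({ primaryCode: '025', filedAfter: '2014-01-01' });
console.log(query.text);
console.log(query.values);
```

Keeping user input in `values` rather than concatenating it into the SQL string avoids SQL injection. This sketch does not escape `%` or `_` inside the search word, so those characters would act as wildcards.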

The only data entered into the database will be from the XML files--users won't be adding any of their own information to the database.

Recommended answer

All steps could certainly be accomplished using node.js. There are modules available that will help you with each of these tasks:



    • node-cron: lets you easily set up cron tasks in your Node.js program. Another option is to set up a cron job at the operating-system level (plenty of resources are available for your favourite OS).
    • download: a module for easily downloading files from a URL.
    • xml-stream: lets you stream a file and register handlers that fire when the parser encounters certain XML elements. I have successfully used this module to parse KML files (granted, they were significantly smaller than your files).
    • node-postgres: a Node.js client for PostgreSQL (I am sure there are clients for many other common RDBMSs; PostgreSQL is the only one I have used so far).
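
As a dependency-free illustration of the download step, the sketch below streams a remote file to disk using Node's built-in `https` and `stream` modules instead of the `download` module listed above. The function name `downloadToFile` is made up for this example, and any URL or destination path you pass in is a placeholder. It assumes the server answers with a plain 200 response (no redirects) and it leaves the file zipped, exactly as delivered.

```javascript
const fs = require('fs');
const https = require('https');
const { pipeline } = require('stream');

// Hypothetical helper: stream `url` to `destPath` without buffering the
// whole file in memory, which matters for a download of around 1GB.
function downloadToFile(url, destPath) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected HTTP status ' + res.statusCode));
        return;
      }
      // pipeline() forwards errors from either stream to the callback.
      pipeline(res, fs.createWriteStream(destPath), (err) => {
        if (err) reject(err);
        else resolve(destPath);
      });
    }).on('error', reject);
  });
}
```

A daily run could then be triggered either from an operating-system cron job that starts the script, or from inside the program with node-cron, for example `cron.schedule('0 3 * * *', ...)` for 03:00 every day.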

Most of these modules have pretty great examples that will get you started. Here's how you would probably set up the XML streaming part:

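var fs = require('fs');
var XmlStream = require('xml-stream');

var xml = fs.createReadStream('path/to/file/on/disk'); // or stream directly from your online source
var xmlStream = new XmlStream(xml);

// Gather repeated child elements into arrays instead of keeping only one of them
xmlStream.collect('case-file-statement');
xmlStream.collect('classification');

// Fires once per record, after the closing </case-file> tag has been read
xmlStream.on('endElement: case-file', function(element) {
    // create and execute SQL query/queries here for this element
});
xmlStream.on('end', function() {
    // done reading elements
    // do further processing / query database, etc.
});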
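
What happens inside the `case-file` handler depends on your schema. As a rough sketch, the code below flattens one parsed record into rows for three tables. It assumes the parser hands over a plain nested object keyed by tag name, with text-only elements as strings. The table and column names, the helper names `asArray`, `toIsoDate` and `caseFileToRows`, and the `UPSERT_CASE_FILE` statement are all illustrative assumptions.

```javascript
// Wrap a value in an array so one-or-many children are handled uniformly.
function asArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Convert 'YYYYMMDD' (the format used in the sample) to 'YYYY-MM-DD'.
function toIsoDate(value) {
  const s = value == null ? '' : String(value);
  if (!/^\d{8}$/.test(s)) return null;
  return s.slice(0, 4) + '-' + s.slice(4, 6) + '-' + s.slice(6, 8);
}

// Hypothetical mapping from one parsed <case-file> object to rows for three
// assumed tables: case_file, case_file_statement and classification.
function caseFileToRows(el) {
  const serial = el['serial-number'];
  const header = el['case-file-header'] || {};
  const statements = asArray((el['case-file-statements'] || {})['case-file-statement']);
  const classes = asArray((el['classifications'] || {})['classification']);
  return {
    caseFile: {
      serial_number: serial,
      transaction_date: toIsoDate(el['transaction-date']),
      filing_date: toIsoDate(header['filing-date']),
    },
    statements: statements.map((s) => ({
      serial_number: serial,
      code: s.code,
      statement_text: s.text,
    })),
    classifications: classes.map((c) => ({
      serial_number: serial,
      primary_code: c['primary-code'],
    })),
  };
}

// Parameterized upsert in PostgreSQL syntax. It needs a unique constraint on
// serial_number; table and column names are assumptions.
const UPSERT_CASE_FILE =
  'INSERT INTO case_file (serial_number, transaction_date, filing_date) ' +
  'VALUES ($1, $2, $3) ' +
  'ON CONFLICT (serial_number) DO UPDATE SET ' +
  'transaction_date = EXCLUDED.transaction_date, ' +
  'filing_date = EXCLUDED.filing_date';

// Example input shaped like the sample record in the question.
const sampleElement = {
  'serial-number': '123456789',
  'transaction-date': '20150101',
  'case-file-header': { 'filing-date': '20140101' },
  'case-file-statements': {
    'case-file-statement': [
      { code: 'AQ123', text: 'Case file statement text' },
      { code: 'BC345', text: 'Case file statement text' },
    ],
  },
  classifications: { classification: { 'primary-code': '025' } },
};
const rows = caseFileToRows(sampleElement);
console.log(rows.caseFile, rows.statements.length, rows.classifications.length);
```

An upsert is used because each day's file contains updated case files as well as new ones, so a plain INSERT would fail on serial numbers that already exist. With node-postgres the statement would be run as `client.query(UPSERT_CASE_FILE, [row.serial_number, row.transaction_date, row.filing_date])`. For an updated record, the simplest way to refresh the child rows is probably to delete its existing statements and classifications and insert the new ones in the same transaction.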
