NodeJS parseStream, defining a start and end point for a chunk


Question

I'm confused by Node's filesystem parsing. Here's my code:

var fs = require('fs'),
    xml2js = require('xml2js');

var parser = new xml2js.Parser();

var stream = fs.createReadStream('xml/bigXML.xml');
stream.setEncoding('utf8');

stream.on('data', function(chunk){ 

    parser.parseString(chunk, function (err, result) {
        console.dir(result);
        console.log('Done');
    });
});


stream.on('end', function(chunk){
    // the file has been read completely; do something...
    console.log("IT'S OVER")
});

This causes... nothing to happen. There is no output from xml2js/the parser at all. When I try console.log(chunk), the chunks don't seem to be split on any meaningful boundary, other than perhaps byte size. The output for one 'chunk' is:

<?xml version="1.0" encoding="UTF-8"?>
    <merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="merchandiser.xsd">
    <header><merchantId>1237</merchantId><merchantName>NORDSTROM.com</merchantName><createdOn>12/13/2013 23:50:57</createdOn></header>
    <product product_id="52863929">// product info</product>
    <product product_id="26537849">// product info</product>
    <product product_id="25535647">// product info</product>

This chunk has lots and lots of <product> entries from the XML inside it. The chunk ends somewhere in the middle of a <product> entry, and the next chunk begins where this one left off.
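The behaviour described above can be reproduced with a small sketch. It assumes the cut points come from the stream's read size (the highWaterMark option, 64 KiB by default for fs.createReadStream) rather than from the XML structure. The collectChunks helper, the temporary file name, and the tiny XML string are made up for illustration.

```javascript
var fs = require('fs')
var os = require('os')
var path = require('path')

// Hypothetical helper: read a file with a given chunk size and collect the chunks.
function collectChunks(file, chunkSize, callback) {
    var chunks = []
    var stream = fs.createReadStream(file, { highWaterMark: chunkSize })
    stream.setEncoding('utf8')
    stream.on('data', function (chunk) { chunks.push(chunk) })
    stream.on('error', callback)
    stream.on('end', function () { callback(null, chunks) })
}

// Demo input (invented for this sketch), read 24 bytes at a time.
var xml = '<m><product product_id="1">a</product><product product_id="2">b</product></m>'
var file = path.join(os.tmpdir(), 'chunk-demo.xml')
fs.writeFileSync(file, xml)

collectChunks(file, 24, function (err, chunks) {
    if (err) return console.log(err.message)
    // Each chunk is cut by size, so tags are split wherever the limit falls.
    chunks.forEach(function (chunk, i) {
        console.log(i + ': ' + JSON.stringify(chunk))
    })
})
```

Joined back together the chunks reproduce the file exactly, but no single chunk is guaranteed to be well-formed XML, which is why parsing each chunk on its own fails.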

The main question is: how do I get createReadStream to output chunks starting at <product and ending at </product>?

Edit: for the purposes of getting the proper output, here's what the XML looks like from the beginning to the end of the first <product>:

<?xml version="1.0" encoding="UTF-8" ?>
<merchandiser xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="merchandiser.xsd">
  <header>
    <merchantId>1237</merchantId>
    <merchantName>NORDSTROM.com</merchantName>
    <createdOn>12/13/2013 23:50:57</createdOn>
  </header>
  <product product_id="52863929" name="Teva 'Psyclone' Print Sandal (Baby, Walker &amp; Toddler) Camo/ Dark Olive 6 M" sku_number="52863929" manufacturer_name="Teva" part_number="1001701">
    <category>
      <primary>Toddler Unisex</primary>
      <secondary>Shoes~~Sandals/Slides</secondary>
    </category>
    <URL>
      <product>http://click.linksynergy.com/link?id=LUyP0GcLCGc&amp;offerid=276223.52863929&amp;type=15&amp;murl=http%3A%2F%2Fshop.nordstrom.com%2FS%2F3297406%3Fcm_cat%3Ddatafeed%26cm_pla%3Dshoes%3Asandals%252fslides%26cm_ite%3Dteva_%2527psyclone%2527_print_sandal_%2528baby%252c_walker_%2526_toddler%2529%3A503158_1%26cm_ven%3DLinkshare</product>
      <productImage>http://content.nordstrom.com/imagegallery/store/product/large/0/_6880020.jpg</productImage>
      <buy></buy>
    </URL>
    <description>
      <short>Rugged construction and stylish good looks define a sporty sandal, with the added convenience and security of hook-and-loop closures across the toe and at the instep.Rugged construction and stylish good looks define a sporty sandal, with the added
        convenience and security of h...</short>
      <long>Rugged construction and stylish good looks define a sporty sandal, with the added convenience and security of hook-and-loop closures across the toe and at the instep.Rugged construction and stylish good looks define a sporty sandal, with the added
        convenience and security of hook-and-loop closures across the toe and at the instep. Color(s): camo/ dark olive, daisy blue. Brand: Teva. Style Name: Teva 'Psyclone' Print Sandal (Baby, Walker &amp; Toddler). Style Number: 503158_1.</long>
    </description>
    <discount currency="USD">
      <amount></amount>
      <type>amount</type>
    </discount>
    <price currency="USD">
      <sale begin_date="" end_date="">24.95</sale>
      <retail>24.95</retail>
    </price>
    <brand>Teva</brand>
    <shipping>
      <cost currency="USD">
        <amount>0.00</amount>
        <currency>USD</currency>
      </cost>
      <information></information>
      <availability>Y</availability>
    </shipping>
    <keywords></keywords>
    <upc>737872649135</upc>
    <m1>503158_1.</m1>
    <pixel>http://ad.linksynergy.com/fs-bin/show?id=LUyP0GcLCGc&amp;bids=276223.52863929&amp;type=15&amp;subid=0</pixel>
    <attributeClass class_id="60">
      <Misc></Misc>
      <Product_Type>Shoes</Product_Type>
      <Size>6 M</Size>
      <Material></Material>
      <Color>CAMO/ DARK OLIVE</Color>
      <Gender>Unisex</Gender>
      <Style></Style>
      <Age></Age>
    </attributeClass>
  </product>


Recommended answer

You have two ways to tackle your issue.

As stated by damphat, xml2js needs the full XML content before it can parse the data. But you have a file stream, which delivers the data chunk by chunk. The first solution is to collect this stream into one big Buffer and then send it to xml2js. For this purpose, you can use the stream-to package (npm i stream-to), which converts the file stream into an array of buffers that we then concatenate into a single buffer using Buffer.concat, like this:

var fs = require('fs')
var streamTo = require('stream-to')
var xml2js = require('xml2js')

var file = fs.createReadStream('input.xml')

streamTo.array(file, function (err, arr) {
    if (err) return console.log(err.message)

    var content = Buffer.concat(arr)
    var parser = new xml2js.Parser()
    parser.parseString(content, function (err, res) {
        if (err) return console.log(err.message)
        console.log(res.merchandiser.product)
    })
})
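The same buffering can probably be done without an extra package, using only the stream's own data and end events. The following is a minimal sketch; readAll is a hypothetical helper name, and the in-memory demo stream stands in for fs.createReadStream('input.xml').

```javascript
var Readable = require('stream').Readable

// Hypothetical helper: gather every chunk, then hand back one Buffer.
function readAll(stream, callback) {
    var chunks = []
    stream.on('data', function (chunk) {
        // A chunk is a Buffer unless an encoding was set on the stream.
        chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
    })
    stream.on('error', callback)
    stream.on('end', function () {
        callback(null, Buffer.concat(chunks))
    })
}

// Demo: two chunks that split the XML in the middle of a closing tag.
var demo = Readable.from([Buffer.from('<a><b>1</'), Buffer.from('b></a>')])
readAll(demo, function (err, content) {
    if (err) return console.log(err.message)
    console.log(content.toString('utf8')) // prints <a><b>1</b></a>
})
```

The resulting Buffer can then be passed to parser.parseString exactly as in the stream-to version.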

Buffering works quite well, but since it needs to hold the full file in memory, it won't work if your input files are really big. To handle really big files, you need a streaming XML parser, such as sax. However, sax doesn't create JavaScript objects; it is an EventEmitter, and it is a bit harder to use since you have to handle all the relevant events to build your objects on the fly.

You can use, for instance, the SaXPath library, which supports a small subset of the XPath syntax. This library emits a match event every time it matches the XPath pattern. Here's an example:

var saxpath = require('saxpath')
var fs = require('fs')
var sax = require('sax')

var saxParser = sax.createStream(true)
var streamer = new saxpath.SaXPath(saxParser, '/merchandiser/product')

streamer.on('match', function(xml) {
    console.log(xml);
});

fs.createReadStream('input.xml').pipe(saxParser)

You then have two options:

  1. Since you now have XML that contains only one product at a time, you can use xml2js to parse a single product at a time.
  2. SaXPath supports multiple recorders: the default recorder listens to sax events and re-creates the corresponding XML (which is what makes the first option possible), but you can roll your own recorder that listens to sax events and creates JavaScript objects on the fly.
