How to extract/parse HTML using Microdata


Problem description





I am pretty new to Microdata.

I have an HTML string with Microdata. I am trying to figure out whether it's possible to extract the required information dynamically from the Microdata with JS or jQuery. Has anyone done this before?

Example HTML string (I am trying to get the 'content' value of the itemprop 'ratingValue' for the item whose 'name' itemprop is 'Blendmagic'):

<html>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">Blendmagic</span>
        <span itemprop="price">$19.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">testMagic</span>
        <span itemprop="price">$10.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
</html>

Solution

Try starting at each root itemscope node and filtering its descendant elements that have itemprop attributes; return an object result containing an array items that holds the Microdata items.

This solution is based on the algorithm found in the Microdata specification:

7 Converting HTML to other formats

7.1 JSON

Given a list of nodes nodes in a Document, a user agent must run the following algorithm to extract the microdata from those nodes into a JSON form:

1. Let result be an empty object.

2. Let items be an empty array.

3. For each node in nodes, check if the element is a top-level microdata item, and if it is then get the object for that element and add it to items.

4. Add an entry to result called "items" whose value is the array items.

5. Return the result of serializing result to JSON in the shortest possible way (meaning no whitespace between tokens, no unnecessary zero digits in numbers, and only using Unicode escapes in strings for characters that do not have a dedicated escape sequence), and with a lowercase "e" used, when appropriate, in the representation of any numbers. [JSON]

This algorithm returns an object with a single property that is an array, instead of just returning an array, so that it is possible to extend the algorithm in the future if necessary.
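
The top-level steps above can be sketched without a browser. This is only an illustration under stated assumptions: the `nodes` array uses hypothetical plain objects in place of DOM elements, `isTopLevelItem` and `extractJson` are made-up names, and `getObject` is a placeholder for the "get the object" steps rather than a real implementation of them.

```javascript
// Hypothetical stand-ins for DOM elements: only the attributes that decide
// whether a node is a top-level microdata item are modeled here.
const nodes = [
  { tag: "div", itemscope: true, itemprop: null, itemtype: "http://schema.org/Offer" },
  { tag: "span", itemscope: false, itemprop: "name" },
  { tag: "div", itemscope: true, itemprop: "reviews", itemtype: "http://schema.org/AggregateRating" },
];

// A top-level item has itemscope but is not itself a property of another item.
function isTopLevelItem(node) {
  return node.itemscope === true && node.itemprop == null;
}

// Placeholder for the "get the object" steps; it only records the item type.
function getObject(node) {
  return { type: [node.itemtype], properties: {} };
}

function extractJson(nodeList) {
  const result = {};
  const items = [];
  for (const node of nodeList) {
    if (isTopLevelItem(node)) items.push(getObject(node));
  }
  result.items = items;
  // Without an indent argument, JSON.stringify emits no whitespace between tokens.
  return JSON.stringify(result);
}

const json = extractJson(nodes);
console.log(json);
// prints {"items":[{"type":["http://schema.org/Offer"],"properties":{}}]}
```

Only the first node is reported: the nested AggregateRating item carries an itemprop attribute, so it would appear inside its parent's properties rather than at the top level.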

When the user agent is to get the object for an item item, optionally with a list of elements memory, it must run the following substeps:

1. Let result be an empty object.

2. If no memory was passed to the algorithm, let memory be an empty list.

3. Add item to memory.

4. If the item has any item types, add an entry to result called "type" whose value is an array listing the item types of item, in the order they were specified on the itemtype attribute.

5. If the item has a global identifier, add an entry to result called "id" whose value is the global identifier of item.

6. Let properties be an empty object.

7. For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of an item, run the following substeps:

   1. Let value be the property value of element.

   2. If value is an item, then: If value is in memory, then let value be the string "ERROR". Otherwise, get the object for value, passing a copy of memory, and then replace value with the object returned from those steps.

   3. For each name name in element's property names, run the following substeps:

      1. If there is no entry named name in properties, then add an entry named name to properties whose value is an empty array.

      2. Append value to the entry named name in properties.

8. Add an entry to result called "properties" whose value is the object properties.

9. Return result.
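
The role of memory in the substeps above is easiest to see on a small in-memory model. The following is a minimal sketch, not a DOM implementation: the item shape (`types`, `id`, `props` with `names` and `value`) is a hypothetical model invented for this example, and the `itemReviewed` back-reference is added only to create a loop.

```javascript
// Hypothetical in-memory model of an item (not a DOM API):
//   { types: string[], id?: string, props: [{ names: string[], value: string | item }] }
function getObject(item, memory = []) {
  const result = {};
  memory.push(item); // "Add item to memory."
  if (item.types && item.types.length > 0) result.type = item.types.slice();
  if (item.id) result.id = item.id;

  const properties = {};
  for (const element of item.props) {
    let value = element.value;
    if (typeof value === "object" && value !== null) {
      // A nested item that is already in memory would loop forever,
      // so it is replaced by the string "ERROR"; otherwise recurse with a copy.
      value = memory.includes(value) ? "ERROR" : getObject(value, memory.slice());
    }
    for (const name of element.names) {
      if (!Object.prototype.hasOwnProperty.call(properties, name)) {
        properties[name] = [];
      }
      properties[name].push(value);
    }
  }
  result.properties = properties;
  return result;
}

// Two items modeled on the example markup.
const rating = {
  types: ["http://schema.org/AggregateRating"],
  props: [{ names: ["ratingValue"], value: "4" }],
};
const offer = {
  types: ["http://schema.org/Offer"],
  props: [
    { names: ["name"], value: "Blendmagic" },
    { names: ["reviews"], value: rating },
  ],
};
// Artificial loop, added only for illustration: the rating points back at the offer.
rating.props.push({ names: ["itemReviewed"], value: offer });

const obj = getObject(offer);
console.log(JSON.stringify(obj, null, 2));
```

Passing a copy of memory on each recursive call means an item is only flagged when it is its own ancestor; the same item may still appear in two sibling properties without being reported as an error.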

var result = {};
var items = [];

// Read the value of a property element: `content` for <meta>,
// text for ordinary elements, `src` for media elements.
function getValue(prop) {
  return prop.content || prop.textContent.trim() || prop.src;
}

// Build the object for one item, collecting only the itemprop elements
// that belong directly to `scope` (not those owned by a nested itemscope).
function getItem(scope) {
  var item = {
    "type": [scope.getAttribute("itemtype")],
    "properties": {}
  };
  scope.querySelectorAll("[itemprop]").forEach(function(prop) {
    if (prop.parentElement.closest("[itemscope]") !== scope) return;
    var name = prop.getAttribute("itemprop");
    // A property that is itself an item is converted recursively.
    var value = prop.matches("[itemscope]") ? getItem(prop) : getValue(prop);
    (item.properties[name] = item.properties[name] || []).push(value);
  });
  return item;
}

// Top-level items are itemscope elements that are not themselves properties.
document.querySelectorAll("[itemscope]:not([itemprop])")
  .forEach(function(el) {
    items.push(getItem(el));
  });

result.items = items;

console.log(result);

document.body
  .insertAdjacentHTML("beforeend", "<pre>" + JSON.stringify(result, null, 2) + "</pre>");

var props = ["Blendmagic", "ratingValue"];

// get the 'content' corresponding to itemprop 'ratingValue'
// for item prop-name 'Blendmagic'
var data = result.items.filter(function(value) {
  return value.properties.name && value.properties.name[0] === props[0];
}).map(function(value) {
  var prop = value.properties.reviews[0].properties;
  var res = {},
    _props = {};
  _props[props[1]] = prop[props[1]];
  res[props[0]] = _props;
  return res;
})[0];

console.log(data);
document.querySelector("pre")
  .insertAdjacentHTML("beforebegin", "<pre>" + JSON.stringify(data, null, 2) + "</pre>");

<!DOCTYPE html>
<html>

<head>
</head>

<body>
  <div itemscope itemtype="http://schema.org/Offer">
    <span itemprop="name">Blendmagic</span>
    <span itemprop="price">$19.95</span>
    <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
      <img data-src="four-stars.jpg" />
      <meta itemprop="ratingValue" content="4" />
      <meta itemprop="bestRating" content="5" />Based on <span itemprop="ratingCount">25</span> user ratings
    </div>
  </div>
  <div itemscope itemtype="http://schema.org/Offer">
    <span itemprop="name">testMagic</span>
    <span itemprop="price">$10.95</span>
    <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
      <img data-src="four-stars.jpg" />
      <meta itemprop="ratingValue" content="4" />
      <meta itemprop="bestRating" content="5" />Based on <span itemprop="ratingCount">25</span> user ratings
    </div>
  </div>
</body>

</html>
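
Once a result object of this shape exists, looking up a nested property no longer needs the DOM. The sketch below is one possible helper, not part of the original answer: `findValues` is a hypothetical name, and the `result` literal is hand-written sample data mirroring the first Offer in the markup above.

```javascript
// Hand-written sample in the { items: [{ type, properties }] } shape,
// using the values of the first Offer from the example markup.
const result = {
  items: [
    {
      type: ["http://schema.org/Offer"],
      properties: {
        name: ["Blendmagic"],
        price: ["$19.95"],
        reviews: [
          {
            type: ["http://schema.org/AggregateRating"],
            properties: { ratingValue: ["4"], bestRating: ["5"], ratingCount: ["25"] },
          },
        ],
      },
    },
  ],
};

// Hypothetical helper: find the item whose "name" property contains itemName,
// then follow `path` through nested items and return every value reached.
function findValues(data, itemName, path) {
  const item = data.items.find(
    (it) => Array.isArray(it.properties.name) && it.properties.name.includes(itemName)
  );
  if (!item) return undefined;

  let current = [item];
  for (const key of path) {
    current = current.flatMap((node) =>
      node !== null && typeof node === "object" && node.properties && node.properties[key]
        ? node.properties[key]
        : []
    );
  }
  return current;
}

const values = findValues(result, "Blendmagic", ["reviews", "ratingValue"]);
console.log(values); // prints [ '4' ]
```

The helper returns an array because every Microdata property may hold several values; callers that expect exactly one rating can read `values[0]`.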

See also Recursion and loops of Microdata items
