Extracting text tags in order - How can this be done?

Problem description

I am trying to find all the text in an HTML document along with the parent tag of each text node. In the example below, the variable named html holds the sample HTML from which I extract the tags and the text. This works and, as expected, returns each tag together with its text.

Here I have used cheerio to traverse the DOM. cheerio works much the same way as jQuery.

const cheerio = require("cheerio");

const html = `
                    <html>
                <head></head>
                <body>
                <p>
                  Regular bail is the legal procedure through which a court can direct 
                  release of persons in custody under suspicion of having committed an offence, 
                  usually on some conditions which are designed to ensure 
                  that the person does not flee or otherwise obstruct the course of justice. 
                  These conditions may require executing a "personal bond", whereby a person
                  pledges a certain amount of money or property which may be forfeited if 
                  there is a breach of the bail conditions. Or, a court may require
                  executing a bond "with sureties", where a person is not seen as 
                  reliable enough and may have to present 
                  <em>other persons</em> to vouch for her, 
                  and the sureties must execute bonds pledging money / property which 
                  may be forfeited if the accused person breaches a bail condition.
                </p>
                </body>
            </html>

`;

// Collect every node of the given nodeType under `el` (3 = text node).
const getNodeType = function (renderedHTML, el, nodeType) {
    const $ = cheerio.load(renderedHTML);

    return $(el).find(":not(iframe)").addBack().contents().filter(function () {
        return this.nodeType === nodeType;
    });
};

const allTextPairs = [];
const $ = cheerio.load(html);
// Record [parent tag name, trimmed text] for each text node found.
getNodeType(html, $("html"), 3).each((i, node) => {
    const parent = node.parentNode.tagName;
    const nodeValue = node.nodeValue.trim();
    allTextPairs.push([parent, nodeValue]);
});

console.log(allTextPairs);

But the problem is that the extracted text nodes are out of order. In the output, "other persons" is reported at the end, although it should appear before "to vouch for her, ...". Why does this happen? How can I prevent this?

Recommended answer

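You might want to just walk the tree depth-first, which visits nodes in document order. The original approach most likely returns the nodes grouped by parent, because contents() collects the children of each matched element in turn, so the text inside em only shows up after all of the text nodes of p. Walk function courtesy of this gist.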

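// Visit `el` first, then each child in order, passing along the chain of ancestors.
function walk(el, fn, parents = []) {
  fn(el, parents);
  (el.children || []).forEach((child) => walk(child, fn, parents.concat(el)));
}
walk(cheerio.load(html).root()[0], (node, parents) => {
  // Skip whitespace-only text nodes; print the nearest parent's tag name and the text.
  if (node.type === "text" && node.data.trim()) {
    console.log(parents[parents.length - 1].name, node.data);
  }
});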

This prints each parent tag and its text, but you could just as well push them into that array of yours.
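As a rough sketch of that last suggestion, the same walk can push [tag, text] pairs into an array instead of printing them. The helper name collectTextPairs and the hand-built sampleTree are hypothetical, introduced only for illustration; the tree mimics the node shape the walk relies on (type, name, data, children) so the snippet runs without cheerio installed. With cheerio you would presumably pass cheerio.load(html).root()[0] instead.

```javascript
// Depth-first walk: visit the node, then each child in order.
function walk(el, fn, parents = []) {
  fn(el, parents);
  (el.children || []).forEach((child) => walk(child, fn, parents.concat(el)));
}

// Hypothetical helper: collect [parentTagName, trimmedText] pairs in document order.
function collectTextPairs(root) {
  const pairs = [];
  walk(root, (node, parents) => {
    if (node.type === "text" && node.data.trim()) {
      const parent = parents[parents.length - 1];
      pairs.push([parent ? parent.name : undefined, node.data.trim()]);
    }
  });
  return pairs;
}

// Assumption: a hand-built stand-in for a parsed document, using the same
// properties the walk reads (type, name, data, children).
const sampleTree = {
  type: "root",
  children: [
    {
      type: "tag",
      name: "p",
      children: [
        { type: "text", data: " may have to present " },
        {
          type: "tag",
          name: "em",
          children: [{ type: "text", data: "other persons" }],
        },
        { type: "text", data: " to vouch for her, " },
      ],
    },
  ],
};

const allTextPairs = collectTextPairs(sampleTree);
console.log(allTextPairs);
// [ [ 'p', 'may have to present' ], [ 'em', 'other persons' ], [ 'p', 'to vouch for her,' ] ]
```

Because each node is visited before its later siblings, the em text lands between the two p fragments, which is the order the question asks for.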
