Preserve line breaks inside <p> tags using DOMXPath?

Question

I'm currently using PHP and DOMXPath to get the contents of all of the <p> elements of a web page:

<?php
// ...
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph){
    echo $paragraph->textContent . "<br />";
}

My problem is that the string resulting from textContent does not respect <br /> tags that exist within those <p> elements. Instead it removes the line break and pushes words together that would normally be on separate lines. For example:

Example HTML:

<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>

<p>
Random information and what not<br />
Isn't that cool?
</p>

Current Output from PHP above:

Some happy talk goes here talking about our great product.We would love for you to buy it!

Random information and what notIsn't that cool?

I have tried $paragraphs = $doc->getElementsByTagName("p"); as well and it gives me the same thing.

Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate each of the words within a paragraph, and the current output disallows that.

If there is an alternative method for retrieving the string within <p> elements while preserving <br /> or '\n' that would also be great.

Edit

Upon further investigation, the HTML in question is actually a list of anchors separated by <br> tags, with no actual line breaks between them:

<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

Turns out that this works properly with the original HTML given.
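
A minimal reproduction may make the difference clearer. The sketch below uses shortened, made-up markup (the `href` values are placeholders, not the real page) and assumes the DOM extension is available. A `<br>` element has no text of its own, so it adds nothing to `textContent`; words stay apart only when the source also contains a real newline or space at that point.

```php
<?php
// Case 1: a real newline follows the <br />, as in the example HTML above.
$withNewline = new DOMDocument();
$withNewline->loadHTML("<p>one<br />\ntwo</p>");
$kept = $withNewline->getElementsByTagName('p')->item(0)->textContent;

// Case 2: anchors separated only by <br>, with no whitespace in between.
$noNewline = new DOMDocument();
$noNewline->loadHTML('<p><a href="/classic.html">Classic Checking</a><br><a href="/cds.html">CDs</a></p>');
$fused = $noNewline->getElementsByTagName('p')->item(0)->textContent;

// json_encode makes the newline (or its absence) visible in the output.
echo json_encode($kept), PHP_EOL;
echo json_encode($fused), PHP_EOL;
```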

Update: Solved

With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It's not very elegant, but it will do for now.

Example:

foreach ($paragraphs as $paragraph){
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    //Do whatever, in this case get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}

This makes use of DOMinnerHTML (as suggested by @netcoder) to replace the instances of <br> with "\r\n" (as suggested by @ircmaxell), which can then be evaluated post textContent.
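
The post calls `DOMinnerHTML()` without showing its body; it is a user-defined helper, not a built-in PHP function. One common way to write such a helper is sketched below. The implementation and the sample markup are assumptions for illustration, not the code the author actually used.

```php
<?php
// Hypothetical helper: serializes every child of $element and joins the pieces,
// which yields the element's inner HTML.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes only that node and its subtree.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

$doc = new DOMDocument();
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br><a href="/b.html">Savings Plans</a></p>');
$p = $doc->getElementsByTagName('p')->item(0);

$inner = DOMinnerHTML($p);
// Swap each serialized <br> for a newline, then drop the remaining tags.
$text = strip_tags(str_ireplace(array('<br>', '<br />', '<br/>'), "\n", $inner));
echo $text, PHP_EOL;
```

Using `strip_tags()` here avoids parsing the fragment a second time with `loadHTML()`, at the cost of leaving HTML entities undecoded.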

Obviously there's some room for improvement, but it has solved my current issue.

Thanks for the help, everyone,

Ben

Recommended answer

Well, what I would do is replace the <br> elements with literal line breaks before reading the text:

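$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName() returns a live list, so copy it to an array
// before replacing nodes; otherwise some <br> elements can be skipped.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
    // Swap each <br> for a text node holding a literal line break.
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph){
    echo $paragraph->textContent . "<br />";
}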
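
Since the stated goal was to separate the words within each paragraph, the same idea can be wrapped into a small function. This is a sketch under assumptions: `paragraphWords()` is a hypothetical name, the sample markup uses placeholder `href` values, and splitting on whitespace with `preg_split()` is one possible definition of a "word". The list returned by `getElementsByTagName()` is live, so it is copied to an array before any node is replaced.

```php
<?php
// Hypothetical helper: returns one array of words per <p> in $html,
// treating every <br> as a word boundary.
function paragraphWords(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list so replacing nodes does not disturb the iteration.
    foreach (iterator_to_array($doc->getElementsByTagName('br')) as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    $xpath = new DOMXPath($doc);
    $result = array();
    foreach ($xpath->query('//p') as $paragraph) {
        // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
        $result[] = preg_split('/\s+/', $paragraph->textContent, -1, PREG_SPLIT_NO_EMPTY);
    }
    return $result;
}

$html = '<p><a href="/classic.html">Classic Checking</a><br>'
      . '<a href="/cds.html">CDs</a><br><a href="/iras.html">IRAs</a></p>';
$words = paragraphWords($html);
print_r($words);
```

Punctuation stays attached to the words in this version; stripping it, as the asker's own snippet does with `str_ireplace()`, would be a separate step.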
