Preserve line breaks inside <p> tags using DOMXPath?

Views: 223 | Posted: 2017/6/25 3:22:31 | Tags: php, html, dom, xpath


Problem description

I'm currently using PHP and DOMXPath to get the contents of all of the <p> elements of a web page:

<?php
// ...
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}

My problem is that the string resulting from textContent does not respect the <br /> tags that exist within those <p> elements. Instead, it removes the line break and pushes together words that would normally be on separate lines. For example:

Example HTML:

<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>

<p>
Random information and what not<br />
Isn't that cool?
</p>

Current output from the PHP above:

Some happy talk goes here talking about our great product.We would love for you to buy it!

Random information and what notIsn't that cool?

I have tried $paragraphs = $doc->getElementsByTagName("p"); as well, and it gives me the same thing.

Is there a way to make DOMXPath/DOMDocument preserve the line breaks? I need to be able to separate each of the words within a paragraph, and the current output makes that impossible.

If there is an alternative method for retrieving the string within <p> elements while preserving <br /> or '\n', that would also be great.

Edit

Upon further investigation, the HTML in question is actually a list of anchors separated by <br> tags but with no actual line breaks:

<p class="home_page_list"><a href="/home/personal-banking/checking/Category-Page-Classic-Checking/classic-checking.html">Classic Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-checking.html">Interest Checking</a><br> <a href="/home/personal-banking/checking/Category-Page-Interest-Checking/interest-premium-checking.html">Premium Checking</a><br> <a href="/home/personal-banking/Savings-Category-Page/Basic-Savings-Category-Page/basic-savings.html">Savings Plans</a><br> <a href="/home/personal-banking/Savings-Category-Page/Money-Market-Accounts-Category-Page/money-market-accounts.html">Money Market Accounts</a><br> <a href="/home/personal-banking/Savings-Category-Page/Certificates-of-Deposit-Category-Page/fixed-rate-CD.html">CDs</a><br> <a href="/home/personal-banking/Savings-Category-Page/Individual-Retirement-Account-Category-Page/individual-retirement-account.html">IRAs</a></p>

It turns out that the code works properly with the original example HTML given above.
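A likely reason for the difference, sketched below under stated assumptions: in the first example a literal newline follows each <br /> in the source, and that newline is part of the text node, so textContent still contains whitespace between the two lines. In the anchor list nothing sits between the tags. The helper name paragraphText and the shortened sample strings are made up for this illustration.

```php
<?php
// Hypothetical helper for this illustration: return the text of the first <p>.
function paragraphText(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    return $doc->getElementsByTagName('p')->item(0)->textContent;
}

// A literal newline follows the <br />, as in the first example HTML.
$withNewline = paragraphText("<p>Random information and what not<br />\nIsn't that cool?</p>");

// No newline in the source, as in the anchor list from the edit.
$withoutNewline = paragraphText("<p>Random information and what not<br />Isn't that cool?</p>");

// Inspect both strings; json_encode makes any "\n" visible.
echo json_encode($withNewline), "\n";
echo json_encode($withoutNewline), "\n";
```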

Update: Solved

With the help of @ircmaxell's answer and the comments left by @netcoder and @Gordon, this has been solved. It's not very elegant, but it will do for now.

Example:

foreach ($paragraphs as $paragraph) {
    // DOMinnerHTML() is a user-defined helper (not part of PHP's DOM extension)
    // that returns the paragraph's inner markup as a string.
    $p_text = new DOMDocument();
    $p_text->loadHTML(str_ireplace(array("<br>", "<br />"), "\r\n", DOMinnerHTML($paragraph)));
    // Do whatever; in this case, get all of the words in an array.
    $words = explode(" ", str_ireplace(array(",", ".", "&", ":", "-", "\r\n"), " ", $p_text->textContent));
    print_r($words);
}

This uses DOMinnerHTML (as suggested by @netcoder) to get each paragraph's markup so that the instances of <br> can be replaced with "\r\n" (as suggested by @ircmaxell); the line breaks can then be evaluated after reading textContent.
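The post does not show DOMinnerHTML itself. One possible version is sketched here as an assumption, not as the helper the author actually used: it serializes each child node with DOMDocument::saveHTML and concatenates the results. The sample markup and URLs are shortened placeholders.

```php
<?php
// Hypothetical implementation: the original helper is not shown in the post.
// Returns the markup of everything inside $element, excluding $element's own tags.
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes only that node and its subtree.
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Placeholder markup modeled on the anchor list from the question.
$doc = new DOMDocument();
$doc->loadHTML('<p><a href="/a.html">Classic Checking</a><br> <a href="/b.html">Savings Plans</a></p>');

$p = $doc->getElementsByTagName('p')->item(0);
echo DOMinnerHTML($p), "\n";
```

Because the result is a plain string, str_ireplace can swap the <br> markup for "\r\n" before the string is loaded into a second DOMDocument, as in the loop above.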
Obviously there's some room for improvement, but it has solved my current issue.

Thanks for the help everyone,
Ben

Recommended answer

Well, what I would do is replace the <br> elements with literal line breaks:
$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName() returns a live list, so copy it to an array first;
// otherwise replacing nodes while iterating can skip some <br> elements.
$brs = iterator_to_array($doc->getElementsByTagName('br'));
foreach ($brs as $node) {
    $node->parentNode->replaceChild($doc->createTextNode("\r\n"), $node);
}

$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");

foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}
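
Since the stated goal is to separate the words within a paragraph, a variation on the answer can be sketched as follows. The assumptions are labeled: the sample markup is a shortened placeholder, the <br> elements are selected with an XPath query (whose result is a static node list rather than a live one), and preg_split is swapped in for the explode call used in the question's update.

```php
<?php
// Placeholder markup modeled on the question's edit: anchors separated by <br>
// with no literal newline characters in the source.
$html = '<p><a href="/a.html">Classic Checking</a><br><a href="/b.html">Savings Plans</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Replace every <br> inside a <p> with a newline text node.
foreach ($xpath->query('//p//br') as $br) {
    $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
}

foreach ($xpath->query('//p') as $paragraph) {
    // Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\s+/', $paragraph->textContent, -1, PREG_SPLIT_NO_EMPTY);
    print_r($words);
}
```

Splitting on whitespace runs treats spaces and the inserted newlines alike, so words on either side of a former <br> are no longer fused.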

