PHP: extract body tag content


Question

I'm trying to do something that should be very easy, but I can't get it to work, which makes me wonder whether I'm using the right workflow.

I have a simple HTML page which I load in my desktop application as a help file. This page has no menu, just the content. On my website I want to have a more sophisticated help system, so I want to use a PHP file which will show a menu, breadcrumbs, and a header and footer. To avoid duplicating my help content, I want to load the original HTML help file and add its body content to my enhanced help page.

I'm using this code to extract the title:

function getURLContent($filename){
    $url = realpath(dirname(__FILE__)) . DIRECTORY_SEPARATOR . $filename;
    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    @$doc->loadHTMLFile($url);
    return $doc;
}

function getSingleElementValue($element){
  if (!is_null($element)) {
    $node = $element->childNodes->item(0);
    return $node->nodeValue;
  }
} 

$doc = getURLContent("test.html");
$title = getSingleElementValue($doc->getElementsByTagName('title')->item(0));
echo $title;

The title is extracted correctly.

Now I try to extract the body:

function getBodyContent($element){
  $mock = new DOMDocument;
  foreach ($element->childNodes as $child){
      $mock->appendChild($mock->importNode($child, true));
  }        
  return $mock->saveHTML();
}

$body = getBodyContent($doc->getElementsByTagName('body')->item(0));
echo $body;

The getBodyContent() function is one of several options I tried. All of them return the whole HTML document, including the HEAD tag.

My question is: is this a correct workflow, or should I use something else?

Thanks.

Update: My final goal is to have a website with multiple pages where the help files are accessible via a menu. These pages will be generated using something like generate.php?page=test.html. I'm not at that part yet. The goal is also to avoid duplicating the content of test.html, because this file will be used in my desktop application (using a web control). In my desktop application I don't need the menu and such.

Update #2: I had to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> to the HTML file I want to read, and now I do get the body content. Unfortunately, all tags are stripped. I'll need to fix that as well.

Recommended answer

The problem is that saveHTML(), called without arguments, returns an entire document. You don't want this. Instead, you want just what you put in.

Thankfully, you can do this much more easily.

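function getBodyContent(DOMNode $element) {
    $doc = $element->ownerDocument;
    $wrapper = $doc->createElement('div');
    // childNodes is a live list: moving nodes while iterating it with
    // foreach skips siblings, so drain the element via firstChild instead.
    while ($element->firstChild) {
        $wrapper->appendChild($element->firstChild);
    }
    $element->appendChild($wrapper);
    // Serialize only the wrapper, then strip its known <div>...</div> tags.
    $html = $doc->saveHTML($wrapper);
    return substr($html, strlen("<div>"), -strlen("</div>"));
}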

This wraps the contents in a single element with a known tag representation (the body may have attributes, so its opening tag is not predictable), gets the rendered HTML of that element, and strips off the wrapper's known opening and closing tags.
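If modifying the loaded document is undesirable, a possible alternative is to serialize each child of the body separately and concatenate the results. The sketch below is not from the original answer: the helper name getInnerHtml() and the sample markup are made up for illustration, and it assumes a PHP version where saveHTML() accepts a node argument (5.3.6 or later) and scalar return types are available (7.0 or later).

```php
<?php
// Hypothetical helper: returns the serialized children of $element
// without moving any nodes, so the source document stays unchanged.
function getInnerHtml(DOMNode $element): string {
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes only that node and its subtree.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Sample markup, invented for this example.
$doc = new DOMDocument;
$doc->loadHTML(
    '<html><head><title>Help</title></head>'
    . '<body class="help"><h1>Intro</h1><p>Some <b>bold</b> text.</p></body></html>'
);

$body = $doc->getElementsByTagName('body')->item(0);
$inner = getInnerHtml($body);
echo $inner;
```

Because nothing is moved, iterating childNodes with foreach is safe here, and the body's own attributes never appear in the output.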

I'd also like to suggest an improvement to getSingleElementValue:

function getSingleElementValue(DOMNode $element) {
    return trim($element->textContent);
}

Note also the use of type hints to ensure that your functions are indeed getting the kind of thing they expect. This is useful because we no longer need to ask "does $element->ownerDocument exist? Does $element->ownerDocument->saveHTML() do what we think it does?" and other such questions. The hint ensures we have a DOMNode, so we know it has those things.
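One consequence of the type hint is worth illustrating: item(0) returns null when no matching element exists, and a DOMNode-typed parameter rejects null. The sketch below assumes PHP 7 or later, where that rejection is a catchable TypeError. The sample markup and the fallback message are invented for the example.

```php
<?php
function getSingleElementValue(DOMNode $element): string {
    return trim($element->textContent);
}

libxml_use_internal_errors(true);

// Sample document with no <title> element, invented for this example.
$doc = new DOMDocument;
$doc->loadHTML('<html><head></head><body><p> No title here </p></body></html>');

// item(0) returns null because there is no <title> in the document.
$title = $doc->getElementsByTagName('title')->item(0);

try {
    $result = getSingleElementValue($title);
} catch (TypeError $e) {
    // The type hint rejects null before the function body runs.
    $result = 'No title element found';
}
echo $result, "\n";

// With an element that does exist, the trimmed text is returned.
$paragraph = getSingleElementValue($doc->getElementsByTagName('p')->item(0));
echo $paragraph, "\n";
```

If a missing element is an expected case rather than an error, checking for null before the call, or declaring the parameter as ?DOMNode, may suit better than catching the TypeError.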
