DomDocument and html entities
Problem description
I'm trying to parse some HTML that includes some HTML entities, like &times;
$str = '<a href="http://example.com/"> A &times; B</a>';
$dom = new DomDocument;
$dom -> substituteEntities = false;
$dom ->loadHTML($str);
$link = $dom ->getElementsByTagName('a') -> item(0);
$fullname = $link -> nodeValue;
$href = $link -> getAttribute('href');
echo "
fullname: $fullname \n
href: $href\n";
But DomDocument replaces the entity, so the text comes out as A × B.
Is there some way to keep it from treating the & as the start of an HTML entity and make it just leave the text alone? I tried setting substituteEntities to false, but it doesn't do anything.
Recommended answer
From the docs:
The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.
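The conversions the docs describe can also be written with iconv(), which may be preferable on newer PHP versions because utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. The following is only a minimal sketch: the helper names toUtf8() and fromUtf8() are hypothetical, not part of the DOM extension, and ISO-8859-1 is assumed as the legacy encoding.

```php
<?php
// Hypothetical helpers (names are illustrative only): convert between a
// legacy encoding and the UTF-8 that the DOM extension works with.
function toUtf8(string $text, string $from = 'ISO-8859-1'): string
{
    $converted = iconv($from, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $from to UTF-8");
    }
    return $converted;
}

function fromUtf8(string $text, string $to = 'ISO-8859-1'): string
{
    $converted = iconv('UTF-8', $to, $text);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from UTF-8 to $to");
    }
    return $converted;
}

$latin1 = "A \xD7 B";        // "A × B" as ISO-8859-1 bytes
$utf8   = toUtf8($latin1);   // the same text as UTF-8 bytes
$back   = fromUtf8($utf8);   // round trip back to ISO-8859-1
```

For ISO-8859-1 this should behave like utf8_encode() and utf8_decode(); for another encoding, pass its name as the second argument.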
Assuming you're using latin-1, try:
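<?php
// Declare the page as ISO-8859-1 so the converted output displays correctly.
header('Content-Type: text/html; charset=iso-8859-1');

// The DOM extension works in UTF-8, so convert the latin-1 input first.
$str = utf8_encode('<a href="http://example.com/"> A &times; B</a>');

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// nodeValue comes back as UTF-8; convert it to latin-1 for output.
$fullname = utf8_decode($link->nodeValue);
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";
?>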
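If the goal is to get the entity itself back in the output rather than the decoded character, one option is to re-encode the node text with htmlentities() after parsing. This is a sketch under the assumption that named entities in the output are acceptable; it does not stop DomDocument from decoding the entity during parsing, it only converts the character back afterwards.

```php
<?php
$str = '<a href="http://example.com/"> A &times; B</a>';

$dom = new DOMDocument;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);

// nodeValue holds the decoded character (×) as UTF-8.
$text = $link->nodeValue;

// Convert characters that have named HTML entities back to entity form.
$fullname = htmlentities($text, ENT_QUOTES, 'UTF-8');
$href = $link->getAttribute('href');

echo "fullname: $fullname\n";
echo "href: $href\n";
```

Because the result is plain ASCII, it displays the same way regardless of the charset the page declares.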