PHP DOM UTF-8 problem


Problem description

First of all, my database uses Windows-1250 as its native charset, and I output the data as UTF-8. I use the iconv() function all over my website to convert Windows-1250 strings to UTF-8 strings, and it works perfectly.
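As a small illustration of the conversion step described above, the sketch below converts a single Windows-1250 byte to UTF-8 with iconv(). The sample byte and the variable names are assumptions chosen for the example; they do not come from the original site.

```php
<?php
// Assumed sample input: byte 0x9A is "š" (U+0161) in Windows-1250.
$win1250 = "\x9A";

// Convert to UTF-8; "š" becomes the two-byte sequence 0xC5 0xA1.
$utf8 = iconv('Windows-1250', 'UTF-8', $win1250);

echo bin2hex($utf8), "\n"; // prints "c5a1"
```

Comparing raw bytes with bin2hex() sidesteps any doubt about how the terminal or browser decodes the output.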

The problem appears when I use PHP DOM to parse HTML stored in the database. The HTML is output from a WYSIWYG editor and is not a complete document: it has no html, head, or body tags.

The HTML could look something like this, for example:

<p>Hello</p>

Here is the method I use to parse HTML from the database:

 private function ParseSlideContent($slideContent)
 {
  var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

The output of the method above is garbage: all special characters are replaced with weird sequences like Ú�.

One more thing: it works on my development server, but it does not work on the production server.

Any suggestions?

PHP version of the production server: PHP Version 5.2.0RC4-dev

PHP version of the development server: PHP Version 5.2.13

Update:

I'm working on a solution myself. I was inspired by this PHP bug report (not really a bug, though): http://bugs.php.net/bug.php?id=32547

This is my proposed solution. I will try it tomorrow and let you know whether it works:

 private function ParseSlideContent($slideContent)
 {
  var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  // this might work
  // it basically just adds head and meta tags to the document
  $html = $doc->getElementsByTagName('html')->item(0);
  $head = $doc->createElement('head', '');
  $meta = $doc->createElement('meta', '');
  $meta->setAttribute('http-equiv', 'Content-Type');
  $meta->setAttribute('content', 'text/html; charset=utf-8');
  $head->appendChild($meta);
  $body = $doc->getElementsByTagName('body')->item(0);
  $html->removeChild($body);
  $html->appendChild($head);
  $html->appendChild($body);

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

Recommended answer

Your "hack" doesn't make sense.

You are converting Windows-1250 HTML to UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. For HTML input, the DOM extension:

  • takes the charset specified in a meta http-equiv element for Content-Type;
  • otherwise assumes ISO-8859-1.

I suggest you instead convert from Windows-1250 to ISO-8859-1 and prepend nothing.
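Whether a conversion to ISO-8859-1 is lossless depends on every character in the data having a code point in that charset. The sketch below is one rough way to check this for individual characters. It assumes a UTF-8 source file, PHP 7.2 or later, and the mbstring extension; fitsLatin1() is a hypothetical helper name, not part of any library.

```php
<?php
// ISO-8859-1 maps its 256 byte values directly onto U+0000..U+00FF,
// so a character with a higher code point cannot be represented in it.
// fitsLatin1() is a hypothetical helper; it expects one UTF-8 character.
function fitsLatin1(string $utf8Char): bool
{
    return mb_ord($utf8Char, 'UTF-8') <= 0xFF;
}

var_dump(fitsLatin1('á')); // bool(true):  U+00E1
var_dump(fitsLatin1('š')); // bool(false): U+0161
var_dump(fitsLatin1('Ą')); // bool(false): U+0104
```

Characters such as š and Ą are common in Windows-1250 text, so data in that charset will often fail this check.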

EDIT: That suggestion is not a good one, because Windows-1250 has characters that do not exist in ISO-8859-1. Since you're dealing with fragments that have no meta element for the content type, you can add your own to force interpretation as UTF-8:

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Prepend meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new DOMDocument;
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

This gives:


string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
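The same idea can be wrapped in a small reusable function. The following is only a sketch under assumptions: loadUtf8Fragment() is a hypothetical name, the sample bytes are invented for the example, and the type declarations require PHP 7 or later (newer than the versions mentioned in the question).

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment into a DOMDocument,
// prepending a meta element so the parser reads the bytes as UTF-8.
function loadUtf8Fragment(string $utf8Html): DOMDocument
{
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        . $utf8Html
    );

    // Discard collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Assumed sample: "<p>šá</p>" stored as Windows-1250 bytes (0x9A, 0xE1).
$fragment = iconv('Windows-1250', 'UTF-8', "<p>\x9A\xE1</p>");
$doc = loadUtf8Fragment($fragment);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

echo $text, "\n";
```

Restoring the previous libxml error setting keeps the helper from silently changing error handling for the rest of the application.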
