DOMDocument character set issue


Question

This is the video from which I want to get the og:title:

http://www.youtube.com/watch?feature=player_embedded&v=A683kmvRH_8

PHP code:

// Fetch a URL with cURL and return the response body as a string.
function file_get_contents_curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$html = file_get_contents_curl($pageurl);

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ hides warnings about malformed markup

// Keep the <title> text as a fallback.
$nodes = $doc->getElementsByTagName('title');
$titleBackUp = $nodes->item(0)->nodeValue;

// Look for <meta name="title" content="..."> among all meta tags.
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++){
    $meta = $metas->item($i);
    if($meta->getAttribute('name') == 'title')
        $title = $meta->getAttribute('content');
}

The title is Мастило-Връцететиенай-добре [HQ], but I get:

ÐаÑÑило-ÐÑÑÑеÑеÑиенай-доб Ñе [HQ]

I also tried using:

 curl_setopt( $ch, CURLOPT_ENCODING, "UTF-8" );

but that does not help.

I also tried html_entity_decode, but it does not work either.

Recommended answer

This can happen if the document itself doesn't contain a <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. Without a declared charset, libxml typically falls back to a single-byte encoding such as ISO-8859-1, so each byte of a multi-byte UTF-8 character is decoded as a separate character.
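To illustrate where output such as "ÐаÑ..." comes from, here is a minimal sketch that reproduces the effect without DOMDocument by reinterpreting UTF-8 bytes as ISO-8859-1. It assumes the mbstring extension is loaded, that the script file is saved as UTF-8, and that ISO-8859-1 is the fallback encoding in play; the sample string is only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext-mbstring is available.
$title = 'Мастило'; // 7 Cyrillic letters, 2 bytes each in UTF-8

// Treat every UTF-8 byte as one ISO-8859-1 character and re-encode it as
// UTF-8. This mimics a parser that was never told the input is UTF-8.
$garbled = mb_convert_encoding($title, 'UTF-8', 'ISO-8859-1');

// The damage is reversible as long as no bytes were dropped along the way.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo $garbled, PHP_EOL;
echo $restored, PHP_EOL;
```

The garbled string has twice as many characters as the original, which is a quick way to recognise this kind of misdecoding.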

You can try either of the following:

  1. Let DOMDocument load the HTML directly from the server (i.e. use ->loadHTMLFile()).

  2. Prefix the document with the aforementioned meta tag before running it through ->loadHTML().

For example, you could do this:

libxml_use_internal_errors(true);
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $html);
libxml_clear_errors();

It's a hack to let libxml know that it is supposed to read UTF-8 data; it's not possible to pass the encoding via ->loadHTML() itself.
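Putting the meta-tag prefix together with the og:title lookup from the question, a self-contained sketch might look like the following. The function name extract_og_title and the inline sample page are hypothetical and stand in for the downloaded HTML; the sketch also assumes the page is actually UTF-8 and that og:title is exposed through the property attribute, as Open Graph tags normally are.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the og:title
// of a UTF-8 HTML string, or null when no such tag is present.
function extract_og_title(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    // The prepended meta tag tells libxml to decode the input as UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Open Graph tags use the "property" attribute rather than "name".
        if ($meta->getAttribute('property') === 'og:title') {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Stand-in for the fetched page: Cyrillic title, no charset declaration.
$html = '<html><head>'
      . '<meta property="og:title" content="Мастило [HQ]">'
      . '</head><body></body></html>';

echo extract_og_title($html), PHP_EOL;
```

With a real page, the value returned by file_get_contents_curl() from the question would be passed in place of the sample string.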
