DOMDocument character set issue

Views: 123 | Published: 2020/10/25 21:41:57 | Tags: php, domdocument


Question

This is the video from which I want to get the og:title:

http://www.youtube.com/watch?feature=player_embedded&v=A683kmvRH_8

PHP code:

function file_get_contents_curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$html = file_get_contents_curl($pageurl);

$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

$titleBackUp = $nodes->item(0)->nodeValue;

$metas = $doc->getElementsByTagName('meta');

for ($i = 0; $i < $metas->length; $i++){
    $meta = $metas->item($i);
    if($meta->getAttribute('name') == 'title')
        $title = $meta->getAttribute('content');
}

The title is Мастило-Връцететиенай-добре[ HQ], but what I get back is garbled text.

I also tried using

 curl_setopt( $ch, CURLOPT_ENCODING, "UTF-8" );

but that does not help.

I also tried html_entity_decode, but it does not work either.

Recommended answer

This can happen if the document itself doesn't contain a <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. Without a declared charset, libxml does not assume UTF-8 (it has historically defaulted to ISO-8859-1), so multi-byte characters come out garbled.

You can try either of the following:

1. Let DOMDocument load the HTML directly from the server, i.e. use ->loadHTMLFile().

2. Prefix the document with the aforementioned meta tag before running it through ->loadHTML().

For example, you could do this:

libxml_use_internal_errors(true);
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $html);
libxml_clear_errors();

It's a hack to let libxml know it's supposed to read UTF-8 data; it's not possible to pass the encoding to ->loadHTML() directly.
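
Putting the pieces together, here is a minimal sketch of the prefix approach applied to the og:title lookup from the question. The helper name extract_og_title and the inline sample page are assumptions made for illustration, not part of the original post. It also assumes the fetched HTML really is UTF-8, and that the tag is written as property="og:title", the usual Open Graph form, rather than name="title" as in the question's loop.

```php
<?php
// Hypothetical helper (not from the original post): returns the og:title of
// a UTF-8 HTML string, or null when the page has no such meta tag.
function extract_og_title(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    // The prefixed meta tag tells libxml to decode the bytes as UTF-8.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $html
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Open Graph tags use the "property" attribute rather than "name".
        if ($meta->getAttribute('property') === 'og:title') {
            return $meta->getAttribute('content');
        }
    }

    return null;
}

// Inline sample page (an assumption standing in for the fetched HTML).
$sample = '<html><head><meta property="og:title" content="Връцете"></head><body></body></html>';
echo extract_og_title($sample), "\n";
```

In the question's code this would be called as extract_og_title($html), with the <title> value kept as a fallback when the function returns null.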
