file_get_contents() - Fix relative URLs


Problem description

I am trying to display a website to a user, having downloaded it using php. This is the script I am using:

<?php
$url = 'http://stackoverflow.com/pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
//Fix relative URLs
$site = str_replace('src="','src="' . $url,$site);
$site = str_replace('url(','url(' . $url,$site);
//Display to user
echo $site;
?>

So far this script works a treat, except for a few major problems with the str_replace calls. The problem is relative URLs. Suppose our made-up pagecalledjohn.php includes an image of a cat. It is a PNG and, as I see it, it can be placed on the page using 6 different URLs:

1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png" 

(4 is not applicable in this situation, but added anyway!)

5. src="/cat.png"
6. src="cat.png"

Is there a way, using PHP, to search for src=" and prepend the URL (filename removed) of the page being downloaded, but without inserting that URL for options 1, 2 or 3, and with a slightly different procedure for options 4, 5 and 6?
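
To make the failure concrete, here is a minimal sketch of what the str_replace approach does to three of the forms listed above. The function name naive_fix is made up for this illustration; it only wraps the same str_replace call used in the script.

```php
<?php
// Hypothetical wrapper around the str_replace call from the script above.
function naive_fix(string $html, string $url): string
{
    return str_replace('src="', 'src="' . $url, $html);
}

$url = 'http://stackoverflow.com/pagecalledjohn.php';
$samples = [
    '<img src="http://www.stackoverflow.com/cat.png">', // option 2
    '<img src="/cat.png">',                             // option 5
    '<img src="cat.png">',                              // option 6
];

foreach ($samples as $html) {
    echo naive_fix($html, $url), "\n";
}
// <img src="http://stackoverflow.com/pagecalledjohn.phphttp://www.stackoverflow.com/cat.png">
// <img src="http://stackoverflow.com/pagecalledjohn.php/cat.png">
// <img src="http://stackoverflow.com/pagecalledjohn.phpcat.png">
```

None of the three results points at the image: the page's file name is glued in front of every value, whether it was already absolute or not.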

Recommended answer

Rather than trying to change every path reference in the source code, why not simply inject a <base> tag into the page's <head> to specify the base URL against which all relative URLs should be resolved?

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base

This can be achieved using your DOM manipulation tool of choice. The example below shows how to do this using DOMDocument and related classes.

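$target_domain = 'http://stackoverflow.com/';
$url = $target_domain . 'pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
if($site === false) {
    throw new Exception('Could not download ' . $url);
}

// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
// collect parser warnings about sloppy real-world HTML instead of printing them
libxml_use_internal_errors(true);
if($dom->loadHTML($site) === false) {
    // something went wrong in loading HTML to DOM Document
    throw new Exception('Could not parse the HTML from ' . $url);
}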

// find <head> tag
$head_tag_list = $dom->getElementsByTagName('head');
// there should only be one <head> tag
if($head_tag_list->length !== 1) {
    throw new Exception('Wow! The HTML is malformed without single head tag.');
}
$head_tag = $head_tag_list->item(0);

// find first child of head tag to later use in insertion
$head_has_children = $head_tag->hasChildNodes();
if($head_has_children) {
    $head_tag_first_child = $head_tag->firstChild;
}

// create new <base> tag
$base_element = $dom->createElement('base');
$base_element->setAttribute('href', $target_domain);

// insert new base tag as first child to head tag
if($head_has_children) {
    $base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
} else {
    $base_node = $head_tag->appendChild($base_element);
}

echo $dom->saveHTML();
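
The example above uses the site root as the base, which is enough for a page that lives at the root. For a page inside a subdirectory, the base should be the page's own directory (the "filename removed" URL from the question). Here is a minimal sketch of deriving it with parse_url; the helper name base_href_for is an assumption made for this illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper: derive the directory URL of a page so it can be
// passed to setAttribute('href', ...) on the injected <base> element.
function base_href_for(string $page_url): string
{
    $parts = parse_url($page_url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException('Not an absolute URL: ' . $page_url);
    }

    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Keep everything up to and including the last slash of the path;
    // the query string and fragment are dropped.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir;
}

echo base_href_for('http://stackoverflow.com/pagecalledjohn.php'), "\n";
// http://stackoverflow.com/
echo base_href_for('http://stackoverflow.com/somedirectory/page.php?x=1'), "\n";
// http://stackoverflow.com/somedirectory/
```

Since browsers resolve relative URLs against the full base URL, passing the page URL itself as the href should behave the same way; stripping the file name just makes the intent explicit.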

At the very minimum, if you truly want to modify all path references in the source code, I would HIGHLY recommend doing so with DOM manipulation tools (DOMDocument, DOMXPath, etc.) rather than regex. I think you will find it a much more stable solution.
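
As a rough sketch of that DOM-based route, the code below rewrites every src attribute with DOMDocument and DOMXPath, following the six cases from the question. The function names resolve_src and absolutize_src_attributes are made up for this illustration. The sketch makes several simplifying assumptions: it ignores ports, does not collapse ../ segments, and leaves url(...) references inside CSS untouched.

```php
<?php
// Hypothetical helper: resolve one src value against the page URL.
function resolve_src(string $src, string $page_url): string
{
    // Options 1, 2, 3: protocol-relative or already absolute, leave as-is.
    if (strpos($src, '//') === 0 || preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return $src;
    }

    $parts = parse_url($page_url);
    $origin = $parts['scheme'] . '://' . $parts['host']; // port ignored for brevity
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);    // page URL, filename removed

    // Option 5: root-relative, resolve against the site root.
    if (strpos($src, '/') === 0) {
        return $origin . $src;
    }

    // Options 4 and 6: relative to the page's directory.
    return $origin . $dir . $src;
}

// Hypothetical helper: apply resolve_src() to every element with a src attribute.
function absolutize_src_attributes(string $html, string $page_url): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[@src]') as $element) {
        $element->setAttribute(
            'src',
            resolve_src($element->getAttribute('src'), $page_url)
        );
    }

    return $dom->saveHTML();
}

$page = 'http://stackoverflow.com/pagecalledjohn.php';
echo resolve_src('cat.png', $page), "\n";
// http://stackoverflow.com/cat.png
echo resolve_src('somedirectory/cat.png', $page), "\n";
// http://stackoverflow.com/somedirectory/cat.png
echo resolve_src('//www.stackoverflow.com/cat.png', $page), "\n";
// //www.stackoverflow.com/cat.png
```

The same loop could be repeated for href attributes with a query such as //*[@href], but the <base> approach above covers those without touching each element.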
