What happens if <base href...> is set with a double slash?


Problem description


      I would like to understand how to handle a <base href="" /> value in my web crawler, so I tested several combinations in the major browsers and found some behavior with double slashes that I don't understand.

      If you don't want to read everything, jump to the test results of D and E. A demonstration of all tests:
      http://gutt.it/basehref.php

      Here are my test results, step by step, when calling http://example.com/images.html:

      A - Multiple base href

      <html>
      <head>
      <base target="_blank" />
      <base href="http://example.com/images/" />
      <base href="http://example.com/" />
      </head>
      <body>
      <img src="/images/image.jpg">
      <img src="image.jpg">
      <img src="./image.jpg">
      <img src="images/image.jpg"> not found
      <img src="/image.jpg"> not found
      <img src="../image.jpg"> not found
      </body>
      </html>
      

      Conclusion

      • only the first <base> with href counts
      • a source starting with / targets the root
      • ../ goes one folder up

      B - Without trailing slash

      <html>
      <head>
      <base href="http://example.com/images" />
      </head>
      <body>
      <img src="/images/image.jpg">
      <img src="image.jpg"> not found
      <img src="./image.jpg"> not found
      <img src="images/image.jpg">
      <img src="/image.jpg"> not found
      <img src="../image.jpg"> not found
      </body>
      </html>
      

      Conclusion

      • <base href> ignores everything after the last slash so http://example.com/images becomes http://example.com/

      C - How it should be

      <html>
      <head>
      <base href="http://example.com/" />
      </head>
      <body>
      <img src="/images/image.jpg">
      <img src="image.jpg"> not found
      <img src="./image.jpg"> not found
      <img src="images/image.jpg">
      <img src="/image.jpg"> not found
      <img src="../image.jpg"> not found
      </body>
      </html>
      

      Conclusion

      • Same result as in Test B as expected

      D - Double Slash

      <html>
      <head>
      <base href="http://example.com/images//" />
      </head>
      <body>
      <img src="/images/image.jpg">
      <img src="image.jpg">
      <img src="./image.jpg">
      <img src="images/image.jpg"> not found
      <img src="/image.jpg"> not found
      <img src="../image.jpg">
      </body>
      </html>
      

      E - Double Slash with whitespace

      <html>
      <head>
      <base href="http://example.com/images/ /" />
      </head>
      <body>
      <img src="/images/image.jpg">
      <img src="image.jpg"> not found
      <img src="./image.jpg"> not found
      <img src="images/image.jpg"> not found
      <img src="/image.jpg"> not found
      <img src="../image.jpg">
      </body>
      </html>
      

      Neither is a "valid" URL, but both are real results from my web crawler. Please explain what happened in D and E so that ../image.jpg could be found, and why the whitespace makes a difference.

      Just for your interest:

      • <base href="http://example.com//" /> is the same as Test C
      • <base href="http://example.com/ /" /> is completely different. Only ../image.jpg is found
      • <base href="a/" /> finds only /images/image.jpg

      Solution

      The behavior of base is explained in the HTML spec:

      The base element allows authors to specify the document base URL for the purposes of resolving relative URLs.

      As shown in your test A, if there are multiple base elements with an href attribute, the document base URL will be taken from the first one.
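
For a crawler, the "first base with href wins" rule could be sketched roughly as follows. This is only a sketch under assumptions: the `<base>` elements are assumed to have been extracted already into plain attribute objects in document order, and the `pickBaseHref` name and the object shape are hypothetical, not part of any library. It uses the standard `URL` constructor available in Node.js and modern browsers.

```javascript
// Hypothetical helper: pick the document base URL for a crawled page.
// `baseElements` is assumed to be an array of attribute maps in document order.
function pickBaseHref(baseElements, documentUrl) {
  for (const attrs of baseElements) {
    if (typeof attrs.href === "string") {
      // Only the first <base> that has an href attribute counts.
      // A relative href is resolved against the document URL here.
      return new URL(attrs.href, documentUrl).href;
    }
  }
  // No <base href> at all: relative URLs resolve against the document URL.
  return documentUrl;
}

// The <head> of test A: a target-only base, then two bases with href.
const baseUrl = pickBaseHref(
  [
    { target: "_blank" },
    { href: "http://example.com/images/" },
    { href: "http://example.com/" },
  ],
  "http://example.com/images.html"
);
console.log(baseUrl); // http://example.com/images/
```

The `<base target="_blank" />` entry is skipped because it has no href, which matches the outcome of test A.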

      Resolving relative URLs is done this way:

      Apply the URL parser to url, with base as the base URL, with encoding as the encoding.

      The URL parsing algorithm is defined in the URL spec.

      It's too complex to be explained here in detail. But basically, this is what happens:

      • A relative URL starting with / is resolved relative to the base URL's host.
      • Otherwise, the relative URL is resolved relative to the last directory of the base URL's path.
      • Be aware that if the base path doesn't end with /, its last part is treated as a file, not a directory.
      • ./ is the current directory
      • ../ goes one directory up

      (Probably, "directory" and "file" are not the proper terminology in URLs)
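
The rules above can be checked outside the DOM with the standard `URL` constructor (Node.js or a modern browser), which applies the same URL parser. This is a minimal sketch; the `resolve` wrapper is just an illustrative name.

```javascript
// Resolve a relative URL against a base URL with the WHATWG URL parser.
function resolve(relative, base) {
  return new URL(relative, base).href;
}

// A leading "/" replaces the whole base path.
console.log(resolve("/image.jpg", "http://example.com/images/"));
// http://example.com/image.jpg

// Base path ends with "/": "images" is kept as the last directory.
console.log(resolve("image.jpg", "http://example.com/images/"));
// http://example.com/images/image.jpg

// Base path does not end with "/": "images" is dropped like a file name.
console.log(resolve("image.jpg", "http://example.com/images"));
// http://example.com/image.jpg
```

The last two calls differ only in the trailing slash of the base, which is the difference between tests A and B.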

      Some examples:

      • http://example.com/images/a/./ is http://example.com/images/a/
      • http://example.com/images/a/../ is http://example.com/images/
      • http://example.com/images//./ is http://example.com/images//
      • http://example.com/images//../ is http://example.com/images/
      • http://example.com/images/./ is http://example.com/images/
      • http://example.com/images/../ is http://example.com/
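
These examples can be reproduced by letting the URL parser normalize each string. The sketch below assumes an environment with the standard `URL` constructor; the `examples` and `normalized` names are only illustrative.

```javascript
// Each pair is [input URL, result listed in the examples above].
const examples = [
  ["http://example.com/images/a/./", "http://example.com/images/a/"],
  ["http://example.com/images/a/../", "http://example.com/images/"],
  ["http://example.com/images//./", "http://example.com/images//"],
  ["http://example.com/images//../", "http://example.com/images/"],
  ["http://example.com/images/./", "http://example.com/images/"],
  ["http://example.com/images/../", "http://example.com/"],
];

// Parsing removes the "." and ".." segments but keeps empty segments ("//").
const normalized = examples.map(([input]) => new URL(input).href);

examples.forEach(([input, listed], i) => {
  const verdict = normalized[i] === listed ? "as listed" : "differs";
  console.log(input + " -> " + normalized[i] + " (" + verdict + ")");
});
```

Note that ../ after // removes only the empty segment between the two slashes, which is why the fourth example ends at /images/ rather than at the root.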

      Note that, in most cases, // will behave like /. As @poncha said:

      Unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the uri maps to a path on disk, but in (most?) modern operating systems (Linux/Unix, Windows), multiple path separators in a row do not have any special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file.

      However, / / (two slashes separated by a space) will generally not be treated like //, because the space forms a path segment of its own.
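
Applied to tests D and E, this can be sketched with the standard `URL` constructor, used here in place of the DOM-based approach in the snippets that follow. The `resolveBoth` helper is a hypothetical name, and the remarks about what the server has on disk are assumptions based on the test results in the question.

```javascript
const doubleSlashBase = "http://example.com/images//";  // test D
const spaceBase = "http://example.com/images/ /";       // test E

// Resolve the two interesting sources of the tests against one base.
function resolveBoth(base) {
  return [
    new URL("image.jpg", base).href,
    new URL("../image.jpg", base).href,
  ];
}

// The parser percent-encodes the space, so it stays a separate path segment.
console.log(new URL(spaceBase).href);
// http://example.com/images/%20/

console.log(resolveBoth(doubleSlashBase));
// [ 'http://example.com/images//image.jpg',
//   'http://example.com/images/image.jpg' ]

console.log(resolveBoth(spaceBase));
// [ 'http://example.com/images/%20/image.jpg',
//   'http://example.com/images/image.jpg' ]
```

In both cases ../ removes only the last segment (the empty one in D, the encoded space in E), so ../image.jpg lands on /images/image.jpg. In D, image.jpg resolves to /images//image.jpg, which a typical server maps to the same file; in E it resolves to /images/%20/image.jpg, which presumably does not exist on the server.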

      You can use the following snippet to resolve your list of relative URLs to absolute ones:

      var bases = [
        "http://example.com/images/",
        "http://example.com/images",
        "http://example.com/",
        "http://example.com/images//",
        "http://example.com/images/ /"
      ];
      var urls = [
        "/images/image.jpg",
        "image.jpg",
        "./image.jpg",
        "images/image.jpg",
        "/image.jpg",
        "../image.jpg"
      ];
      function newEl(type, contents) {
        var el = document.createElement(type);
        if(!contents) return el;
        if(!(contents instanceof Array))
          contents = [contents];
        for(var i=0; i<contents.length; ++i)
          if(typeof contents[i] == 'string')
            el.appendChild(document.createTextNode(contents[i]))
          else if(typeof contents[i] == 'object') // contents[i] instanceof Node
            el.appendChild(contents[i])
        return el;
      }
      function emoticon(str) {
        return {
          'http://example.com/images/image.jpg': 'good',
          'http://example.com/images//image.jpg': 'neutral'
        }[str] || 'bad';
      }
      var base = document.createElement('base'),
          a = document.createElement('a'),
          output = document.createElement('ul'),
          head = document.getElementsByTagName('head')[0];
      head.insertBefore(base, head.firstChild);
      for(var i=0; i<bases.length; ++i) {
        base.href = bases[i];
        var test = newEl('li', [
          'Test ' + (i+1) + ': ',
          newEl('span', bases[i])
        ]);
        test.className = 'test';
        var testItems = newEl('ul');
        testItems.className = 'test-items';
        for(var j=0; j<urls.length; ++j) {
          a.href = urls[j];
          var absURL = a.cloneNode(false).href;
            /* Stupid old IE requires cloning
               https://stackoverflow.com/a/24437713/1529630 */
          var testItem = newEl('li', [
            newEl('span', urls[j]),
            ' → ',
            newEl('span', absURL)
          ]);
          testItem.className = 'test-item ' + emoticon(absURL);
          testItems.appendChild(testItem);
        }
        test.appendChild(testItems);
        output.appendChild(test);
      }
      document.body.appendChild(output);

      span {
        background: #eef;
      }
      .test-items {
        display: table;
        border-spacing: .13em;
        padding-left: 1.1em;
        margin-bottom: .3em;
      }
      .test-item {
        display: table-row;
        position: relative;
        list-style: none;
      }
      .test-item > span {
        display: table-cell;
      }
      .test-item:before {
        display: inline-block;
        width: 1.1em;
        height: 1.1em;
        line-height: 1em;
        text-align: center;
        border-radius: 50%;
        margin-right: .4em;
        position: absolute;
        left: -1.1em;
        top: 0;
      }
      .good:before {
        content: ':)';
        background: #0f0;
      }
      .neutral:before {
        content: ':|';
        background: #ff0;
      }
      .bad:before {
        content: ':(';
        background: #f00;
      }

      You can also play with this snippet:

      var resolveURL = (function() {
        var base = document.createElement('base'),
            a = document.createElement('a'),
            head = document.getElementsByTagName('head')[0];
        return function(url, baseurl) {
          if(base) {
            base.href = baseurl;
            head.insertBefore(base, head.firstChild);
          }
          a.href = url;
          var abs = a.cloneNode(false).href;
          /* Stupid old IE requires cloning
             https://stackoverflow.com/a/24437713/1529630 */
          if(base)
            head.removeChild(base);
          return abs;
        };
      })();
      var base = document.getElementById('base'),
          url = document.getElementById('url'),
          abs = document.getElementById('absolute');
      base.onpropertychange = url.onpropertychange = function() {
        if (event.propertyName == "value")
          update()
      };
      (base.oninput = url.oninput = update)();
      function update() {
        abs.value = resolveURL(url.value, base.value);
      }

      label {
        display: block;
        margin: 1em 0;
      }
      input {
        width: 100%;
      }

      <label>
        Base url:
        <input id="base" value="http://example.com/images//foo////bar/baz"
               placeholder="Enter your base url here" />
      </label>
      <label>
        URL to be resolved:
        <input id="url" value="./a/b/../c"
               placeholder="Enter your URL here">
      </label>
      <label>
        Resulting url:
        <input id="absolute" readonly>
      </label>
