JavaScript: Regex to change all relative URLs to absolute


Question

I'm currently creating a Node.js web scraper/proxy, but I'm having trouble handling the relative URLs found in the script sections of the page source. I figured a regex would do the trick, but I don't know how to go about it.

Is there any way I can do this?

I'm also open to an easier way of doing this, as I'm quite baffled by how other proxies parse websites. I figured that most are just glorified site scrapers that read a site's source and relay all links/forms back through the proxy.

Recommended answer

Advanced HTML string replacement functions

Note for the OP, who requested such a function: change base_url to your proxy's base URL to get the desired results.

Two functions are shown below (the usage guide is contained within the code). Don't skip any part of the explanation if you want to fully understand the functions' behaviour.


  • rel_to_abs(url) - This function returns absolute URLs. When an absolute URL with a commonly trusted protocol is passed, it is returned immediately. Otherwise, an absolute URL is generated from base_url and the function argument. Relative URLs are correctly parsed (../ ; ./ ; . ; //).
  • replace_all_rel_by_abs - This function parses all occurrences of URLs that have a significant meaning in HTML, such as CSS url(), links and external resources. See the code for a full list of parsed instances. See the answer "Parsing and sanitising HTML strings", listed at the end, for an adjusted implementation that sanitises HTML strings from an external source (to embed in the document).
  • Test case (at the bottom of the answer): To test the effectiveness of the function, simply paste the bookmarklet into the location bar.
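In Node.js there is no location object, so the base URL has to be passed in explicitly. As a point of comparison (not part of the original answer), the cases that rel_to_abs handles can be resolved with the built-in WHATWG URL class. The helper name resolveAgainst and the example.com base URL below are assumptions made for illustration.

```javascript
// Hypothetical helper: resolve `url` against an explicit base URL.
// Uses the standard URL constructor instead of the hand-written resolver.
function resolveAgainst(url, baseUrl) {
  // Mirror rel_to_abs: an empty or whitespace-only input yields nothing.
  if (/^\s*$/.test(url)) return "";
  return new URL(url, baseUrl).href;
}

const base = "http://example.com/sub/dir/page.html"; // assumed page URL

console.log(resolveAgainst("http://foo.bar/x", base));       // http://foo.bar/x
console.log(resolveAgainst("//cdn.example.com/a.js", base)); // http://cdn.example.com/a.js
console.log(resolveAgainst("/doo", base));                   // http://example.com/doo
console.log(resolveAgainst("./meh", base));                  // http://example.com/sub/dir/meh
console.log(resolveAgainst("../booh", base));                // http://example.com/sub/booh
```

Unlike rel_to_abs, this sketch does not filter protocols or escape quotes and angle brackets, so it is not a drop-in replacement where the result is written back into untrusted HTML.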



rel_to_abs - Parsing relative URLs

function rel_to_abs(url){
    /* Only accept commonly trusted protocols:
     * Only data-image URLs are accepted, Exotic flavours (escaped slash,
     * html-entitied characters) are not supported to keep the function fast */
    if(/^(https?|file|ftps?|mailto|javascript|data:image\/[^;]{2,9};):/i.test(url))
        return url; //Url is already absolute

    var base_url = location.href.match(/^(.+)\/?(?:#.+)?$/)[0]+"/";
    if(url.substring(0,2) == "//")
        return location.protocol + url;
    else if(url.charAt(0) == "/")
        return location.protocol + "//" + location.host + url;
    else if(url.substring(0,2) == "./")
        url = "." + url;
    else if(/^\s*$/.test(url))
        return ""; //Empty = Return nothing
    else url = "../" + url;

    url = base_url + url;
    /* Repeatedly strip "segment/../" pairs until no "/../" remains */
    while(/\/\.\.\//.test(url = url.replace(/[^\/]+\/+\.\.\//g,"")));

    /* Escape certain characters to prevent XSS */
    url = url.replace(/\.$/,"").replace(/\/\./g,"").replace(/"/g,"%22")
            .replace(/'/g,"%27").replace(/</g,"%3C").replace(/>/g,"%3E");
    return url;
}

Cases / examples:

  • http://foo.bar - Already an absolute URL, thus returned immediately.
  • /doo - Relative to the root: returns the current root + the provided relative URL.
  • ./meh - Relative to the current directory.
  • ../booh - Relative to the parent directory.

Because base_url is the current page URL with a trailing slash, the function first rewrites every relative path into a ../ form to step out of the page's own segment, and then performs a search-and-replace (http://domain/sub/anything-but-a-slash/../me becomes http://domain/sub/me).
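The search-and-replace step can be looked at in isolation. The sketch below reuses the two regular expressions from rel_to_abs. The function name collapseParentSegments and the extra "no progress" guard are additions made for illustration, not part of the original code.

```javascript
// Hypothetical standalone version of the "../" collapsing loop in rel_to_abs.
function collapseParentSegments(url) {
  const step = /[^\/]+\/+\.\.\//g; // one path segment followed by "../"
  let previous;
  do {
    previous = url;
    url = url.replace(step, "");
    // Stop when no "/../" remains, or when a pass changed nothing
    // (added guard, so malformed input cannot loop forever).
  } while (url !== previous && /\/\.\.\//.test(url));
  return url;
}

console.log(collapseParentSegments("http://domain/sub/anything-but-a-slash/../me"));
// http://domain/sub/me
console.log(collapseParentSegments("http://d/a/b/../../c"));
// http://d/c
```

The second call needs two passes: the first removes "b/../", which leaves "a/../" for the next pass to remove.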



replace_all_rel_by_abs - Convert all relevant occurrences of URLs
URLs inside script instances (<script>, event handlers) are not replaced, because it's near-impossible to create a fast and secure filter to parse JavaScript.

The script below includes some comments. The regular expressions are created dynamically, because an individual RE can be as long as 3000 characters. <meta http-equiv=refresh content=..> can be obfuscated in various ways, hence the size of the RE.
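To illustrate why these patterns are generated rather than typed out, here is a reduced sketch of the idea behind the ae helper: each character is expanded into an alternation of the literal character and its decimal and hexadecimal HTML entities. The names charPattern, entityPattern and escapeRegExp are hypothetical, and the sketch omits the caching and special cases of the real function.

```javascript
// Illustrative re-statement of the entity-pattern idea; names are hypothetical.
const entityEnd = "(?:;|(?!\\d))"; // entity ends with ";" or is not followed by a digit

function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build an alternation for one character: literal, &#NNN; and &#xHH; forms,
// for both the lowercase and the uppercase variant.
function charPattern(ch) {
  const alternatives = [];
  for (const variant of new Set([ch.toLowerCase(), ch.toUpperCase()])) {
    const code = variant.charCodeAt(0);
    alternatives.push(escapeRegExp(variant));
    alternatives.push("&#0*" + code + entityEnd);
    alternatives.push("&#x0*" + code.toString(16) + entityEnd);
  }
  return "(?:" + alternatives.join("|") + ")";
}

function entityPattern(word) {
  return Array.from(word, charPattern).join("");
}

const urlRE = new RegExp("^" + entityPattern("url") + "$", "i");
console.log(urlRE.test("URL"));           // true
console.log(urlRE.test("&#117;r&#x6C;")); // true
console.log(urlRE.test("uri"));           // false
```

Even for the three-letter word "url" the generated pattern is well over a hundred characters long, which gives a sense of how the full expressions reach thousands.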

function replace_all_rel_by_abs(html){
    /* HTML/XML attributes may not be prefixed by these characters (common
       attribute chars).  This list is not complete, but will be sufficient
       for this function (see http://www.w3.org/TR/REC-xml/#NT-NameChar). */
    var att = "[^-a-z0-9:._]";

    var entityEnd = "(?:;|(?!\\d))";
    var ents = {" ":"(?:\\s|&nbsp;?|&#0*32"+entityEnd+"|&#x0*20"+entityEnd+")",
                "(":"(?:\\(|&#0*40"+entityEnd+"|&#x0*28"+entityEnd+")",
                ")":"(?:\\)|&#0*41"+entityEnd+"|&#x0*29"+entityEnd+")",
                ".":"(?:\\.|&#0*46"+entityEnd+"|&#x0*2e"+entityEnd+")"};
                /* Placeholders to filter obfuscations */
    var charMap = {};
    var s = ents[" "]+"*"; //Short-hand for common use
    var any = "(?:[^>\"']*(?:\"[^\"]*\"|'[^']*'))*?[^>]*";
    /* ^ Important: Must be pre- and postfixed by < and >.
     *   This RE should match anything within a tag!  */

    /*
      @name ae
      @description  Converts a given string in a sequence of the original
                      input and the HTML entity
      @param String string  String to convert
      */
    function ae(string){
        var all_chars_lowercase = string.toLowerCase();
        if(ents[string]) return ents[string];
        var all_chars_uppercase = string.toUpperCase();
        var RE_res = "";
        for(var i=0; i<string.length; i++){
            var char_lowercase = all_chars_lowercase.charAt(i);
            if(charMap[char_lowercase]){
                RE_res += charMap[char_lowercase];
                continue;
            }
            var char_uppercase = all_chars_uppercase.charAt(i);
            var RE_sub = [char_lowercase];
            RE_sub.push("&#0*" + char_lowercase.charCodeAt(0) + entityEnd);
            RE_sub.push("&#x0*" + char_lowercase.charCodeAt(0).toString(16) + entityEnd);
            if(char_lowercase != char_uppercase){
                /* Note: RE ignorecase flag has already been activated */
                RE_sub.push("&#0*" + char_uppercase.charCodeAt(0) + entityEnd);   
                RE_sub.push("&#x0*" + char_uppercase.charCodeAt(0).toString(16) + entityEnd);
            }
            RE_sub = "(?:" + RE_sub.join("|") + ")";
            RE_res += (charMap[char_lowercase] = RE_sub);
        }
        return(ents[string] = RE_res);
    }

    /*
      @name by
      @description  2nd argument for replace().
      */
    function by(match, group1, group2, group3){
        /* Note that this function can also be used to remove links:
         * return group1 + "javascript://" + group3; */
        return group1 + rel_to_abs(group2) + group3;
    }
    /*
      @name by2
      @description  2nd argument for replace(). Parses relevant HTML entities
      */
    var slashRE = new RegExp(ae("/"), 'g');
    var dotRE = new RegExp(ae("."), 'g');
    function by2(match, group1, group2, group3){
        /*Note that this function can also be used to remove links:
         * return group1 + "javascript://" + group3; */
        group2 = group2.replace(slashRE, "/").replace(dotRE, ".");
        return group1 + rel_to_abs(group2) + group3;
    }
    /*
      @name cr
      @description            Selects a HTML element and performs a
                                search-and-replace on attributes
      @param String selector  HTML substring to match
      @param String attribute RegExp-escaped; HTML element attribute to match
      @param String marker    Optional RegExp-escaped; marks the prefix
      @param String delimiter Optional RegExp escaped; non-quote delimiters
      @param String end       Optional RegExp-escaped; forces the match to end
                              before an occurrence of <end>
     */
    function cr(selector, attribute, marker, delimiter, end){
        if(typeof selector == "string") selector = new RegExp(selector, "gi");
        attribute = att + attribute;
        marker = typeof marker == "string" ? marker : "\\s*=\\s*";
        delimiter = typeof delimiter == "string" ? delimiter : "";
        end = typeof end == "string" ? "?)("+end : ")(";
        var re1 = new RegExp('('+attribute+marker+'")([^"'+delimiter+']+'+end+')', 'gi');
        var re2 = new RegExp("("+attribute+marker+"')([^'"+delimiter+"]+"+end+")", 'gi');
        var re3 = new RegExp('('+attribute+marker+')([^"\'][^\\s>'+delimiter+']*'+end+')', 'gi');
        html = html.replace(selector, function(match){
            return match.replace(re1, by).replace(re2, by).replace(re3, by);
        });
    }
    /* 
      @name cri
      @description            Selects an attribute of a HTML element, and
                                performs a search-and-replace on certain values
      @param String selector  HTML element to match
      @param String attribute RegExp-escaped; HTML element attribute to match
      @param String front     RegExp-escaped; attribute value, prefix to match
      @param String flags     Optional RegExp flags, default "gi"
      @param String delimiter Optional RegExp-escaped; non-quote delimiters
      @param String end       Optional RegExp-escaped; forces the match to end
                                before an occurrence of <end>
     */
    function cri(selector, attribute, front, flags, delimiter, end){
        if(typeof selector == "string") selector = new RegExp(selector, "gi");
        attribute = att + attribute;
        flags = typeof flags == "string" ? flags : "gi";
        var re1 = new RegExp('('+attribute+'\\s*=\\s*")([^"]*)', 'gi');
        var re2 = new RegExp("("+attribute+"\\s*=\\s*')([^']+)", 'gi');
        var at1 = new RegExp('('+front+')([^"]+)(")', flags);
        var at2 = new RegExp("("+front+")([^']+)(')", flags);
        if(typeof delimiter == "string"){
            end = typeof end == "string" ? end : "";
            var at3 = new RegExp("("+front+")([^\"'][^"+delimiter+"]*" + (end?"?)("+end+")":")()"), flags);
            var handleAttr = function(match, g1, g2){return g1+g2.replace(at1, by2).replace(at2, by2).replace(at3, by2)};
        } else {
            var handleAttr = function(match, g1, g2){return g1+g2.replace(at1, by2).replace(at2, by2)};
        }
        html = html.replace(selector, function(match){
             return match.replace(re1, handleAttr).replace(re2, handleAttr);
        });
    }

    /* <meta http-equiv=refresh content="  ; url= " > */
    cri("<meta"+any+att+"http-equiv\\s*=\\s*(?:\""+ae("refresh")+"\""+any+">|'"+ae("refresh")+"'"+any+">|"+ae("refresh")+"(?:"+ae(" ")+any+">|>))", "content", ae("url")+s+ae("=")+s, "i");

    cr("<"+any+att+"href\\s*="+any+">", "href"); /* Linked elements */
    cr("<"+any+att+"src\\s*="+any+">", "src"); /* Embedded elements */

    cr("<object"+any+att+"data\\s*="+any+">", "data"); /* <object data= > */
    cr("<applet"+any+att+"codebase\\s*="+any+">", "codebase"); /* <applet codebase= > */

    /* <param name=movie value= >*/
    cr("<param"+any+att+"name\\s*=\\s*(?:\""+ae("movie")+"\""+any+">|'"+ae("movie")+"'"+any+">|"+ae("movie")+"(?:"+ae(" ")+any+">|>))", "value");

    cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "url", "\\s*\\(\\s*", "", "\\s*\\)"); /* <style> */
    cri("<"+any+att+"style\\s*="+any+">", "style", ae("url")+s+ae("(")+s, 0, s+ae(")"), ae(")")); /*< style=" url(...) " > */
    return html;
}

A short summary of the functions and their private helpers:

  • rel_to_abs(url) - Converts relative / unknown URLs to absolute URLs.
  • replace_all_rel_by_abs(html) - Replaces all relevant occurrences of URLs within a string of HTML with absolute URLs.

  1. ae - Any Entity - Returns an RE pattern to deal with HTML entities.
  2. by - replace by - This short function requests the actual URL replacement (rel_to_abs). It may be called hundreds, if not thousands, of times, so be careful not to add a slow algorithm to it when customising.
  3. cr - Create Replace - Creates and executes a search-and-replace.
    Example: href="..." (within any HTML tag).
  4. cri - Create Replace Inline - Creates and executes a search-and-replace.
    Example: url(..) within the style attribute of any HTML tag.
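The two-stage approach behind cr and by (first select whole tags, then rewrite attribute values inside each tag through a replace callback) can be sketched in a much reduced form. The version below is an assumption-laden simplification: the function name absolutizeLinks is hypothetical, only double-quoted href and src values are handled, tags are matched with a naive <[^>]+> pattern, and the standard URL class stands in for rel_to_abs.

```javascript
// Hypothetical simplification of the cr/by pair: double-quoted href and src only.
function absolutizeLinks(html, baseUrl) {
  // Group 1: attribute name, "=" and opening quote (the attribute must not be
  // preceded by a name character, so data-href is skipped); group 2: the value;
  // group 3: the closing quote.
  const attrRE = /([^-a-z0-9:._](?:href|src)\s*=\s*")([^"]+)(")/gi;
  // Stage 1: select each tag. Stage 2: rewrite attributes inside that tag only.
  return html.replace(/<[^>]+>/g, function (tag) {
    return tag.replace(attrRE, function (match, prefix, value, suffix) {
      try {
        return prefix + new URL(value, baseUrl).href + suffix;
      } catch (err) {
        return match; // leave values that cannot be parsed untouched
      }
    });
  });
}

const page = '<a href="../x.html">x</a><img src="/i.png"> href="y"';
console.log(absolutizeLinks(page, "http://example.com/a/b/c.html"));
// <a href="http://example.com/a/x.html">x</a><img src="http://example.com/i.png"> href="y"
```

The trailing href="y" is left alone because it sits outside any tag. This is the reason for selecting tags first instead of running the attribute pattern over the whole document.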


Open any page, and paste the following bookmarklet in the location bar:

javascript:void(function(){var s=document.createElement("script");s.src="http://rob.lekensteyn.nl/rel_to_abs.js";document.body.appendChild(s)})();

The injected code contains the two functions, as defined above, plus the test case shown below. Note: the test case does not modify the HTML of the page; it optionally shows the parsed result in a textarea.

var t=(new Date).getTime();
var result = replace_all_rel_by_abs(document.documentElement.innerHTML);
if(confirm((new Date).getTime()-t+" milliseconds to execute\n\nPut results in new textarea?")){
    var txt = document.createElement("textarea");
    txt.style.cssText = "position:fixed;top:0;left:0;width:100%;height:99%";
    txt.ondblclick = function(){this.parentNode.removeChild(this)};
    txt.value = result;
    document.body.appendChild(txt);
}

See also:

  • Answer: Parsing and sanitising HTML strings
