Select elements added to the DOM by a script

Question

I've been trying to get either an <object> or an <embed> tag using:

HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");

This doesn't seem to work.

Can anyone please tell me how to get these tags and their InnerHtml?

A YouTube embedded video looks like this:

    <embed height="385" width="640" type="application/x-shockwave-flash"
        src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
        allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">

I got a feeling the JavaScript might stop the swf player from working, hope not...

Cheers

Recommended answer

Update 2010-08-26 (in response to the OP's comment)

I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:

string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";

Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.

Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.

In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
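
To make the escaping point concrete, here is a minimal sketch that uses only the .NET base class library. The `escaped` sample is a made-up fragment shaped like the JavaScript string described above, not actual YouTube markup, and the names `JsUnescapeDemo` and `StripBackslashes` are purely illustrative. It assumes only simple escapes such as `\/` and `\"` occur, which is the same assumption the `Unescape` method in the answer's code relies on.

```csharp
using System;
using System.Text.RegularExpressions;

static class JsUnescapeDemo
{
    // Drops the backslash from every backslash-escaped character.
    // Assumption: only simple escapes such as \/ and \" are present;
    // sequences like \n or \u0041 would be mangled, not decoded.
    public static string StripBackslashes(string escapedHtml)
    {
        return Regex.Replace(escapedHtml, @"\\(.)", "$1");
    }

    static void Main()
    {
        // Hypothetical sample, written as it would appear inside a JavaScript string literal.
        string escaped = "<embed id=\\\"movie_player\\\" src=\\\"http:\\/\\/example.com\\/player.swf\\\">";

        Console.WriteLine(escaped);                   // still escaped, not usable as HTML
        Console.WriteLine(StripBackslashes(escaped)); // backslashes removed
    }
}
```

Until the backslashes are removed, an HTML parser would see attribute values such as `\"movie_player\"` rather than `movie_player`, which is why a lookup by id on the raw text has little chance of matching.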

I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.

OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.

using System;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
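        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        // SelectNodes returns null when the page has no <script> elements.
        if (scriptNodes == null)
        {
            return null;
        }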

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int objectNodeLocation = javascript.IndexOf("<object");

            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);

                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");

                if (objectNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var objectDoc = new HtmlDocument();

                    objectDoc.LoadHtml(unescaped);

                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");

                    return objectNode;
                }
            }
        }

        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
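        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        // SelectNodes returns null when the page has no <script> elements.
        if (scriptNodes == null)
        {
            return null;
        }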

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");

            if (approxEmbedNodeLocation != -1)
            {
                // Skip the 15 characters of <\/object>" : " so the slice starts at <embed.
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);

                int embedNodeEndLocation = htmlStart.IndexOf(">\";");

                if (embedNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var embedDoc = new HtmlDocument();

                    embedDoc.LoadHtml(unescaped);

                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");

                    return videoEmbedNode;
                }
            }
        }

        return null;
    }

    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();

        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }

        HtmlNode root = doc.DocumentNode;
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");

        return scriptNodes;
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.

        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();

        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }

        return text;
    }
}

And in case you're interested, here's a little demo I threw together (super fancy, I know):

class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();

        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();

        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}


Original Answer

Why not try using the element's Id instead?

HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");

Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.
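
As a rough illustration of working on the script text directly, the sketch below locates an `<embed` fragment inside a JavaScript string using plain string searches. It depends only on the base class library. The sample script, the `>";` end marker, and the names `FragmentSliceDemo` and `SliceEmbed` are assumptions made for the example, not details taken from any real page.

```csharp
using System;

static class FragmentSliceDemo
{
    // Returns the text from the first "<embed" up to and including the '>'
    // that precedes the closing quote of the JavaScript string, or null if
    // either marker is missing.
    // Assumption: the string literal ends with >"; as in the sample below.
    public static string SliceEmbed(string javascript)
    {
        int start = javascript.IndexOf("<embed", StringComparison.Ordinal);
        if (start == -1) return null;

        int end = javascript.IndexOf(">\";", start, StringComparison.Ordinal);
        if (end == -1) return null;

        return javascript.Substring(start, end - start + 1);
    }

    static void Main()
    {
        // Hypothetical script text containing escaped markup.
        string javascript = "var swf = \"<embed id=\\\"movie_player\\\">\";";

        Console.WriteLine(SliceEmbed(javascript) ?? "(not found)");
    }
}
```

The slice is still backslash-escaped at this point, so it would need unescaping before being handed to an HTML parser.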
