In R, how to parse a specific frame within a webpage?


Problem description

Greetings all,

Is there a way to only read the HTML code from a specific frame within a webpage?

For example, if I submit a url to google translate, is there a way to parse only the translated page frame? Whenever I try, I can only access the top frame on the page but not the translated frame. Here is my self-contained sample code:

library(XML)
url <- "http://www.baidu.com/s?wd=r+project"
url.google.translate <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
htmlTreeParse(url.google.translate, useInternalNodes = FALSE)

The above code refers to this url:

$file
[1] "http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=http://www.baidu.com/s?wd=r+project"

The output, however, only accesses the top frame of the page and not the main frame, which is the one I am interested in.

Hope that made sense and thanks in advance for any help.

Tony

UPDATE - Thanks to @kwantam's answer below (accepted), I was able to use it to arrive at the following solution (self-contained):

> # Load R packages
> library(RCurl)
> library(XML)
> 
> # STAGE 1 - find forward url in relevant frame
> ( url <- "http://www.baidu.com/s?wd=r+project" )
[1] "http://www.baidu.com/s?wd=r+project"
> gt.url <- URLencode(paste("http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=", url, sep=""))
> gt.doc <- getURL(gt.url)
> gt.html <- htmlTreeParse(gt.doc, useInternalNodes = TRUE, error=function(...){})
> nodes <- getNodeSet(gt.html, '//frameset//frame[@name="c"]')
> gt.parameters <- sapply(nodes, function(x) x <- xmlAttrs(x)[[1]])
> gt.url <- paste("http://translate.google.com", gt.parameters, sep = "")
> 
> # STAGE 2 - find forward url to translated page
> doc <- getURL(gt.url, followlocation = TRUE)
> html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
> url.trans <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
> url.trans <- strsplit(url.trans, "URL=", fixed = TRUE)[[1]][2]
> url.trans <- gsub("\"/>", "", url.trans, fixed = TRUE)
> url.trans <- xmlValue(getNodeSet(htmlParse(url.trans, asText = TRUE), "//p")[[1]])
> 
> # STAGE 3 - load translated page
> url.trans
[1] "http://translate.googleusercontent.com/translate_c?hl=en&ie=UTF-8&sl=zh-CN&tl=en&u=http://www.baidu.com/s%3Fwd%3Dr%2520project&prev=_t&rurl=translate.google.com&usg=ALkJrhiCMu1mKv-czCmEaB7PO925TJCa-A "
> #getURL(url.trans)

If anyone knows of a simpler solution to what I've given above then please feel free to let me know! :)

Recommended answer

Most of the following answer is for the particular case of google translate. In most cases, you'll just need to parse the <frameset> and pull out whichever frame you're looking for, though it might not be immediately obvious which is the main one from the HTML (perhaps look at the relative sizing of the frames).
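
For the general case, here is a minimal sketch of listing the frames in a `<frameset>` and picking one by name. It assumes the XML package already used in the question. The inlined page, the frame names and the `frame.info` / `main.src` variables are made up for illustration; a real page would be fetched first.

```r
library(XML)

# Hypothetical frameset page, inlined so the sketch runs without network access.
page <- '<html><head><title>demo</title></head>
<frameset rows="60,*">
  <frame name="top" src="/top.html">
  <frame name="c" src="/translate_p?hl=en&amp;sl=zh-CN&amp;tl=en">
</frameset></html>'

doc <- htmlParse(page, asText = TRUE)

# Collect every <frame> and tabulate its name and src attributes.
frames <- getNodeSet(doc, "//frame")
frame.info <- data.frame(
  name = sapply(frames, xmlGetAttr, "name"),
  src  = sapply(frames, xmlGetAttr, "src"),
  stringsAsFactors = FALSE
)

# Pick the frame of interest by name rather than by attribute position.
main.src <- frame.info$src[frame.info$name == "c"]
```

Printing `frame.info` first is a quick way to see which frames exist before deciding which one is the main one. Looking attributes up by name should also be a little more robust than relying on their order, as `xmlAttrs(x)[[1]]` does.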

It looks like you're going to have to follow a few refreshes to get the actual content. In particular, when you grab the URL you just mentioned, you'll see something like

  *snip*
<noframes>
<script>
<!--document.location="/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf";-->
</script>
<a href="/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf">Translate
</a>
</noframes>
  *snip*

If you follow the link here (remember to unescape '&' first), it'll give you another small HTML fragment which includes

<meta http-equiv="refresh" content="0;URL=http://translate.googleusercontent.com/translate_c?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;rurl=translate.google.com&amp;usg=asdf">
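
The unescaping step for the `<noframes>` link could look roughly like this in base R. This is only a sketch: `unescape_amp` is a hypothetical helper name, and `usg=asdf` is the placeholder token from the snippet above rather than a working value.

```r
# href as it appears in the raw <noframes> markup quoted above
# ("usg=asdf" is a placeholder, not a real token).
href <- "/translate_p?hl=en&amp;ie=UTF-8&amp;sl=zh-CN&amp;tl=en&amp;u=http://www.baidu.com/s%3Fwd%3Dr%2520project&amp;prev=_t&amp;usg=asdf"

# Hypothetical helper: turn the HTML entity &amp; back into a literal &.
unescape_amp <- function(x) gsub("&amp;", "&", x, fixed = TRUE)

# The link is relative, so prepend the host before requesting it.
next.url <- paste0("http://translate.google.com", unescape_amp(href))
```

This matters mainly when the URL is cut out of raw text. As far as I know, attribute values read through an HTML parser are already entity-decoded.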

Again, unescaping the '&' and then following the refresh, you'll have the translated page that you're looking for.
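
One possible way to pull the refresh target out of such a fragment is to read the `content` attribute directly and strip the leading delay. This is a sketch assuming the XML package; the fragment below is a shortened, made-up version of the meta tag quoted above.

```r
library(XML)

# Fragment modelled on the meta refresh quoted above (query string shortened,
# "usg=asdf" is a placeholder).
fragment <- '<html><head><meta http-equiv="refresh" content="0;URL=http://translate.googleusercontent.com/translate_c?hl=en&amp;sl=zh-CN&amp;tl=en&amp;usg=asdf"></head><body></body></html>'

doc  <- htmlParse(fragment, asText = TRUE)
meta <- getNodeSet(doc, '//meta[@http-equiv="refresh"]')[[1]]

# The attribute value has the form "<delay>;URL=<target>".
content <- xmlGetAttr(meta, "content")

# Drop everything up to and including "URL=" to leave the target address.
url.trans <- sub("^[^;]*;[[:space:]]*URL=", "", content, ignore.case = TRUE)
```

Reading the attribute this way avoids the `capture.output` / `strsplit` round trip, and the parser should already have turned `&amp;` back into `&`. The result could then be passed to `getURL()` to load the translated page.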

Play with this in wget or curl and it should become more clear what you're going to need to do.
