Parsing HTML and following a JavaScript link


Question

An academic colleague has asked me to extract information from a website. I need to link the content of a table on a webpage (not too hard) with the contents of a text file that is only reachable (as far as I can tell) by clicking on a JavaScript link, e.g.

<a id="tk1" href="javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')">

The content is conveniently inside a table with id='tk1', which is nice, but how do I follow the link that pulls the text file?

Ideally I'd like to do this in R. I can grab the relevant table in text format with

u <- "..."  # the URL of interest
library(XML)
tables <- readHTMLTable(u)
interestingTable <- tables[grep('tk1', names(tables))]

This gives the text in the table, but how do I grab the HTML for that particular table? And how do I "click" on the button and get the text file behind it?

I note that there is a form with massive hidden values. The site appears to be ASP.NET driven and uses impenetrable URLs.

Many thanks!

Recommended answer

This is somewhat tricky and not fully integrated in R, but some system() fiddling will get you started.

  • Download and install PhantomJS: http://code.google.com/p/phantomjs/
  • Check the short script at http://menne-biomed.de/uni/JavaButton.html, which emulates your case. When you click the JavaScript anchor, it redirects to http://cran.at.r-project.org/ via doPostBack(inaccesibleJavascriptVar).
  • Save the following script locally as javabutton.js


var page = new WebPage();
page.open('http://www.menne-biomed.de/uni/JavaButton.html', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            var t =  document.getElementById('tk1').href;
            // Regex literal: inside a quoted string the backslashes would be dropped
            var re = /\((.*)\)/;
            return eval(re.exec(t)[1]);
        });
        console.log(ua); // Outputs http://cran.at.r-project.org/
    }
    phantom.exit();
});

  • With phantomjs on the path, call

phantomjs javabutton.js

The link will be displayed on the console. Use any method to get it into RCurl.
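The script relies on a regular expression to pull the argument out of the anchor's href. Below is a minimal sketch of that extraction step, runnable in Node.js without PhantomJS. The helper name `extractArgument` is hypothetical and not part of the original script. It also shows why the pattern is safer as a regex literal: inside a quoted string, `\(` collapses to `(`, so the parentheses become a capture group instead of literal characters.

```javascript
// Hypothetical helper: pull the argument text out of a javascript: href.
function extractArgument(href) {
    // Regex literal, so \( and \) match literal parentheses.
    var match = /\((.*)\)/.exec(href);
    return match === null ? null : match[1];
}

var href = 'javascript:doPostBack(inaccesibleJavascriptVar)';
console.log(extractArgument(href)); // inaccesibleJavascriptVar

// For comparison: in a string literal the backslashes are dropped,
// so this pattern is ((.*)) and captures the whole href.
var loose = new RegExp('\((.*)\)');
console.log(loose.exec(href)[1] === href); // true
```

Inside page.evaluate the captured name is then passed to eval, which resolves it against the page's own variables.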

Not elegant, but maybe someone will wrap PhantomJS into R one day. In case the link to JavaButton.html is lost, here is its code.

<!DOCTYPE html>
<html>
<head>
<script>
inaccesibleJavascriptVar = 'http://' + 'cran.at.r-project.org/';
function doPostBack(myref) {
    window.location.href = myref;
    return false;
}
</script>
</head>
<body>
<a id="tk1" href="javascript:doPostBack(inaccesibleJavascriptVar)" >Click here</a>
</body>
</html>
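The link in the question calls ASP.NET's __doPostBack rather than a simple redirect. In stock ASP.NET Web Forms pages that function copies its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form, so the same request can in principle be rebuilt by hand. The sketch below assumes the site uses that stock behaviour, which the question does not confirm. The helper names `parseDoPostBack` and `buildPostBody` are hypothetical, and the hidden-field value is a placeholder for whatever the real page supplies.

```javascript
// Hypothetical helper: split a __doPostBack href into its two arguments.
function parseDoPostBack(href) {
    var match = /__doPostBack\('([^']*)','([^']*)'\)/.exec(href);
    if (match === null) {
        return null;
    }
    return { eventTarget: match[1], eventArgument: match[2] };
}

// Hypothetical helper: build the POST body from the page's hidden fields
// plus the two values that __doPostBack would have filled in.
function buildPostBody(hiddenFields, postBack) {
    var params = new URLSearchParams();
    Object.keys(hiddenFields).forEach(function (name) {
        params.set(name, hiddenFields[name]);
    });
    params.set('__EVENTTARGET', postBack.eventTarget);
    params.set('__EVENTARGUMENT', postBack.eventArgument);
    return params.toString();
}

var href = "javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')";
var postBack = parseDoPostBack(href);
console.log(postBack.eventTarget);
// The hidden-field value is a placeholder; a real page supplies its own.
console.log(buildPostBody({ __VIEWSTATE: 'PLACEHOLDER' }, postBack));
```

The resulting string would be sent as a form-encoded POST to the page's own URL, together with any cookies the site requires. Whether the server then returns the text file directly depends on the site.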
