Parsing HTML and following a JavaScript link
Question
An academic colleague has asked me to extract information from a website. I need to link the content of a table on a web page (not too hard) with the contents of a text file that is only reachable, as far as I can tell, by clicking on a JavaScript link, e.g.
<a id="tk1" href="javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')">
The data conveniently sits inside a table with id='tk1', which is nice, but how do I follow the link that pulls the text file?
Ideally I'd like to do this in R. I can grab the relevant table in text format with
library(XML)
u <- "..."  # the URL of interest
tables <- readHTMLTable(u)
# Keep the table(s) whose name contains 'tk1'
interestingTable <- tables[grep('tk1', names(tables))]
This gives the text in the table, but how do I grab the HTML for that particular table? And how do I "click" the button and get the text file behind it?
I note that there is a form with massive hidden values. The site appears to be ASP.NET driven and uses impenetrable URLs.
Many thanks!
Recommended answer
This is somewhat tricky and not fully integrated in R, but some fiddling with system() will get you started.
- Download and install PhantomJS: http://code.google.com/p/phantomjs/
- Check the short page at http://menne-biomed.de/uni/JavaButton.html, which emulates your case. When you click the JavaScript anchor, it redirects to http://cran.at.r-project.org/ via doPostBack(inaccesibleJavascriptVar).
- Save the following script locally as javabutton.js
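// Load the test page in PhantomJS and resolve the target of its javascript: link.
var page = new WebPage();
page.open('http://www.menne-biomed.de/uni/JavaButton.html', function (status) {
  if (status !== 'success') {
    console.log('Unable to access network');
  } else {
    // This function runs inside the page, so the page's own JavaScript variables are visible.
    var ua = page.evaluate(function () {
      var t = document.getElementById('tk1').href;
      // Capture the argument expression between the parentheses.
      // A regex literal is used because '\(' inside a string loses its backslash.
      var re = /\((.*)\)/;
      // Evaluate the captured expression in page context to obtain the URL.
      return eval(re.exec(t)[1]);
    });
    console.log(ua); // Outputs http://cran.at.r-project.org/
  }
  phantom.exit();
});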
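The pattern-matching step of the script can be tried outside PhantomJS. The following is only a sketch, assuming a plain JavaScript runtime such as Node.js; `parseJavascriptHref` is a hypothetical helper name, not part of PhantomJS or any library. It splits a `javascript:` href into the function name and the raw argument text, using the test page's link and the link quoted in the question as inputs. Note that when such a pattern is passed to `new RegExp()` as a string, each backslash has to be doubled.

```javascript
// Hypothetical helper: split a "javascript:fn(args)" href into the
// function name and the raw, unevaluated argument text.
function parseJavascriptHref(href) {
  // Regex literal: \( and \) match literal parentheses.
  // The string form would be new RegExp('\\((.*)\\)') with doubled backslashes.
  var match = /^javascript:\s*([\w$]+)\((.*)\)\s*;?\s*$/.exec(href);
  if (match === null) {
    return null; // not a simple javascript:fn(args) link
  }
  return { fn: match[1], args: match[2] };
}

// Link from the JavaButton.html test page.
var testPage = parseJavascriptHref('javascript:doPostBack(inaccesibleJavascriptVar)');
console.log(testPage.fn, testPage.args);

// Link quoted in the question (ASP.NET style).
var aspNet = parseJavascriptHref(
  "javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')");
console.log(aspNet.fn, aspNet.args);
```

On the test page the argument is a variable name, which is why the PhantomJS script has to evaluate it inside the page. In the link from the question the arguments are string literals and can be read directly.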
With phantomjs on the path, call
phantomjs javabutton.js
The link will be printed to the console. Use any method to pass it on to RCurl.
Not elegant, but maybe someone will wrap phantomjs into R one day. In case the link to JavaButton.html is ever lost, here is its code.
<!DOCTYPE html>
<html>
<head>
<script>
inaccesibleJavascriptVar = 'http://' + 'cran.at.r-project.org/';
function doPostBack(myref)
{
window.location.href= myref;
return false;
}
</script>
</head>
<body>
<a id="tk1" href="javascript:doPostBack(inaccesibleJavascriptVar)" >Click here</a>
</body>
</html>
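The test page redirects directly, but the link in the question calls `__doPostBack` with a control name and an empty argument. On ASP.NET Web Forms pages this function typically stores its two arguments in the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and submits the form together with the other hidden values the question mentions. The sketch below only builds such a request body. It assumes the hidden field names and values have already been read from the page; `buildPostBackBody` is a hypothetical helper and the field values are placeholders, and whether the server answers such a POST with the text file is untested.

```javascript
// Hypothetical helper: assemble the form body that a __doPostBack(target, argument)
// call would typically submit on an ASP.NET Web Forms page.
function buildPostBackBody(hiddenFields, eventTarget, eventArgument) {
  var fields = {};
  // Copy the hidden inputs scraped from the page (e.g. __VIEWSTATE).
  Object.keys(hiddenFields).forEach(function (name) {
    fields[name] = hiddenFields[name];
  });
  // These two fields tell the server which control triggered the postback.
  fields.__EVENTTARGET = eventTarget;
  fields.__EVENTARGUMENT = eventArgument;
  // Percent-encode names and values and join them as name=value pairs.
  return Object.keys(fields).map(function (name) {
    return encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]);
  }).join('&');
}

// Placeholder values: the real ones must come from the page's hidden <input> elements.
var hidden = {
  __VIEWSTATE: 'PLACEHOLDER_VIEWSTATE',
  __EVENTVALIDATION: 'PLACEHOLDER_VALIDATION'
};
var body = buildPostBackBody(hidden, 'tk1$ContentPlaceHolder1$grid$tk$OpenFileButton', '');
console.log(body);
```

A body built this way would be sent as an `application/x-www-form-urlencoded` POST to the form's action URL, for example from RCurl. If the page depends on further client-side state, the PhantomJS route above remains the safer option.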