web scraping to fill out (and retrieve) search forms?


Problem description



I was wondering if it is possible to "automate" the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOIs (digital object identifiers); to do this manually I would go to a journal article search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.), find the article in the list of returned results, pick out the DOI, and paste it into my reference list. I use R and Python for data analysis regularly (I was inspired by a post on RCurl) but don't know much about web protocols... is such a thing possible (for instance, using something like Python's BeautifulSoup)? Are there any good references for doing anything remotely similar to this task? I'm as interested in learning about web scraping and web scraping tools in general as in getting this particular task done... Thanks for your time!

Solution

Beautiful Soup is great for parsing webpages; that's half of what you want to do. Python, Perl, and Ruby all have a version of Mechanize, and that's the other half:

http://wwwsearch.sourceforge.net/mechanize/

Mechanize lets you control a browser:

import mechanize

browser = mechanize.Browser()
browser.open("http://pubs.acs.org/search/advanced")

# Follow a link
browser.follow_link(link_node)

# Submit a form
browser.select_form(name="search")
browser["authors"] = ["author #1", "author #2"]
browser["volume"] = "any"
search_response = browser.submit()
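The response Mechanize returns can then be handed to Beautiful Soup for the parsing half. A minimal sketch, using a hypothetical results fragment (the real pubs.acs.org markup will differ, so the tag and class names here are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical results fragment; inspect the real page for its actual markup.
html = """
<div class="result">
  <span class="title">Some Article</span>
  <span class="doi">10.1021/ja0000000</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Pull the DOI text out of its (assumed) container element
doi = soup.find("span", class_="doi").get_text(strip=True)
print(doi)
```

With the real page you would parse `search_response.read()` instead of a literal string.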

With Mechanize and Beautiful Soup you have a great start. One extra tool I'd consider is Firebug, as used in this quick Ruby scraping guide:

http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/

Firebug can speed up your construction of XPaths for parsing documents, saving you some serious time.
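For instance, an XPath copied out of Firebug can be applied directly in code. This sketch uses only the standard library's limited XPath support on a hypothetical, well-formed fragment; real pages are rarely valid XML, so in practice you would run them through a tolerant parser (Beautiful Soup or lxml) first:

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment; real result pages will differ.
fragment = '<div class="result"><span class="doi">10.1021/ja0000000</span></div>'

root = ET.fromstring(fragment)
# An XPath-style query like one Firebug might help you build
node = root.find(".//span[@class='doi']")
print(node.text)
```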

Good luck!

