web scraping to fill out (and retrieve) search forms?


Question


I was wondering if it is possible to automate the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOIs (digital object identifiers); manually I would go to a journal's article search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.), find the article in the list of returned results, pick out the DOI, and paste it into my reference list. I use R and Python for data analysis regularly (I was inspired by a post on RCurl) but don't know much about web protocols... is such a thing possible (for instance, using something like Python's BeautifulSoup)? Are there any good references for doing anything remotely similar to this task? I'm just as interested in learning about web scraping and web-scraping tools in general as in getting this particular task done... Thanks for your time!

Answer


Beautiful Soup is great for parsing webpages; that's half of what you want to do. Python, Perl, and Ruby all have a version of Mechanize, and that's the other half:

http://wwwsearch.sourceforge.net/mechanize/


Mechanize lets you control a browser:

import mechanize

br = mechanize.Browser()
br.open("http://pubs.acs.org/search/advanced")

# Follow a link (link_node is a mechanize Link object,
# e.g. one returned by br.links())
br.follow_link(link_node)

# Fill in and submit a form (the form/control names here are examples;
# check the page's HTML for the real ones)
br.select_form(name="search")
br["authors"] = ["author #1", "author #2"]
br["volume"] = "any"
search_response = br.submit()
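Once Mechanize hands back the results page, Beautiful Soup handles the parsing half. A minimal sketch, assuming a hypothetical results page where each hit is a `div.result` whose link `href` carries the DOI (the real pubs.acs.org markup will differ; inspect it first):

```python
from bs4 import BeautifulSoup

# Stand-in for search_response.read() -- hypothetical results markup
html = """
<div class="result">
  <a href="http://dx.doi.org/10.1021/ja01577a030">First article</a>
</div>
<div class="result">
  <a href="http://dx.doi.org/10.1021/ja01626a063">Second article</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Grab every result link and strip the DOI out of its href
dois = [a["href"].split("dx.doi.org/")[1]
        for a in soup.select("div.result a[href]")]
print(dois)  # ['10.1021/ja01577a030', '10.1021/ja01626a063']
```

In a real run you would pass `search_response.read()` to `BeautifulSoup` instead of the canned `html` string.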


With Mechanize and Beautiful Soup you have a great start. One extra tool I'd consider is Firebug, as used in this quick Ruby scraping guide:

http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/


Firebug can speed up your construction of XPaths for parsing documents, saving you some serious time.
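An XPath copied out of Firebug can then be applied programmatically. A small sketch using the standard library's ElementTree, which supports only a limited XPath subset and needs well-formed markup (for messy real-world HTML, `lxml.html`'s `.xpath()` is the more forgiving choice); the `<article>`/`<doi>` structure below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed results document
xml = """
<results>
  <article>
    <title>First paper</title>
    <doi>10.1021/ja01577a030</doi>
  </article>
  <article>
    <title>Second paper</title>
    <doi>10.1021/ja01626a063</doi>
  </article>
</results>
"""

root = ET.fromstring(xml)
# XPath-style query: the <doi> child of every <article>
dois = [e.text for e in root.findall("./article/doi")]
print(dois)  # ['10.1021/ja01577a030', '10.1021/ja01626a063']
```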

Good luck!
