Selenium versus BeautifulSoup for web scraping


Question

I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize in Python, but I saw that the website had a button that created content via JavaScript, so I decided to use Selenium.

Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?

And in this particular case, I need to use Selenium to click on the JavaScript button, so is it better to use Selenium to parse as well, or should I use both Selenium and Beautiful Soup?

Solution

Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use urllib2 with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too); a minimal sketch of that approach follows the list below. The advantages of not using Selenium needlessly:

  • Bandwidth. Using Selenium means fetching all the resources that would normally be fetched when you visit a page in a browser - stylesheets, scripts, images, and so on. This is probably unnecessary.
  • Stability and ease of error recovery. Selenium can be a little fragile, in my experience - even with PhantomJS - and creating the architecture to kill a hung Selenium instance and create a new one is a little more irritating than setting up simple retry-on-exception logic when using urllib2.
  • Potentially, CPU and memory usage - depending upon the site you're crawling, and how many spider threads you're trying to run in parallel, it's conceivable that either DOM layout logic or JavaScript execution could get pretty expensive.
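
As a concrete illustration of the static-page approach, here is a minimal Python 2 sketch (matching the urllib2 reference above); the URL is a placeholder:

    # Fetch a static page and parse it with BeautifulSoup.
    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen('http://example.com/page.html').read()
    soup = BeautifulSoup(html, 'lxml')  # use 'html.parser' if lxml isn't installed

    # Extract content with ordinary BeautifulSoup queries.
    for link in soup.find_all('a'):
        print link.get('href')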

Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using cookielib/cookiejar.
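
For instance, a minimal cookie-aware opener along those lines might look like this (Python 2; the URLs are placeholders):

    import urllib2
    import cookielib

    # An opener that stores cookies from responses and resends them on later requests.
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    # The first request collects any session cookies; the second sends them back.
    opener.open('http://example.com/sets-a-cookie')
    html = opener.open('http://example.com/needs-the-cookie').read()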

Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery. Usually one of these scenarios applies:

  • JavaScript served with the page has the content already baked into it. The JavaScript is just there to do the templating or other DOM manipulation that puts the content into the page. In this case, you might want to see if there's an easy way to pull the content you're interested in straight out of the JavaScript using regex; a sketch of this case follows the list.
  • The JavaScript is hitting a web API to load content. In this case, consider if you can identify the relevant API URLs and just hit them yourself; this may be much simpler and more direct than actually running the JavaScript and scraping content off the web page.
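
As an illustration of the first scenario, suppose the page embeds its data as a JSON literal inside an inline script block; the variable name pageData and the URL here are hypothetical, so the pattern would need adjusting to the real markup:

    import re
    import json
    import urllib2

    html = urllib2.urlopen('http://example.com/page.html').read()

    # Hypothetical markup: <script>var pageData = {...};</script>
    match = re.search(r'var pageData\s*=\s*(\{.*?\});', html, re.DOTALL)
    if match:
        data = json.loads(match.group(1))
        print data.get('title')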

If you do decide your situation merits using Selenium, use it with the PhantomJS driver rather than, say, the default Firefox driver. Web spidering doesn't ordinarily require rendering the page graphically, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.
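
Putting this together for the case in the question, a sketch might use headless Selenium to click the button and then hand the rendered HTML to BeautifulSoup; the URL, XPath, and CSS selector are placeholders, and this assumes a PhantomJS binary on the PATH:

    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.PhantomJS()
    driver.get('http://example.com/page-with-button')

    # Click the button that injects content via JavaScript.
    # (A real spider would wait for the new content to appear before parsing.)
    driver.find_element_by_xpath('//button[@id="load-more"]').click()

    # Hand the rendered page source to BeautifulSoup for extraction.
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for item in soup.select('.result'):
        print item.get_text()

    driver.quit()

Whether to parse with BeautifulSoup here is largely a matter of taste; Selenium's own find_element calls would work too, but handing off page_source lets the same parsing code serve both static and JavaScript-driven pages.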
