Web-scraping JavaScript page with Python


Question

I'm trying to develop a simple web scraper. I want to extract the text of a page without the HTML code. This works in general, but on some pages where JavaScript is loaded I don't get good results.

For example, if some JavaScript code adds some text, I can't see it, because when I call

response = urllib2.urlopen(request)

I get the original text without the added text (because JavaScript is executed in the client).

So, I'm looking for some ideas to solve this problem.

Recommended answer

EDIT 30/Dec/2017: This answer appears in the top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found that Selenium's Python library with PhantomJS as the web driver is fast enough and makes it easy to get the work done.

Once you have installed PhantomJS, make sure the phantomjs binary is available on your PATH:

phantomjs --version
# result:
2.1.1
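
The same check can be made from Python before starting the driver. This is a minimal sketch using only the standard library; the helper name `binary_on_path` is hypothetical and is not part of Selenium or PhantomJS:

```python
import shutil


def binary_on_path(name):
    """Return True if an executable called `name` is found on the PATH."""
    # shutil.which returns the full path to the executable, or None
    return shutil.which(name) is not None


if __name__ == "__main__":
    # "phantomjs" is the binary name used in this answer
    if binary_on_path("phantomjs"):
        print("phantomjs found")
    else:
        print("phantomjs is not on the PATH")
```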

Example

As an example, I created a sample page with the following HTML code (link):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

Without JavaScript it says: No javascript support. With JavaScript it says: Yay! Supports javascript

Scraping without JS support:

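import requests
from bs4 import BeautifulSoup

# my_url is the URL of the sample page above
response = requests.get(my_url)
# Name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(response.text, "html.parser")
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>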
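
This result follows from how a static fetch works: the parser sees the markup exactly as delivered and never runs the script. Here is a minimal offline sketch of that behaviour using only the standard library. The names `VisibleTextParser` and `visible_text` are hypothetical, and `html.parser` stands in for BeautifulSoup here:

```python
from html.parser import HTMLParser

# Same markup as the sample page, inlined so the sketch runs offline
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>"""


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Script source is skipped, not executed, so its effect is never seen
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-empty text nodes of `html` in document order."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    print(visible_text(SAMPLE_HTML))
    # → ['Javascript scraping test', 'No javascript support']
```

The paragraph still reads "No javascript support" because nothing in this pipeline executes the script; that is the gap a browser-backed driver fills.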

Scraping with JS support:

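from selenium import webdriver

# webdriver.PhantomJS and find_element_by_id exist only in older
# Selenium releases; both are absent from current Selenium 4
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id('intro-text')
print(p_element.text)
driver.quit()  # shut down the PhantomJS process
# result:
Yay! Supports javascript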
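
On a current Selenium 4 install, the PhantomJS driver is not available, so the same idea has to run on a different browser. The sketch below swaps in headless Chrome, which is not what this answer originally used. It assumes Selenium 4 and a local Chrome installation, and the function name `text_by_id` is hypothetical:

```python
def text_by_id(url, element_id):
    """Load `url` in headless Chrome and return the text of one element."""
    # Imported here so the module still loads where Selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # By.ID replaces the older find_element_by_id helper
        return driver.find_element(By.ID, element_id).text
    finally:
        # Always close the browser, even if the lookup fails
        driver.quit()
```

Calling `text_by_id(my_url, "intro-text")` on the sample page should return the script-modified text, since the inline script runs while the page loads. Pages that fill in content asynchronously may need an explicit wait before the lookup.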

Old answer:

You can also use the Python library dryscrape to scrape JavaScript-driven websites.

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>
