Scraping a webpage that has JavaScript with BeautifulSoup


Question

Guys! I am applying to you once again. I'm OK with scraping simple websites with tags, but recently I've encountered a fairly complex website that uses JavaScript. I would like to obtain all the estimates at the bottom of the page as a table (CSV), with columns like 'User', 'Revenue estimate', 'EPS estimate'.

I hoped to figure it out by myself but kind of failed.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.estimize.com/jpm/fq3-2016?sort=rank&direction=asc&estimates_per_page=142&show_confirm=false")
soup = BeautifulSoup(html.read(), "html.parser")
# print the raw contents of the twelfth <script> tag on the page
print(soup.find_all('script')[11].string)

The output has a strange format and I don't know how to extract the data in a usable form. I'd appreciate any help!

Recommended answer

It looks like the data you're trying to extract is in a data model, which means it's JSON. You can get at it with a small amount of parsing:

import json
import re

# take the text of the <script> tag and keep what sits between "DataModel.parse(" and ");"
data_string = soup.find_all('script')[11].string
data_string = data_string.split("DataModel.parse(")[1]
data_string = data_string.split(");")[0]

# strip out stray HTML tags
while re.search(r'<[^>]*>', data_string):
    data_string = ''.join(data_string.split(re.search(r'<[^>]*>', data_string).group(0)))

# cut off the other function parameters, leaving you with the JSON
data_you_want = json.loads(data_string.split(re.search(r'\}[^",\}\]]+,', data_string).group(0))[0] + '}')

print(data_you_want["estimate"])
>>> {'shares': {'shares_hash': {'twitter': None, 'stocktwits': None, 'linkedin': None}}, 'lastRevised': None, 'id': None, 'revenue_points': None, 'sector': 'financials', 'persisted': False, 'points': None, 'instrumentSlug': 'jpm', 'wallstreetRevenue': 23972, 'revenue': 23972, 'createdAt': None, 'username': None, 'isBlind': False, 'releaseSlug': 'fq3-2016', 'statement': '', 'errorRanges': {'revenue': {'low': 21247.3532016398, 'high': 26820.423240734}, 'eps': {'low': 1.02460526459765, 'high': 1.81359679579922}}, 'eps_points': None, 'rank': None, 'instrumentId': 981, 'eps': 1.4, 'season': '2016-fall', 'releaseId': 52773}

DataModel.parse is a JavaScript method call, which means it ends with a closing parenthesis and a semicolon. The first argument of the call is the JSON object you want (the code above strips the remaining ones). Loading it with json.loads gives you a Python dictionary that you can index directly.
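As an alternative to splitting on a regular expression, the first argument could also be pulled out with the standard library's json.JSONDecoder.raw_decode, which parses a single JSON value and ignores whatever follows it. This is only a sketch under assumptions: the helper name extract_first_json_argument and the sample_script string are made up for illustration, and the real page may contain content (such as the stray HTML mentioned above) that still needs cleaning first.

```python
import json


def extract_first_json_argument(script_text, marker="DataModel.parse("):
    """Return the first JSON value that follows `marker` in script_text.

    Hypothetical helper, not part of the original answer.
    """
    start = script_text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so do that by hand.
    while script_text[start].isspace():
        start += 1
    # raw_decode parses one JSON value starting at `start` and reports where
    # it ended, so trailing arguments and the closing ");" are ignored.
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value


# Made-up sample standing in for the contents of the <script> tag.
sample_script = (
    'window.app = DataModel.parse('
    '{"estimate": {"eps": 1.4, "revenue": 23972, "username": null}}, '
    '{"parse": true});'
)

data = extract_first_json_argument(sample_script)
print(data["estimate"]["eps"])  # prints 1.4
```

With the real page, script_text would be the string of the relevant script tag instead of sample_script.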

From there, you can remap the data into whatever form you want for the CSV.
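The remapping step might look something like the sketch below. It assumes you have already collected a list of estimate dictionaries with the username, revenue and eps keys seen in the output above. Where that list lives inside the parsed JSON is not shown in this thread, so the estimates sample and the estimates_to_csv helper here are hypothetical.

```python
import csv
import io

# Hypothetical sample: estimate dicts shaped like the one printed above.
estimates = [
    {"username": "analyst_a", "revenue": 23972, "eps": 1.4},
    {"username": "analyst_b", "revenue": 24100.5, "eps": 1.38},
]

# Map the JSON keys to the column headers asked for in the question.
columns = {"username": "User", "revenue": "Revenue estimate", "eps": "EPS estimate"}


def estimates_to_csv(rows, out):
    """Write one CSV line per estimate to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(columns.values())
    for row in rows:
        # Missing keys become empty cells rather than raising KeyError.
        writer.writerow([row.get(key) for key in columns])


buffer = io.StringIO()
estimates_to_csv(estimates, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a string buffer, pass a file opened with open("estimates.csv", "w", newline="") as the out argument.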
