Glassdoor Web Scrape With Selenium

Question

I am trying to scrape the rating trend data displayed in the bottom-left chart of the page linked below, but cannot figure out a way to get to it. I am worried that the chart is embedded as a picture, which would make the data inaccessible, but thought I would check.

I have added the code I stitched together, but it only returns the axis values.

Any help would be greatly appreciated.

https://www.glassdoor.com/Reviews/Netflix-Reviews-E11891.htm#trends-overallRating

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Run Chrome without a visible window.
options = Options()
options.add_argument('--headless')

# 'PATH' is a placeholder for the local chromedriver location (Selenium 4 style).
driver = webdriver.Chrome(service=Service(r'PATH'), options=options)
driver.get('https://www.glassdoor.com/Reviews/Netflix-Reviews-E11891.htm#trends-overallRating')

# The text of the chart container only yields the axis values, not the plotted ratings.
trend_element = driver.find_element(By.ID, 'DesktopTrendChart')
trend = trend_element.text
print(trend)

driver.quit()

Recommended answer

I was originally having a go at it using BeautifulSoup.

I was able to pull out all the coordinates of the corresponding values. It took about an hour or so to find where they were located, extract them, and get them into a nice, tidy dataframe.
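The answer does not show how the coordinates were extracted, so the following is only a rough sketch of the idea. It assumes the chart is drawn as an SVG path whose `d` attribute lists the plotted points, and it uses the standard library instead of BeautifulSoup. The sample path string and the helper name `parse_path_points` are hypothetical.

```python
import re

# Hypothetical SVG path data; the real chart markup will differ.
SAMPLE_PATH_D = "M 10 120.5 L 40 118.25 L 70 95 L 100 97.75"


def parse_path_points(d):
    """Return the (x, y) pixel pairs found in a simple M/L path string."""
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", d)]
    # Pair consecutive numbers: x0, y0, x1, y1, ...
    return list(zip(numbers[0::2], numbers[1::2]))


points = parse_path_points(SAMPLE_PATH_D)
print(points)
```

This only handles paths made of absolute move and line commands; curves or relative commands would need a proper path parser.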

For the next step, I was going to convert the x and y coordinates to the corresponding x and y axis labels, then interpolate to create a more granular set of data (I had not attempted this yet). I expected that to take about another hour.
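That conversion was never written, but as a minimal sketch it could be a linear mapping between two known axis ticks, with the same function reused to interpolate between neighbouring points. The tick positions, pixel values, and the name `linear_map` below are made-up assumptions for illustration.

```python
def linear_map(x, x0, y0, x1, y1):
    """Linearly map x onto the line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Assumed axis ticks: y pixel 150 is rating 3.0, y pixel 50 is rating 4.0.
pixel_ys = [120.0, 100.0, 95.0]
ratings = [linear_map(y, 150.0, 3.0, 50.0, 4.0) for y in pixel_ys]

# Interpolate a rating halfway between the first two points (x pixels 10 and 40).
midpoint_rating = linear_map(25.0, 10.0, ratings[0], 40.0, ratings[1])

print(ratings, midpoint_rating)
```

Pixel y values usually grow downward, which is why the larger pixel value maps to the lower rating here.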

I did a little more research prior to doing that and found an interesting article here.

After reading it and going back to the original problem, I was able to solve it a) in fewer lines of code, b) without BeautifulSoup, and c) in about 5-10 minutes, and d) I learned something new.

So read over that link, check out the code, and this should get you what you need.

import requests
import pandas as pd

# JSON endpoint that returns the overall rating trend; the 11891 in the path
# appears to match the employer ID in the review page URL (E11891).
url = 'https://www.glassdoor.co.uk/api/employer/11891-rating.htm?dataType=trend&category=overallRating&locationStr=&jobTitleStr=&filterCurrentEmployee=false'

with requests.Session() as se:
    # Browser-like headers; this assignment replaces the session defaults.
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    response = se.get(url)

# Stop early on an HTTP error, then decode the JSON body.
response.raise_for_status()
data = response.json()

# 'dates' and 'employerRatings' are parallel lists in the response.
results = pd.DataFrame({'date': data['dates'], 'rating': data['employerRatings']})

Output:

print (results)
          date  rating
0   2018/12/30  3.66104
1   2018/12/30  3.66311
2   2018/11/25  3.69785
3   2018/10/28  3.73478
4    2018/9/30  3.68311
5    2018/8/26  3.69093
6    2018/7/29  3.70312
7    2018/6/24  3.74851
8    2018/5/27  3.67543
9    2018/4/29  3.67500
10   2018/3/25  3.62248
11   2018/2/25  3.73467
12   2018/1/28  3.70791
13  2017/12/31  3.72217
14  2017/11/26  3.69733
15  2017/10/29  3.61443
16   2017/9/24  3.47046
17   2017/8/27  3.46511
18   2017/7/30  3.46711
19   2017/6/25  3.48164
20   2017/5/28  3.52925
21   2017/4/30  3.46825
22   2017/3/26  3.46874
23   2017/2/26  3.52620
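The dates in this output are plain strings, newest first, and 2018/12/30 appears twice. As an optional follow-up, here is a small sketch of how the frame might be tidied for plotting or resampling. It rebuilds a few rows from the output above so that it runs on its own; keeping the first row of a duplicated date is an arbitrary choice, not something the answer specifies.

```python
import pandas as pd

# A few rows copied from the output above, in the same column layout.
results = pd.DataFrame({
    "date": ["2018/12/30", "2018/12/30", "2018/11/25", "2018/9/30"],
    "rating": [3.66104, 3.66311, 3.69785, 3.68311],
})

# Parse the non-zero-padded date strings into real timestamps.
results["date"] = pd.to_datetime(results["date"], format="%Y/%m/%d")

# Keep the first row per date (arbitrary choice) and sort oldest first.
tidy = (
    results.drop_duplicates(subset="date", keep="first")
    .sort_values("date")
    .set_index("date")
)
print(tidy)
```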
