Scraping a specific website with a search box and JavaScript in Python


Question

On the website https://sray.arabesque.com/dashboard there is a search box (an HTML "input" element). I want to enter a company name in the search box, choose the first suggestion for that name in the dropdown menu (e.g., "Anglo American plc"), go to the URL with the information about that company, let the JavaScript run so that I get the full HTML of the resulting page, and then scrape the GC Score, ESG Score, and Temperature Score at the bottom.

!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Run Chrome headless, with the flags commonly needed inside a container
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# chromedriver was copied to /usr/bin above, so it is found on PATH
wd = webdriver.Chrome(options=options)

companies = ['Anglo American plc']

for company in companies:
  # dryscrape.start_xvfb()
  # session = dryscrape.Session()
  # session.visit("https://srayapi.arabesque.com/api/sray/company/history/004BTP-E")
  wd.get('https://sray.arabesque.com/dashboard/')
  # print(wd.page_source)
  e = wd.find_element(By.ID, 'mat-input-0')
  e.send_keys(company)
  e.send_keys(Keys.ENTER)
  # execute_script is a method of the driver, not of the element
  innerHTML = wd.execute_script("return document.body.innerHTML")
  print(innerHTML)

I don't quite understand how to visit the URL with the information about Anglo American and scrape it, given that the URL is not known until the company name has been entered in the search box.

Recommended answer

I don't know exactly why you want to use Selenium, run the search, and then load another page, but here is what I would do to get the data you are looking for:

import requests
import json

session = requests.Session()
url = 'https://srayapi.arabesque.com/api/sray/q'
response = session.get(url).json()

rays = response['data']['rays']
# Keep the matches in a variable so they can be used afterwards
results = [ray for ray in rays if ray['name'].startswith('Anglo American')]
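
The filtering step can be wrapped in a small helper so that it tolerates a missing key and ignores letter case. This is only a sketch: `find_rays` is a hypothetical name, the payload shape (`data` then `rays`, each entry with a `name`) is taken from the snippet above, and the sample payload is invented so the example runs without calling the API.

```python
from __future__ import annotations

from typing import Any


def find_rays(payload: dict[str, Any], prefix: str) -> list[dict[str, Any]]:
    """Return the entries under payload['data']['rays'] whose name starts with prefix."""
    # Fall back to empty containers if the payload lacks the expected keys.
    rays = payload.get("data", {}).get("rays", [])
    wanted = prefix.casefold()
    return [
        ray for ray in rays
        if str(ray.get("name", "")).casefold().startswith(wanted)
    ]


# Made-up payload that mimics the assumed shape; the numbers are not real scores.
sample_payload = {
    "data": {
        "rays": [
            {"name": "Anglo American plc", "gc": 1.0, "esg": 2.0, "score_near": 3.0},
            {"name": "Example Corp", "gc": 4.0, "esg": 5.0, "score_near": 6.0},
        ]
    }
}

results = find_rays(sample_payload, "anglo american")
print([ray["name"] for ray in results])
```

With real data, the argument would be the parsed JSON from `session.get(url).json()` instead of `sample_payload`.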

Then do whatever you want with the results. For the ESG, GC, and temperature scores, perhaps:

myObj = [{result['name']: {'gc': result['gc'], 'esg': result['esg'], 'temp': result['score_near']}} for result in results]
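
If a single dictionary keyed by company name is more convenient than a list of one-entry dictionaries, the same reshaping might look like the sketch below. `summarize` is a hypothetical helper; the field names `gc`, `esg`, and `score_near` come from the line above, and the sample record uses invented values.

```python
def summarize(results):
    """Map each company name to its gc, esg and temperature values."""
    return {
        r["name"]: {
            "gc": r.get("gc"),
            "esg": r.get("esg"),
            # The answer above reads the temperature score from 'score_near'.
            "temp": r.get("score_near"),
        }
        for r in results
    }


# Invented record with the same keys as the answer's snippet; not real scores.
sample_results = [
    {"name": "Anglo American plc", "gc": 1.0, "esg": 2.0, "score_near": 3.0},
]

scores = summarize(sample_results)
print(scores["Anglo American plc"])
```

Using `.get` means a record that lacks one of the fields yields `None` for it rather than raising `KeyError`.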
