How can I scrape from a webpage that uses JavaScript to load in elements as you scroll?


Question

My friend asked whether I could write a web scraping script to collect Pokemon data from a specific website.

I've written the following code to render the JavaScript and select a particular class in order to collect data from the website (https://www.smogon.com/dex/ss/pokemon/).

The issue is that the page loads more entries as you scroll down. Is there any way to scrape them? I'm new to web scraping, so I'm not entirely sure how this all works.

from requests_html import HTMLSession

def getPokemon(link):
    session = HTMLSession()
    r = session.get(link)
    # Execute the page's JavaScript in a headless browser
    r.html.render()
    # Only the rows rendered so far are found; more load as the page scrolls
    for pokemon in r.html.find("div.PokemonAltRow"):
        print(pokemon)

getPokemon('https://www.smogon.com/dex/ss/pokemon/')

Recommended answer

The data is actually present in the page source. See view-source:https://www.smogon.com/dex/ss/pokemon/ (it is stored inside a script tag as a JavaScript variable).

import requests
import re
import json


response = requests.get('https://www.smogon.com/dex/ss/pokemon/')

# Capture the JSON text assigned to the `dexSettings` variable in the page source
data = "".join(re.findall(r'dexSettings = (\{.*\})', response.text))

# The regex only returns a string; parse it with `json.loads()` to get a regular Python object
data = json.loads(data)

# Now the parsed object can be queried like any nested dict/list
data = data.get('injectRpcs', [])[1][1].get('items', [])

for row in data:
    print(row.get('name', ''))
    print(row.get('description', ''))
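The extraction step above can also be tried offline. The following is a minimal sketch, not the answer's own method: it swaps the greedy regex for `json.JSONDecoder().raw_decode()`, which parses exactly one JSON value and ignores whatever follows it. `SAMPLE_HTML`, its entry labels, and the helper names `extract_js_object` and `pokemon_names` are made up for illustration; only the `dexSettings` variable and the `injectRpcs[1][1]['items']` path come from the answer, and the real page may be laid out differently.

```python
import json
import re

# Hypothetical stand-in for the page source; the real page is far larger
# and its exact contents are an assumption here.
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">
    dexSettings = {"injectRpcs": [["first-entry", {}], ["second-entry", {"items": [{"name": "Bulbasaur", "description": "A {seed} Pokemon"}, {"name": "Ivysaur", "description": ""}]}]]}
</script>
<script>var other = {"unrelated": true};</script>
</body></html>
"""


def extract_js_object(html, variable):
    """Return the JSON value assigned to `variable` in an inline script."""
    # Find "<variable> = " and remember where the value begins.
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{variable!r} not found in page source")
    # raw_decode parses one JSON value starting at the given index and
    # stops there, so braces later in the page cannot confuse it.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


def pokemon_names(settings):
    """Follow the same path as the answer: injectRpcs[1][1]['items']."""
    items = settings.get("injectRpcs", [])[1][1].get("items", [])
    return [row.get("name", "") for row in items]


if __name__ == "__main__":
    settings = extract_js_object(SAMPLE_HTML, "dexSettings")
    for name in pokemon_names(settings):
        print(name)
```

To run it against the live site, pass `requests.get(url).text` in place of `SAMPLE_HTML`. This only works if the variable's value is valid JSON, which is also what the answer's `json.loads()` call assumes.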
