Web scrape not returning full HTML

Question

I am trying to scrape 'https://www.kaggle.com/kernels' to get all of the title names on the site, but the container that holds them, 'div data-reactroot', is not being pulled into the scraped data.

import urllib.request
from bs4 import BeautifulSoup

kaggle = 'https://www.kaggle.com/kernels'
data = urllib.request.urlopen(kaggle).read()
htmlparse = BeautifulSoup(data, 'html.parser')
print(htmlparse.findAll("div", {"class" : "block-link block-link--bordered"}))

Is there an error in my code or is there some sort of block on the site preventing me from scraping this data?

Recommended answer

The data you want is fetched by JavaScript in JSON format each time the page loads, so it is not part of the HTML that urllib receives. You can fetch it directly from "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=20&after=439354&language=all&outputType=all" like this:

import requests

# JSON endpoint that the page's JavaScript calls to load the listing
url = "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=20&after=439354&language=all&outputType=all"
source = requests.get(url)
json_obj = source.json()  # parse the response body as JSON
for a in json_obj:
    print(a["title"])

Output:

2004-2005 Landfalling Hurricanes animation
Visualization of StockData
Generating Sentences One Letter at a Time 
Decoding the Sexiest Job of 21st Century!!
Novice to Grandmaster
Analysis  on Pokemon Data
ROC Curve with k-Fold CV
Japan Bulgaria trade playground
Bootstrapping and CIs with Veteran Suicides
Replicating "Did I do that?" paper analyses with R
Social Progress Index and World Happiness Report
SVM+HOG On ColourCompositeImage
Low- level students
PyTorch Speech Recognition Challenge (WIP)  
Loans -getting Insights
Exploring Youtube Trending Statistics EDA
3 Simple Steps (LB: .9878 with new data)
Titanic: Neural Network using Keras
Feature Engineering 
Why do employees leave and what to do about it

The only thing you will have to change is the "after" query string parameter, which in my request was 439354; you could set it to 0 to get the first records.

You could also change the number of records returned by changing the "pageSize" query string parameter, e.g. "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=5&after=0&language=all&outputType=all"

Output:

Data ScienceTutorial for Beginners
Data visualization and investigation
Spooky NLP and Topic Modelling tutorial
20 Years Of Games Analysis
NYC Taxi EDA - Update: The fast & the curious
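
Rather than editing the URL by hand, the query string can be assembled from its parameters. This is only a sketch: the helper name build_kernels_url is made up for illustration, and the parameter names and values are simply the ones that appear in the URLs above, not a documented API.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the answer above
BASE_URL = "https://www.kaggle.com/kernels.json"


def build_kernels_url(after=0, page_size=20):
    """Hypothetical helper: build the listing URL for a given
    "after" value and "pageSize"."""
    params = {
        "sortBy": "hotness",
        "group": "everyone",
        "pageSize": page_size,
        "after": after,
        "language": "all",
        "outputType": "all",
    }
    # urlencode joins the pairs with "&" in insertion order
    return BASE_URL + "?" + urlencode(params)


print(build_kernels_url(after=0, page_size=5))
```

The printed URL is the same as the pageSize=5 example above, so it can be passed straight to requests.get or urllib.request.urlopen.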

Or an example with urllib:

import urllib.request
import json
kaggle = "https://www.kaggle.com/kernels.json?sortBy=hotness&group=everyone&pageSize=5&after=0&language=all&outputType=all"
data = urllib.request.urlopen(kaggle).read()
json_obj = json.loads(data.decode("utf-8"))
for a in json_obj:
    print(a["title"])
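
Both loops above assume the response is a JSON list of objects that each have a "title" key. If that is not guaranteed, a slightly more defensive version might look like the sketch below. The function name extract_titles and the sample data are made up for illustration, and the assumed response shape comes only from the examples above.

```python
import json


def extract_titles(payload):
    """Hypothetical helper: return the titles in a decoded JSON payload,
    skipping anything that does not have the expected shape."""
    if not isinstance(payload, list):
        # e.g. an error object was returned instead of a list of records
        return []
    return [
        item["title"]
        for item in payload
        if isinstance(item, dict) and "title" in item
    ]


# Made-up sample standing in for the decoded response body
sample = json.loads(
    '[{"title": "Feature Engineering"}, {"id": 1}, {"title": "Novice to Grandmaster"}]'
)
for title in extract_titles(sample):
    print(title)
```

With the real request, json_obj from either example above would be passed in place of sample.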
