How is this webpage blocking me when I scrape through a loop but not when I access it directly?


Problem description


I am trying to scrape a set of webpages. When I scrape from one webpage directly, I am able to access the html. However, when I iterate through a pd dataframe to scrape a set of webpages, even a dataframe with only one row, I see a truncated html and cannot extract my desired data.

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

first_names = pd.Series(['Robert'], index = [0])
last_names = pd.Series(['McCoy'], index = [0])
names = pd.DataFrame(columns = ['first_name', 'last_name'])
names['first_name'] = first_names
names['last_name'] = last_names

freq = []

for first_name, last_name in names.iterrows():
    url = "https://zbmath.org/authors/?q={}+{}".format(first_name, last_name)
    r = requests.get(url)
    html = BeautifulSoup(r.text, 'html.parser')
    html = str(html)
    frequency = re.findall(r'Joint\sPublications">(.*?)</a>', html)
    freq.append(frequency)

print(freq)

[[]]

url = "https://zbmath.org/authors/?q=robert+mccoy"
r = requests.get(url)
html = BeautifulSoup(r.text, 'html.parser')
html = str(html)
frequency = re.findall(r'Joint\sPublications">(.*?)</a>', html)
freq.append(frequency)

print(freq)


[[], ['10', '8', '6', '5', '3', '3', '2', '2', '2', '2', '2', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']]


How can I loop through multiple webpages but not get blocked?

Answer

iterrows() yields (index, row) tuples rather than the column values themselves, so the loop needs to unpack each item slightly differently:

for _, (first_name, last_name) in names.iterrows():
    # iterrows() yields (index, row); discard the index and unpack the row
    url = "https://zbmath.org/authors/?q={}+{}".format(first_name, last_name)
    r = requests.get(url)
    html = BeautifulSoup(r.text, 'html.parser')
    html = str(html)
    frequency = re.findall(r'Joint\sPublications">(.*?)</a>', html)
    freq.append(frequency)

print(freq)
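
For illustration, here is a minimal sketch (not part of the original answer, reusing the same `names` DataFrame and making no network requests) of what `iterrows()` actually yields. It shows why the original loop built the URL from the row index and a whole row Series rather than the two names, which is consistent with the empty result [[]] seen above:

import pandas as pd

names = pd.DataFrame({'first_name': ['Robert'], 'last_name': ['McCoy']})

# Original unpacking: the loop variables receive (index, row), not the two columns.
for first_name, last_name in names.iterrows():
    print(first_name)        # 0  (the row index)
    print(type(last_name))   # <class 'pandas.core.series.Series'>  (the whole row)
    print("https://zbmath.org/authors/?q={}+{}".format(first_name, last_name))
    # -> a malformed query built from "0" and the Series repr, not the author's name

# Corrected unpacking, as in the answer: discard the index, unpack the row values.
for _, (first_name, last_name) in names.iterrows():
    print("https://zbmath.org/authors/?q={}+{}".format(first_name, last_name))
    # -> https://zbmath.org/authors/?q=Robert+McCoy

A more idiomatic alternative (my suggestion, not from the original answer) would be `for row in names.itertuples():` and then `row.first_name` / `row.last_name`, which avoids the index/row unpacking pitfall entirely.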

