Webscraper won't loop from page 2 to page 5


Problem description

I am using https://www.realtor.com/realestateagents/phoenix_az//pg-2 as my starting point. I want to go from page 2 to page 5, and every page in between, while collecting names and numbers. I am collecting the information on page 2 perfectly, but I cannot get the scraper to move to the next page without plugging in a new url by hand. I am trying to set up a loop to do this automatically, however after coding what I thought would be a loop I am still only getting the information from page 2 (the starting point) before the scraper stops. I am new to loops and have tried multiple approaches, but can't get any of them to work.

Below is the complete code for now.

import requests
from requests import get
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup 
import numpy as np
from numpy import arange
import pandas as pd 

from time import sleep
from random import randint

headers = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
                          'AppleWebKit/537.36 (KHTML, like Gecko)'
                          'Chrome/45.0.2454.101 Safari/537.36'),
                          'referer': 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'}

my_url = 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'

#opening up connection, grabbing the page
uClient = uReq(my_url)
#read page 
page_html = uClient.read()
#close page
uClient.close()

pages = np.arange(2, 3, 1)

for page in pages:

    page = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-" , headers=headers)

#html parsing
page_soup = soup(page_html, "html.parser")

#finds all realtors on page 
containers = page_soup.findAll("div",{"class":"agent-list-card clearfix"})

#creating csv file 
filename = "phoenix.csv"
f = open(filename, "w")

headers = "agent_name, agent_number\n"
f.write(headers)

#controlling scrape speed 


for container in containers:

    try:
        name = container.find('div', class_='agent-name text-bold')
        agent_name = name.a.text.strip()
    except AttributeError:
        print("-")

    try:
        number = container.find('div', class_='agent-phone hidden-xs hidden-xxs')
        agent_number = number.text.strip()
    except AttributeError:
        print("-")
    except NameError:
        print("-")

    try:
        print("name: " + agent_name)
        print("number: " + agent_number)
    except NameError:
        print("-")

    try:
        f.write(agent_name + "," + agent_number + "\n")
    except NameError:
        print("-")

f.close()
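
Worth spelling out why the posted loop only ever returns page 2: np.arange(2, 3, 1) yields just [2], the requests.get call inside the loop points at a url with no page number appended, and the parsing below it runs outside the loop on page_html, which was fetched once before the loop started. A minimal sketch of a restructured loop (keeping the question's url pattern, class names, and an abbreviated user-agent, all carried over from the code above as assumptions) could look like this:

import requests
from bs4 import BeautifulSoup as soup

# Abbreviated user-agent; swap in the full header dict from the code above.
headers = {'user-agent': 'Mozilla/5.0'}

for page_number in range(2, 6):  # pages 2 through 5
    # Build the per-page url inside the loop so each iteration fetches a new page.
    url = f"https://www.realtor.com/realestateagents/phoenix_az/pg-{page_number}"
    response = requests.get(url, headers=headers)

    # Parse inside the loop too, otherwise only the first response is ever used.
    page_soup = soup(response.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "agent-list-card clearfix"})

    for container in containers:
        # ...extract agent_name / agent_number from each container as above...
        pass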

Recommended answer

Not sure if that's exactly what you need, but here's a working (and simplified) version of your code that scrapes the first five pages.

If you take a close look, I'm using a for loop to "move" through the pages by appending the page number to the url. For each page I fetch the HTML, parse it for the agent divs, grab the name and number (substituting N/A when the number is None), and finally dump the whole list to a csv file.

To match the comments, I've added a city column (Pheonix) and a wait_for pause that stops the script for a random 1 to 10 seconds between pages; both are adjustable.

import csv
import random
import time

import requests
from bs4 import BeautifulSoup


realtor_data = []

# Loop over the first five result pages by formatting the page number into the url.
for page in range(1, 6):
    print(f"Scraping page {page}...")
    url = f"https://www.realtor.com/realestateagents/phoenix_az/pg-{page}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Each agent sits in its own card; pull the name and the (optional) phone number.
    for agent_card in soup.find_all("div", {"class": "agent-list-card clearfix"}):
        name = agent_card.find("div", {"class": "agent-name text-bold"}).find("a")
        number = agent_card.find("div", {"itemprop": "telephone"})
        realtor_data.append(
            [
                name.getText().strip(),
                number.getText().strip() if number is not None else "N/A",
                "Pheonix",
            ],
        )

    # Wait a random 1-10 seconds between pages to keep the scrape speed down.
    wait_for = random.randint(1, 10)
    print(f"Sleeping for {wait_for} seconds...")
    time.sleep(wait_for)

# Dump everything to a csv file once all pages are collected.
with open("data.csv", "w") as output:
    w = csv.writer(output)
    w.writerow(["NAME:", "PHONE NUMBER:", "CITY:"])
    w.writerows(realtor_data)

Output:

A .csv file with the realtors' names and phone numbers.

NAME:                     PHONE NUMBER:    CITY:
------------------------  ---------------  -------
Shawn Rogers              (480) 313-7031   Pheonix
The Jason Mitchell Group  (480) 470-1993   Pheonix
Kyle Caldwell             (602) 390-2245   Pheonix
THE VALENTINE GROUP       N/A              Pheonix
Nancy Wolfe               (602) 418-1010   Pheonix
Rhonda DuBois             (623) 418-2970   Pheonix
Sabrina Hurley            (602) 410-1985   Pheonix
Bryan Adams               (480) 375-1292   Pheonix
DeAnn Fry                 (623) 748-3818   Pheonix
Esther P Goh              (480) 703-3836   Pheonix
...
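
One difference from the question's code worth noting: this version sends no request headers. If realtor.com ever starts rejecting the plain requests, the user-agent header from the question can be passed straight to requests.get, roughly like this (the header string is copied from the question's code):

headers = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/45.0.2454.101 Safari/537.36')}
# Same fetch as in the answer, just with the header dict attached.
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")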
