For Loop trying to scrape TripAdvisor Restaurant data


Problem description

I am trying to scrape a list of all the restaurants in Hong Kong and their corresponding URLs. Currently, with the code below, I can scrape the 1st and 2nd pages. But I want the for loop towards the bottom to be more dynamic and keep scraping until it reaches the number of entries I specified in range().

I am still a novice at this, so any help would be awesome.

#import libraries
import requests
from bs4 import BeautifulSoup
import csv


#scrape the first page because this URL is different than when you start moving to different pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.findAll('a', {'property_title'}):
    print('https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href'))
    print(link.string)

#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    #url format offsets the restaurants in increments of 30 after the oa; hence entries as variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print('https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href'))
        print(link.string)
    break

Recommended answer

I ended up adding a while loop that got it to loop the way I wanted. Hope this helps people in the future.

#start at the second page; the first page has its own URL without the oa offset
offset = 30
#stop before this offset, so the loop fetches oa30, oa60 and oa90
max_offset = 120
while offset < max_offset:
    #url format offsets the restaurants in increments of 30 after the oa
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(offset) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print('https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href'))
        print(link.string)
    offset += 30
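
For readers who want the loop driven by a target number of entries rather than a hard-coded upper bound, here is a minimal sketch, not the original poster's method. It is written for Python 3 (the code above is Python 2 style). The names `page_urls`, `scrape`, `BASE` and `PAGE_SIZE` are hypothetical helpers introduced for illustration. The URL pattern, the 30-per-page offset and the `property_title` class are taken from the question and may no longer match the live site. It also assumes each `href` is a site-relative path, which the question does not confirm.

```python
import math

BASE = 'https://www.tripadvisor.com'
PAGE_SIZE = 30  # entries per listing page, as described in the question


def page_urls(total_entries, page_size=PAGE_SIZE):
    """Return the listing URLs needed to cover total_entries restaurants."""
    pages = math.ceil(total_entries / page_size)
    urls = []
    for page in range(pages):
        offset = page * page_size
        # The first page has no "-oaNN" part; later pages carry the offset.
        part = '' if offset == 0 else '-oa' + str(offset)
        urls.append(BASE + '/Restaurants-g294217' + part
                    + '-Hong_Kong.html#EATERY_LIST_CONTENTS')
    return urls


def scrape(total_entries):
    """Fetch each listing page and yield (name, url) pairs."""
    # Third-party imports live here so page_urls() works without them.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    for url in page_urls(total_entries):
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for link in soup.find_all('a', class_='property_title'):
            # Assumption: href is a site-relative path, so urljoin
            # builds the absolute URL from BASE.
            yield link.get_text(strip=True), urljoin(BASE, link.get('href'))
```

With this shape, `for name, url in scrape(120): print(name, url)` would request four pages (the first page plus offsets 30, 60 and 90), and changing the single argument changes how many pages are fetched. Building the URL list separately from the fetching also makes the offset arithmetic easy to check without touching the network.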
