Scraping Google Destinations

Question

I'm preparing a trip around the world and am curious to find out what the top sights are, so I'm trying to scrape the top destinations within a given place. I want to end up with the top places in a country and their best sights. Google Destinations was recently added and is a great feature for this.

For example, when googling Cuba Destinations, Google shows a card with the destinations Havana, Varadero, Trinidad, Santiago de Cuba.

Then, when googling Havana Cuba Destinations, it shows Old Havana, Malecon, Castillo de los Tres Reyes Magos del Morro, El Capitolio.

Finally, I'll turn it into a table that looks like:

Cuba, Havana, Old Havana.
Cuba, Havana, Malecon.
Cuba, Havana, Castillo de los Tres Reyes Magos del Morro.
Cuba, Havana, El Capitolio.
Cuba, Varadero, Hicacos Peninsula.

And so on.
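One way to produce rows like these, as a minimal sketch: assume the scraped results have already been collected into a nested dictionary (country, then city, then a list of sights). The `destinations` dictionary, `flatten` and `to_csv` are hypothetical names used for illustration, and the data is just the example from this question, not real scraper output.

```python
import csv
import io

# Hypothetical scraped data: country -> city -> list of sights.
# These values are the examples from the question, not real scraper output.
destinations = {
    "Cuba": {
        "Havana": [
            "Old Havana",
            "Malecon",
            "Castillo de los Tres Reyes Magos del Morro",
            "El Capitolio",
        ],
        "Varadero": ["Hicacos Peninsula"],
    }
}


def flatten(data):
    """Return one (country, city, sight) tuple per sight."""
    return [
        (country, city, sight)
        for country, cities in data.items()
        for city, sights in cities.items()
        for sight in sights
    ]


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["country", "city", "sight"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten(destinations)
print(to_csv(rows))
```

Keeping the nested dictionary as the intermediate form means the scraping step and the table-building step can be changed independently.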

I have tried the API call shown in Travel destinations API, but it does not return the right results and often yields OVER_QUERY_LIMIT.
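For the OVER_QUERY_LIMIT part, a common mitigation is to retry with exponential backoff. The sketch below is only an illustration of that idea: `QuotaExceeded`, `call_with_backoff` and the `fetch` callable are hypothetical names, and it assumes your own request function raises `QuotaExceeded` when the response status is OVER_QUERY_LIMIT. It does not address the API returning the wrong kind of results.

```python
import random
import time


class QuotaExceeded(Exception):
    """Raised by the (hypothetical) fetch function on an OVER_QUERY_LIMIT response."""


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponentially growing delays on QuotaExceeded."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except QuotaExceeded:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Delay doubles on each retry: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * (2 ** attempt)
            # Add up to 10% random jitter so repeated clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` is fine.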

The following code returns an error:

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.nl/destination/compare?q=cuba+destinations&site=search&output=search&dest_mid=/m/0d04z6&sa=X&ved=0API_KEY"

#URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

Any suggestions?

Recommended answer

You will need to use something like Selenium for this: the page makes multiple XHRs, so you will not be able to get the rendered page using requests alone. First install Selenium.

sudo pip3 install selenium

Then get a driver: https://sites.google.com/a/chromium.org/chromedriver/downloads (depending on your OS, you may need to specify the location of the driver).

from bs4 import BeautifulSoup
from selenium import webdriver
import time

browser = webdriver.Chrome()
url = "https://www.google.nl/destination/compare?q=cuba+destinations&site=search&output=search&dest_mid=/m/0d04z6&sa=X&ved=0API_KEY"
browser.get(url)
time.sleep(2)  # crude wait for the page's XHRs to finish
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, "lxml")  # requires the lxml package
# Get the destination headings
hs = [tag.text for tag in soup.find_all('h2')]
# Get the text of the divs that have no class attribute
divs = [tag.text for tag in soup.find_all('div', {'class': False})]
# Delete surplus divs (these offsets depend on the page layout and may change)
del divs[:22]
del divs[-1:]

# Pair each heading with its description
print(list(zip(hs, divs)))

Output:

[('Havana', "Cuban capital known for Old Havana's colonial architecture, live salsa music & nearby beaches."), ('Varadero', 'Major Cuban resort town on Hicacos Peninsula, with a 20km beach, a golf course & several parks.'), ('Trinidad', 'Cuban town known for Plaza Mayor, colonial architecture & plantations of Valle de los Ingenios.'), ('Santiago de Cuba', 'Cuban city known for Afro-Cuban festivals & music, plus Spanish colonial & revolutionary history.'), ('Viñales', 'Cuban town known for Viñales Valley, Casa de Caridad Botanical Gardens & nearby tobacco farms.'), ('Cienfuegos', 'Cuban coastal city, known for Tomás Terry Theater, Arco de Triunfo & Playa Rancho Luna resorts.'), ('Santa Clara', 'Cuban city home to the Che Guevara Mausoleum, Parque Vidal & ornate Teatro La Caridad.'), ('Cayo Coco', 'Cuban island known for its white-sand beaches & resorts, plus reef snorkeling & flamingos.'), ('Cayo Santa María', 'Cuban island known for Gaviotas Beach, Cayo Santa María Wildlife Refuge & Pueblo La Estrella.'), ('Cayo Largo del Sur', 'Cuban island, known for beaches like Playa Blanca & Playa Sirena, plus a sea turtle center & diving.'), ('Plaza de la Revolución', 'Che Guevara and monuments'), ('Camagüey', 'Ballet, churches, history, and beaches'), ('Holguín', 'Cuban city known for Parque Calixto García, the Hacha de Holguín axe head & Guardalavaca beaches.'), ('Cayo Guillermo', 'Cuban island with beaches like Playa del Medio & Playa Pilar, plus vast expanses of coral reef.'), ('Matanzas', 'Caves, theater, beaches, history, and rivers'), ('Baracoa', 'Beaches, rivers, and nature'), ('Centro Habana', '\xa0'), ('Playa Girón', 'Beaches, snorkeling, and museums'), ('Topes de Collantes', 'Scenic nature reserve park for hiking'), ('Guardalavaca', 'Cuban resort known for Esmeralda Beach, the Cayo Naranjo Aquarium & the Chorro de Maíta Museum.'), ('Bay of Pigs', 'Snorkeling, scuba diving, and beaches'), ('Isla de la Juventud', 'Scuba diving and beaches'), ('Zapata Swamp', 'Parks, crocodiles, birdwatching, and swamps'), ('Pinar del Río', 'History'), ('Remedios', 'Churches, beaches, and museums'), ('Bayamo', 'Wax museums, monuments, history, and music'), ('Sierra Maestra', 'Peaks with a storied political history'), ('Las Terrazas', 'Zip-lining, nature reserves, and hiking'), ('Sancti Spíritus', 'History and museums'), ('Playa Ancon', 'Beaches, snorkeling, and scuba diving'), ('Jibacoa', 'Beaches, snorkeling, and jellyfish'), ('Jardines de la Reina', 'Scuba diving, fly-fishing, and gardens'), ('Cayo Jutías', 'Beach and snorkeling'), ('Guamá, Cuba', 'Crocodiles, beaches, snorkeling, and lakes'), ('Morón', 'Crocodiles, lagoons, and beaches'), ('Las Tunas', 'Beaches, nightlife, and history'), ('Soroa', 'Waterfalls, gardens, nature, and ecotourism'), ('Guanabo', 'Beach'), ('María la Gorda', 'Scuba diving, beaches, and snorkeling'), ('Alejandro de Humboldt National Park', 'Park, protected area, and hiking'), ('Ciego de Ávila', 'Zoos and beaches'), ('Bacunayagua', '\xa0'), ('Guantánamo', 'Beaches, history, and nature'), ('Cárdenas', 'Beaches, museums, monuments, and history'), ('Canarreos Archipelago', 'Sailing and coral reefs'), ('Caibarién', 'Beaches'), ('El Nicho', 'Waterfalls, parks, and nature'), ('San Luis Valley', 'Cranes, national wildlife refuge, and elk')]
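Some descriptions in the output above are just a non-breaking space ('\xa0'). A small clean-up step might look like the sketch below; `clean_pairs` is a hypothetical helper, and `sample` is a hand-copied subset of the output rather than a fresh scrape.

```python
def clean_pairs(pairs):
    """Normalise non-breaking spaces and map empty descriptions to None."""
    cleaned = []
    for name, description in pairs:
        # Replace non-breaking spaces with ordinary ones, then trim the ends
        text = description.replace("\xa0", " ").strip()
        cleaned.append((name.strip(), text or None))
    return cleaned


# Hand-copied subset of the scraped output shown above
sample = [
    ("Havana", "Cuban capital known for Old Havana's colonial architecture, live salsa music & nearby beaches."),
    ("Centro Habana", "\xa0"),
    ("Baracoa", "Beaches, rivers, and nature"),
]

for name, description in clean_pairs(sample):
    print(name, "->", description)
```

Using `None` for missing descriptions makes it easy to filter those rows out, or to fill them in later from another source.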

Update based on comments:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()
for place in ["Cuba", "Belgium", "France"]:
    url = "https://www.google.nl/destination/compare?site=destination&output=search"
    browser.get(url)  # you may not need to do this every time if you clear the search box
    time.sleep(2)
    # find_element(By..., ...) replaces the older find_element_by_* helpers
    element = browser.find_element(By.NAME, 'q')  # get the query box
    time.sleep(2)
    element.send_keys(place)  # populate the search box
    time.sleep(2)
    search_box = browser.find_element(By.CLASS_NAME, 'sbsb_c')  # first entry in the suggestion list
    search_box.click()  # click it
    time.sleep(2)
    destinations = browser.find_element(By.ID, 'DESTINATIONS')  # get the destinations link
    destinations.click()  # click it
    time.sleep(2)
    html_source = browser.page_source
    soup = BeautifulSoup(html_source, "lxml")
    # Get the destination headings
    hs = [tag.text for tag in soup.find_all('h2')]
    # Get the text of the divs that have no class attribute
    divs = [tag.text for tag in soup.find_all('div', {'class': False})]
    # Delete surplus divs (these offsets depend on the page layout and may change)
    del divs[:22]
    del divs[-1:]
    print(list(zip(hs, divs)))

browser.quit()
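The fixed `time.sleep(2)` calls above either waste time or are too short on a slow connection. A possible alternative is to poll until the element you need is present. The helper below is a generic, Selenium-independent sketch of that idea; `wait_until` is a hypothetical name, not a Selenium function (Selenium ships its own `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose).

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a browser open, and assuming `from selenium.webdriver.common.by import By`, it could be used as `wait_until(lambda: browser.find_elements(By.ID, 'DESTINATIONS'))`, which returns the non-empty list of matching elements as soon as the link appears.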
