Web scraping: extracting URLs from XPath in HTML using Python: Airbnb listings

Problem description

I am trying to extract the URLs of listings from a city page on Airbnb, using Python 3 libraries. I am familiar with scraping simpler websites using the BeautifulSoup and requests libraries.

url: 'https://www.airbnb.com/s/Denver--CO--United-States/homes'

Elements in the HTML

If I inspect the element of a link on the page (in Chrome), I get:

xpath: "//*[@id="listing-9770909"]/div[2]/a"
selector: "#listing-9770909 > div._v72lrv > a"

My attempt:

import requests
from bs4 import BeautifulSoup

url = 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
divs = soup.find_all('div', attrs={'id': 'listing'})

Attempt 2:

import requests
from lxml import html

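# Same search-results URL as in the first attempt
url = 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
page = requests.get(url)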
root = html.fromstring(page.content)
tree = root.getroottree()
result = root.xpath('//div[@id="listing-9770909"]/div[2]/a')
for r in result:
    print(r)

Neither of these returns anything. What I need to extract is the URL of each listing's page link. Any ideas?

Recommended answer

To extract the links, first make sure that the listing URLs actually exist in the page source. To check, search the page source for any of the listing IDs (Ctrl+U opens the source in Google Chrome and Mozilla Firefox). If the URLs are in the page source, you can scrape them directly by running an XPath query against the response text of the listing page. Here, the Airbnb listing page above does not have the links in its page source, so the page is probably sending requests to other endpoints (usually JSON requests) to load them. You can identify those requests, for example in the Network tab of the browser's developer tools, send the same requests yourself, and read the required data from the responses. Please comment if anything here is unclear.
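
The two checks described above could be sketched roughly as follows. This is a minimal illustration, not Airbnb's actual API: the JSON layout (a `listing` object carrying an `id`), the `container_key` parameter, and the `/rooms/<id>` URL pattern are assumptions made for the example, and the helper names are hypothetical. The page source would come from `requests.get(url).text` as in the question, and the payload from whatever JSON request you find the page making.

```python
import json
import re

# Matches element IDs of the form seen in the question, e.g. "listing-9770909".
LISTING_ID_RE = re.compile(r"listing-(\d+)")


def ids_in_source(page_source):
    """Return the set of listing IDs that literally appear in the raw HTML.

    An empty set suggests the listings are rendered by JavaScript, so an
    XPath query against the raw response cannot find them.
    """
    return set(LISTING_ID_RE.findall(page_source))


def find_values(payload, key):
    """Recursively yield every value stored under `key` in parsed JSON."""
    if isinstance(payload, dict):
        for k, v in payload.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(payload, list):
        for item in payload:
            yield from find_values(item, key)


def listing_urls(payload, container_key="listing",
                 base_url="https://www.airbnb.com/rooms/"):
    """Build listing URLs from a parsed JSON payload.

    ASSUMPTION: each listing is a dict stored under `container_key` and has
    an "id" field; the real response may be shaped differently.
    """
    urls = []
    for entry in find_values(payload, container_key):
        if isinstance(entry, dict) and "id" in entry:
            urls.append(f"{base_url}{entry['id']}")
    return urls


# Hypothetical JSON body, standing in for a response captured from the page.
sample = json.loads('{"results": [{"listing": {"id": 9770909}}]}')
print(listing_urls(sample))
```

If `ids_in_source` returns an empty set for the raw response, that would explain why both attempts in the question come back empty, and the JSON route is the one to pursue.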
