Missing values while scraping using BeautifulSoup in Python


Question

I'm trying to do web scraping as my first Python project (I'm completely new to programming). I'm almost done, but some values on the web page are missing, so I want to replace each missing value with something like "0" or "Not found". I just want to make a CSV file out of the data; I'm not planning any further analysis.

The web page I'm trying to scrape is: https://www.lamudi.com.mx/nuevo-leon/departamento/for-rent/?page=1

I have a loop that collects all of the links on the page and then visits each one to scrape the data and save it in a list. However, some of my lists have fewer elements than others. So I just want my program to detect when a value is missing and append "0" or "Not found" to my sizes list.

For collecting the links on the page:

tags = soup('a',{'class':'js-listing-link'})
for tag in tags:
    link = tag.get('href')
    if link not in links:
        links.append(link)

print("Number of Links:", len(links))

For collecting the size of each apartment (departamento):

for link in links:
    size = soup('span',{'class':'Overview-attribute icon-livingsize-v4'})
    for mysize in size:
        mysize = mysize.get_text().strip()
        sizes.append(mysize)

print("Number of Sizes:", len(sizes))

Recommended answer

On this page, you can select all listing rows (with .select('.ListingCell-row')) and then select the information within each row, substituting - for any missing value:

import requests
from bs4 import BeautifulSoup


url = 'https://www.lamudi.com.mx/nuevo-leon/departamento/for-rent/?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for row in soup.select('.ListingCell-row'):
    name = row.h3.get_text(strip=True)
    link = row.h3.a['href']
    size = row.select_one('.icon-livingsize')
    size = size.get_text(strip=True) if size else '-'
    print(name)
    print(link)
    print(size)
    print('-' * 80)

Prints:

Loft en Renta Amueblado Una Recámara Cerca Udem
https://www.lamudi.com.mx/loft-en-renta-amueblado-una-recamara-cerca-udem.html
50 m²
--------------------------------------------------------------------------------
DEPARTAMENTO EN RENTA SAN JERONIMO EQUIPADO
https://www.lamudi.com.mx/departamento-en-renta-san-jeronimo-equipado.html
-
--------------------------------------------------------------------------------
Departamento - Narvarte
https://www.lamudi.com.mx/departamento-narvarte-58.html
60 m²
--------------------------------------------------------------------------------

...and so on.
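
Since the question mentions wanting a CSV file, here is a minimal sketch of how the same per-row pattern could feed one. It builds one record per listing instead of separate parallel lists, so a missing size cannot shift the other columns out of alignment. The sample HTML, the helper names (text_or_default, extract_rows, rows_to_csv) and the "Not found" placeholder are assumptions made for illustration; the real page's markup may differ.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the listing page; the real markup may differ.
SAMPLE_HTML = """
<div class="ListingCell-row">
  <h3><a href="https://example.com/a.html">Listing A</a></h3>
  <span class="icon-livingsize">50 m²</span>
</div>
<div class="ListingCell-row">
  <h3><a href="https://example.com/b.html">Listing B</a></h3>
</div>
"""


def text_or_default(parent, selector, default="Not found"):
    """Return the stripped text of the first match, or `default` if none."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag else default


def extract_rows(html):
    """Build one dict per listing so every record has the same fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select(".ListingCell-row"):
        link_tag = row.select_one("h3 a")
        rows.append({
            "name": text_or_default(row, "h3"),
            "link": link_tag.get("href", "Not found") if link_tag else "Not found",
            "size": text_or_default(row, ".icon-livingsize"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the records to CSV text (kept in memory for this sketch)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "link", "size"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To write an actual file instead of a string, the same csv.DictWriter could be given a file opened with open('sizes.csv', 'w', newline='', encoding='utf-8'), where sizes.csv is just a placeholder name.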
