Trouble dealing with links having differently paginated structures


Problem Description

I've written a script in Python to scrape the titles of different items located in the right-hand area next to the map on each landing page. I've used two links within my script: one has pagination and the other doesn't.

When I execute my script, it first checks for pagination links. If it finds one, it passes the links to the get_paginated_info() function to print the results there. However, if it fails to find pagination links, it passes the soup object to the get_info() function and prints the results there. At this moment the script works exactly the way I've described.

How can I make my script print the results within the get_info() function alone, whether the link has pagination or not, while still complying with the logic I've already tried to apply? I'd like to drop the get_paginated_info() function from my script.

This is my attempt so far:

import requests
from bs4 import BeautifulSoup

urls = (
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
)

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    # The "Next" link only exists on paginated listings
    items = soup.select_one(".pagination a.next_page")
    if items:
        # The sibling just before "Next" holds the last page number,
        # e.g. href=".../page/7" -> 7
        npagelink = items.find_previous_sibling().get("href").split("/")[-1]
        return [get_paginated_info(link + "/page/{}".format(page)) for page in range(1, int(npagelink) + 1)]
    else:
        return [get_info(soup)]

def get_info(soup):
    print("================links without pagination==============")
    for items in soup.select("td[class='table-row-price']"):
        item = items.select_one("h2 a").text
        print(item)

def get_paginated_info(url):
    r = requests.get(url)
    sauce = BeautifulSoup(r.text, "lxml")
    print("================links with pagination==============")
    for content in sauce.select("td[class='table-row-price']"):
        title = content.select_one("h2 a").text
        print(title)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

Any better design capable of dealing with these different kinds of links will be highly appreciated.

Recommended Answer

I've slightly changed the logic. Now the script calls get_info in both cases, whether pagination is present or not; when there is no pagination, the for loop simply runs a single iteration (page 1).

import requests
from bs4 import BeautifulSoup

urls = (
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
)

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    items = soup.select_one(".pagination a.next_page")
    try:
        # The sibling just before "Next" holds the last page number
        npagelink = items.find_previous_sibling().get("href").split("/")[-1]
    except AttributeError:
        # No pagination block on the page: treat it as a single page
        npagelink = 1
    return [get_info(link + "/page/{}".format(page)) for page in range(1, int(npagelink) + 1)]


def get_info(url):
    r = requests.get(url)
    sauce = BeautifulSoup(r.text, "lxml")
    for content in sauce.select("td[class='table-row-price']"):
        title = content.select_one("h2 a").text
        print(title)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

Please double-check the output to make sure everything works as expected.
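One way to do part of that check offline is to isolate the page-count detection in a small helper and unit-test it against static HTML snippets. The markup below is a hypothetical sketch mimicking the site's pagination structure (the class names .pagination and next_page come from the script above; the sample hrefs are made up), so no network request is needed:

```python
from bs4 import BeautifulSoup

def get_page_count(soup):
    """Return the total number of listing pages, defaulting to 1
    when the page has no pagination block."""
    next_link = soup.select_one(".pagination a.next_page")
    try:
        # The link just before "Next" holds the last page number,
        # e.g. href=".../page/7" -> 7
        return int(next_link.find_previous_sibling().get("href").split("/")[-1])
    except AttributeError:
        # select_one() returned None: no pagination on this page
        return 1

# Hypothetical snippets standing in for real responses
paginated = BeautifulSoup(
    '<div class="pagination"><a href="/all/page/7">7</a>'
    '<a class="next_page" href="/all/page/2">Next</a></div>',
    "html.parser")
plain = BeautifulSoup('<div>no pagination here</div>', "html.parser")

print(get_page_count(paginated))  # 7
print(get_page_count(plain))      # 1
```

With a helper like this, get_names only has to build the page URLs for range(1, get_page_count(soup) + 1), and the fallback-to-one-page behaviour is covered without hitting the live site.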

