How to iterate through multiple results pages when web scraping with Beautiful Soup

Problem description

I have a script, written with Beautiful Soup, that scrapes a website for search results. I have managed to isolate the data I want via its class name.

However, the search results are not on a single page. Instead, they are spread over multiple pages, so I want to get them all. I want my script to be able to check whether there is a next results page and run itself there as well. Since the results vary in number, I do not know how many pages of results exist, so I cannot predefine a range to iterate over. I have also tried an 'if_page_exists' check, but if I request a page number that is beyond the range of results, the page always exists; it just has no results, only a message saying there is nothing to display.

What I have noticed, however, is that each results page has a 'Next' link with id 'NextLink1', and the last results page does not. So I think that's where the magic might be, but I don't know how and where to implement that check; my attempts have produced infinite loops.
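A minimal sketch of that idea, assuming (as described above) that the 'Next' link's id is NextLink1: keep fetching successive pages, and stop once the link is missing or a page comes back with no results.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_all_pages(letter):
    page_number = 1
    while True:
        url = ("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/"
               "Search.aspx?q=nco1%253d2%2526name1%253d" + letter
               + "&page=" + str(page_number))
        soup = BeautifulSoup(urlopen(url), "html.parser")
        names = soup.find_all("td", {"class": "party-name"})
        for name in names:
            print(name.get_text())
        # The last results page has no 'Next' link, so stop when no
        # element with id 'NextLink1' is present (or the page is empty).
        if not names or soup.find(id="NextLink1") is None:
            break
        page_number += 1

scrape_all_pages("x")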

The script below finds the results for the search term 'x'. Assistance would be greatly appreciated.

from urllib.request import urlopen
from bs4 import BeautifulSoup

#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
all_letters = ['x']
for letter in all_letters:

    page_number = 1
    url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    nameList = bsObj.find_all("td", {"class": "party-name"})

    for name in nameList:
        print(name.get_text())

Also, does anyone know a shorter way of instantiating a list of alphanumeric characters than the one I commented out in the script above?
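As an aside on that last point, Python's built-in string module already defines these character sets, so the commented-out list can be built like this:

import string

# Lowercase letters a-z followed by the digits 0-9, as single characters.
all_letters = list(string.ascii_lowercase + string.digits)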

Answer

Try this:

from urllib.request import urlopen
from bs4 import BeautifulSoup


#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
all_letters = ['x']

def get_url(letter, page_number):
    return "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)

def list_names(soup):
    # Print every party name found on the given results page.
    nameList = soup.find_all("td", {"class": "party-name"})
    for name in nameList:
        print(name.get_text())

def get_soup(letter, page):
    url = get_url(letter, page)
    html = urlopen(url)
    return BeautifulSoup(html, "html.parser")

def main():
    for letter in all_letters:
        bsObj = get_soup(letter, 1)

        # The page-number select element lists every results page; collect
        # the values of all options except the currently selected one (page 1).
        pages = []
        sel = bsObj.find('select', {"name": "ctl00$ctl00$InternetApplication_Body$WebApplication_Body$SearchResultPageList1"})
        if sel is not None:  # only present when there is more than one page
            for opt in sel.findChildren("option", selected=lambda x: x != "selected"):
                pages.append(opt.string)

        list_names(bsObj)

        # Fetch and scrape each remaining page.
        for page in pages:
            bsObj = get_soup(letter, page)
            list_names(bsObj)

main()

In the main() function, starting from the first page (get_soup(letter, 1)), we find the page-number select element and store its option values, which contain all the page numbers, in a list.
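The option filter works because Beautiful Soup passes each option's selected attribute value to the lambda; it is None when the attribute is absent, so every option except the currently selected one matches. A self-contained illustration, using hypothetical markup that stands in for the page's real (much longer) select element:

from bs4 import BeautifulSoup

# Hypothetical markup mirroring what the search page presumably renders;
# the real element's name attribute is the long ASP.NET id used above.
html = """
<select name="SearchResultPageList1">
  <option selected="selected" value="1">1</option>
  <option value="2">2</option>
  <option value="3">3</option>
</select>
"""
soup = BeautifulSoup(html, "html.parser")
sel = soup.find("select")
# The lambda receives each option's 'selected' attribute value, which is
# None for unselected options, so the currently selected page is skipped.
for opt in sel.findChildren("option", selected=lambda x: x != "selected"):
    print(opt.string)   # prints 2 and 3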

Next, we loop over those page numbers to extract the data from the remaining pages.
