How can I make a web scraper traverse multiple pages of search results using Beautiful Soup?


Problem description


I am trying to write a scraper to get results from the following page:

https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253da&page=1

I am trying to get all results, not just the "A" results, but I figured I could start with one letter and then run through the whole alphabet. If someone can assist with this part, that would be great too.

Anyway, I want to zero in on all party names, that is, the td elements with the class party-name.

I have the following code:

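from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the first page of results for names starting with "a"
html = urlopen("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253da&page=1")
# Name the parser explicitly so the result does not depend on which parsers are installed
bsObj = BeautifulSoup(html, "html.parser")
# Each party name sits in a <td class="party-name"> cell
nameList = bsObj.find_all("td", {"class": "party-name"})
for name in nameList:
    print(name.get_text())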

However, this only works for one page. The results span multiple pages. How can I do this for multiple pages?

Also, if you can help with getting all results, not just A, that would be great.

EDIT: I have improved my code and can now go over all searches. However, I still cannot go to the next page. I have tried incrementing page_number, but I do not know where to stop, since the number of result pages varies. How can I have it go to the next page and break after the last page?

New Code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Search prefixes: every lowercase letter and digit
all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
for letter in all_letters:
    # Only page 1 is ever requested for each prefix
    page_number = 1
    url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    nameList = bsObj.find_all("td", {"class": "party-name"})

    for name in nameList:
        print(name.get_text())

Solution

From what I understand, you want to change the "starts_with" parameter on the page and iterate over all the letters. If I have understood the question correctly, this might help.

If you analyze the URL, you will find your answer.

url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253da&page=1"
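One way to analyze the URL is to decode its query value. The sketch below uses only the standard library and shows that the value is percent-encoded twice: "%25" is an encoded "%", so "%253d" is a doubly encoded "=" and "%2526" is a doubly encoded "&". The helper build_url is a hypothetical name, and it assumes the prefix contains only letters and digits that need no escaping.

```python
from urllib.parse import unquote

# The value of the q parameter, copied from the URL in the question.
query = "nco1%253d2%2526name1%253da"

once = unquote(query)   # first pass turns %25 into %
twice = unquote(once)   # second pass turns %3d into = and %26 into &

print(once)
print(twice)


def build_url(starts_with, page=1):
    """Rebuild the search URL for a prefix and page (hypothetical helper).

    ASSUMPTION: starts_with is plain alphanumeric text that needs no escaping.
    """
    return (
        "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx"
        f"?q=nco1%253d2%2526name1%253d{starts_with}&page={page}"
    )
```

The fully decoded value reads nco1=2&name1=a, which suggests that name1 carries the search prefix and that page is a separate, singly encoded parameter.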

The letter after "%253d" sets the "starts_with" term. It is currently 'a', so the search returns names that start with 'a'. To iterate, change that part of the URL:

url = 'https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d' + starts_with + '&page=1'

starts_with can be either a single character (a, b, c, ...) or a longer string (abc, asde, ...).
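The answer covers the search prefix but not the paging part of the question. A minimal sketch of one possible approach follows: keep incrementing the page number until a page yields no party names. It rests on an assumption that has not been checked against the site, namely that requesting a page past the last one returns no td cells with the class party-name. The names collect_all_pages, party_names_on_page, scrape_prefix, and the max_pages safety cap are all hypothetical.

```python
import string
from urllib.request import urlopen

# URL pattern taken from the question; only the prefix and page number vary.
SEARCH_URL = (
    "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx"
    "?q=nco1%253d2%2526name1%253d{prefix}&page={page}"
)

# Letters and digits, matching the all_letters list in the question.
ALL_PREFIXES = string.ascii_lowercase + string.digits


def collect_all_pages(get_page_names, max_pages=500):
    """Call get_page_names(1), get_page_names(2), ... until a page is empty.

    max_pages is a hypothetical safety cap in case the site never
    returns an empty page.
    """
    names = []
    for page in range(1, max_pages + 1):
        page_names = get_page_names(page)
        if not page_names:
            break  # ASSUMPTION: an empty page means we are past the last page
        names.extend(page_names)
    return names


def party_names_on_page(prefix, page):
    """Fetch one results page and return the text of its party-name cells."""
    # Imported here so collect_all_pages can be used without bs4 installed.
    from bs4 import BeautifulSoup

    url = SEARCH_URL.format(prefix=prefix, page=page)
    with urlopen(url) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    cells = soup.find_all("td", {"class": "party-name"})
    return [td.get_text(strip=True) for td in cells]


def scrape_prefix(prefix):
    """Return every party name for one prefix, across all of its pages."""
    return collect_all_pages(lambda page: party_names_on_page(prefix, page))
```

Looping over ALL_PREFIXES and calling scrape_prefix on each entry would then cover every letter and digit. If the site instead repeats the last page or raises an HTTP error for out-of-range page numbers, the stop condition would need to change, for example by comparing each page with the previous one or by catching the error.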
