使用硒刮取多个选择选项 [英] Scraping multiple select options using Selenium

查看:37
本文介绍了使用硒刮取多个选择选项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我必须从 https://secc.gov.in/lgdStateList 网站上抓取PDF.州,区和街区共有3个下拉菜单.有几个州,在每个州下我们都有区,在每个区下都有街区.

I am required to scrape PDF's from the website https://secc.gov.in/lgdStateList. There are 3 drop-down menus for a state, a district and a block. There are several states, under each state we have districts and under each district there are blocks.

我试图实现以下代码.我可以选择州,但是选择地区时似乎出现了一些错误.

I tried to implement the following code. I was able to select the state, but there seems to be some error when I select the district.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
url = ("https://secc.gov.in/lgdStateList")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, 'html.parser')
for name_list in soup.find_all(class_ ='dropdown-row'):
    print(name_list.text)

driver = webdriver.Chrome()
driver.get('https://secc.gov.in/lgdStateList')

selectState = Select(driver.find_element_by_id("lgdState"))
for state in selectState.options:
    state.click()
    selectDistrict = Select(driver.find_element_by_id("lgdDistrict"))
    for district in selectDistrict.options:
        district.click()
        selectBlock = Select(driver.find_element_by_id("lgdBlock"))
        for block in selectBlock.options():
            block.click()

我遇到的错误是:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="lgdDistrict"]"}
  (Session info: chrome=83.0.4103.106)

我需要在3个菜单中爬行的帮助.

I need help crawling through the 3 menus.

任何帮助/建议都将不胜感激.让我知道评论中的任何澄清.

Any help/suggestions would be really appreciated. Let me know of any clarifications in the comments.

推荐答案

在这里您可以找到的不同状态.您可以从地区和街区下拉菜单中找到相同的内容.

This is where you can find the value of different states. You can find the same from district and block dropdowns.

您现在应该在有效负载中使用这些值来获取要从中获取数据的表:

You should now use those values within payload to get the table you would like to grab data from:

import urllib3
import requests
from bs4 import BeautifulSoup

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

link = "https://secc.gov.in/lgdGpList"

payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}

r = requests.post(link,data=payload,verify=False)
soup = BeautifulSoup(r.text,"html.parser")
for items in soup.select("table#example tr"):
    data = [' '.join(item.text.split()) for item in items.select("th,td")]
    print(data)

脚本生成的输出:

['Select State', 'Select District', 'Select Block']
['', 'Select District', 'Select Block']
['ARARIA BASTI (93638)', 'BANGAMA (93639)', 'BANSBARI (93640)']
['BASANTPUR (93641)', 'BATURBARI (93642)', 'BELWA (93643)']
['BOCHI (93644)', 'CHANDRADEI (93645)', 'CHATAR (93646)']
['CHIKANI (93647)', 'DIYARI (93648)', 'GAINRHA (93649)']
['GAIYARI (93650)', 'HARIA (93651)', 'HAYATPUR (93652)']
['JAMUA (93653)', 'JHAMTA (93654)', 'KAMALDAHA (93655)']
['KISMAT KHAWASPUR (93656)', 'KUSIYAR GAWON (93657)', 'MADANPUR EAST (93658)']
['MADANPUR WEST (93659)', 'PAIKTOLA (93660)', 'POKHARIA (93661)']
['RAMPUR KODARKATTI (93662)', 'RAMPUR MOHANPUR EAST (93663)', 'RAMPUR MOHANPUR WEST (93664)']
['SAHASMAL (93665)', 'SHARANPUR (93666)', 'TARAUNA BHOJPUR (93667)']

您需要抓取上面每个结果旁边括号中的可用数字,然后在 payload 中使用它们,并发送另一个发布请求以下载pdf文件.在执行之前,请确保将脚本放在文件夹中,以便可以获取其中的所有文件.

You need to scrape the numbers available in brackets adjacent to each results above and then use them in payload and send another post requests to download the pdf files. Make sure to put the script in a folder before execution so that you can get all the files within.

import urllib3
import requests
from bs4 import BeautifulSoup

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

link = "https://secc.gov.in/lgdGpList"
download_link = "https://secc.gov.in/downloadLgdwisePdfFile"

payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}
r = requests.post(link,data=payload,verify=False)
soup = BeautifulSoup(r.text,"html.parser")
for item in soup.select("table#example td > a[onclick^='downloadLgdFile']"):
    gp_code = item.text.strip().split("(")[1].split(")")[0]
    payload['gpCode'] = gp_code
    with open(f'{gp_code}.pdf','wb') as f:
        f.write(requests.post(download_link,data=payload,verify=False).content)

这篇关于使用硒刮取多个选择选项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆