How to extract data from multiple URLs using Python


Question

Hi, I want to scrape data from multiple URLs. I am doing it like this:

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

but it is not giving me complete data; it is printing only the last URL's data.

Here is my code, please help:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2
import operator


for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)

    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    # html parsing
    page1_soup = soup(page1_html, 'html.parser')

    # grabbing data
    containers = page1_soup.findAll('div', {'class': 'PA15'})

    # Make the connection to PostgreSQL
    conn = psycopg2.connect(database='--',user='--', password='--', port=--)
    cursor = conn.cursor()
    for container in containers:
        toll_name1 = container.p.b.text
        toll_name = toll_name1.split(" ")[1]

        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]

        text = search1[1].get_text()
        onset = text.index('in')
        offset = text.index('Stretch')
        state = str(text[onset +2:offset]).strip(' ')

        location = list(container.p.descendants)[10]
        mystr = my_url[my_url.find('?'):]
        TID = mystr.strip('?TollPlazaID=')

        query = "INSERT INTO tollmaster (TID, toll_name, location, highway_number, state) VALUES (%s, %s, %s, %s, %s);"
        data = (TID, toll_name, location, highway_number, state)

        cursor.execute(query, data)

# Commit the transaction
conn.commit()

but it's displaying only the second-last URL's data.

Answer

It seems like some of the pages are missing your key information; you can use error catching for it, like this:

try:
    tbody = page1_soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    continue  # Skip this page if no items were scraped

You may want to add some logging/print statements to keep track of nonexistent tables.
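For instance, a minimal sketch using the standard logging module (the page1_soup and i names follow the question's code, and the tollinfotbl class comes from the snippet above; the fetching/parsing step is elided):

import logging

logging.basicConfig(level=logging.INFO)

for i in range(493):
    # ... fetch and parse the page into page1_soup as in the question ...
    try:
        tbody = page1_soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
    except IndexError:
        logging.info('No tollinfotbl table for TollPlazaID=%s, skipping', i)
        continue  # skip pages without the toll table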

It's showing information from only the last page because you are committing your transaction outside the for loop and overwriting your conn for every i. Just put conn.commit() inside the for loop, at the far end.
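A rough sketch of that restructuring follows. Moving the psycopg2 connection above the loop is an assumption beyond the answer's text (it avoids reopening the connection 493 times); the credentials stay as the question's placeholders, and the per-container parsing is elided:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2

# Open the connection once, before the loop (placeholder credentials from the question)
conn = psycopg2.connect(database='--', user='--', password='--', port='--')
cursor = conn.cursor()

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)
    uClient = uReq(my_url)
    page1_soup = soup(uClient.read(), 'html.parser')
    uClient.close()

    containers = page1_soup.findAll('div', {'class': 'PA15'})
    for container in containers:
        pass  # same parsing and INSERT statements as in the question

    conn.commit()  # commit at the far end of each outer-loop iteration

conn.close()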
