How to extract data from multiple URLs using Python
Question
Hi, I want to scrape data from multiple URLs. I am doing it like this:
for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)
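The loop builds one URL per plaza ID by substituting the loop index into the template string; for example (ID 7 is an arbitrary illustration):

```python
# Substituting a sample ID into the same URL template used in the loop.
base = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'
my_url = base.format(7)
print(my_url)  # http://tis.nhai.gov.in/TollInformation?TollPlazaID=7
```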
but it is not giving me complete data; it is printing only the last URL's data.
Here is my code, please help:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import psycopg2
import operator

for i in range(493):
    my_url = 'http://tis.nhai.gov.in/TollInformation?TollPlazaID={}'.format(i)
    uClient = uReq(my_url)
    page1_html = uClient.read()
    uClient.close()
    # html parsing
    page1_soup = soup(page1_html, 'html.parser')
    # grabbing data
    containers = page1_soup.findAll('div', {'class': 'PA15'})
    # Make the connection to PostgreSQL
    conn = psycopg2.connect(database='--', user='--', password='--', port=--)
    cursor = conn.cursor()
    for container in containers:
        toll_name1 = container.p.b.text
        toll_name = toll_name1.split(" ")[1]
        search1 = container.findAll('b')
        highway_number = search1[1].text.split(" ")[0]
        text = search1[1].get_text()
        onset = text.index('in')
        offset = text.index('Stretch')
        state = str(text[onset + 2:offset]).strip(' ')
        location = list(container.p.descendants)[10]
        mystr = my_url[my_url.find('?'):]
        TID = mystr.strip('?TollPlazaID=')
        query = "INSERT INTO tollmaster (TID, toll_name, location, highway_number, state) VALUES (%s, %s, %s, %s, %s);"
        data = (TID, toll_name, location, highway_number, state)
        cursor.execute(query, data)

# Commit the transaction
conn.commit()
but it's displaying only the second-last URL's data.
Answer
It seems like some of the pages are missing your key information; you can use error-catching for it, like this:
try:
    tbody = soup('table', {"class": "tollinfotbl"})[0].find_all('tr')[1:]
except IndexError:
    continue  # Skip this page if no items were scraped
You may want to add some logging/print statements to keep track of the nonexistent tables.
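The same try/except pattern with a print for skipped pages can be sketched on plain lists, so it runs without the network or BeautifulSoup (an empty list stands in for a page whose table is missing):

```python
# Plain-list stand-in for scraped pages; [] plays the role of a page
# with no 'tollinfotbl' table on it.
pages = [['header', 'row1'], [], ['header', 'row2']]

for i, rows in enumerate(pages):
    try:
        # Mirrors find_all('tr')[1:] followed by using the first data row.
        first_data_row = rows[1:][0]
    except IndexError:
        print('Skipping page {}: no table rows found'.format(i))
        continue
    print('Page {}: {}'.format(i, first_data_row))
```

Here pages 0 and 2 are processed normally, while page 1 is logged and skipped.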
It's showing information from only the last page because you are committing your transaction outside the for loop, overwriting your conn for every i. Just put conn.commit() inside the for loop, at the far end.
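The corrected structure can be sketched as follows. To keep the sketch runnable, sqlite3 stands in for psycopg2 here (the connection is created once, before the loop, and the commit sits inside the loop, exactly as the answer describes; the table and values are illustrative placeholders):

```python
import sqlite3

# sqlite3 stands in for the psycopg2 connection so this runs standalone;
# the placement of connect() and commit() is the point being shown.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE tollmaster (TID TEXT, toll_name TEXT)')

for i in range(3):  # stands in for range(493)
    # ... scraping and field extraction for page i would go here ...
    cursor.execute('INSERT INTO tollmaster VALUES (?, ?)',
                   (str(i), 'toll-{}'.format(i)))
    conn.commit()  # commit inside the loop, at the far end

print(cursor.execute('SELECT COUNT(*) FROM tollmaster').fetchone()[0])  # 3
```

Creating the connection once also avoids re-opening (and overwriting) conn on every iteration, which discarded the uncommitted work of earlier pages.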