Scraping data from href
Question
I was trying to get the postcodes for DFS. For that I tried getting the href for each shop and then clicking on it; the next page has the shop location from which I can get the postal code. But I am not able to get things working. Where am I going wrong?
I tried getting the upper-level attribute td.searchResults first, then for each of them I try to click on the href with title DFS, and after clicking I get the postalCode. Eventually I iterate over all three pages.
If there is a better way to do it, let me know.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')

for l in listings:
    while True:
        driver.find_element_by_css_selector("a[title*='DFS']").click()
        shops = {}
        #info = soup.find('span', itemprop='postalCode').contents
        html = driver.page_source
        soup = BeautifulSoup(html)
        info = soup.find(itemprop="postalCode").get_text()
        shops.append(info)
Update:
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')

for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    shops = []
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find_all('span', attrs={"itemprop": "postalCode"})
    for m in info:
        if m:
            m_text = m.get_text()
            shops.append(m_text)
    print(shops)
Answer
So after playing with this for a little while, I don't think the best way to do this is with selenium. It would require using driver.back() and waiting for elements to re-appear, and a whole mess of other stuff. I was able to get what you want using just requests, re and bs4. re is included in the Python standard library, and if you haven't installed requests, you can do it with pip as follows: pip install requests
from bs4 import BeautifulSoup
import re
import requests

base_url = 'http://www.localstore.co.uk'
url = 'http://www.localstore.co.uk/stores/75061/dfs/'

res = requests.get(url)
soup = BeautifulSoup(res.text)

shops = []
links = soup.find_all('a', href=re.compile('.*\/store\/.*'))
for l in links:
    full_link = base_url + l['href']
    town = l['title'].split(',')[1].strip()
    res = requests.get(full_link)
    soup = BeautifulSoup(res.text)
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    postalcode = info.text
    shops.append(dict(town_name=town, postal_code=postalcode))

print(shops)
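For comparison, if you did want to stay with Selenium, the back-and-wait pattern mentioned above would look roughly like the sketch below. This is only an illustration and is untested against this site; it reuses the a[title*='DFS'] and span[itemprop='postalCode'] selectors from the question, assumes they still match, and does not handle paging through all three result pages.

# Rough sketch only: driver.back() plus explicit waits, using the selectors
# from the question. Not a tested solution for this site.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
wait = WebDriverWait(driver, 10)

shops = []
# Count the store links once, then re-find them on every pass, because the
# old element references go stale after each click()/back() round trip.
links = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, "a[title*='DFS']")))
for i in range(len(links)):
    links = wait.until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "a[title*='DFS']")))
    links[i].click()
    # Wait for the detail page to render the postcode before parsing it.
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "span[itemprop='postalCode']")))
    soup = BeautifulSoup(driver.page_source)
    shops.append(soup.find('span', attrs={"itemprop": "postalCode"}).get_text())
    driver.back()

driver.quit()
print(shops)

The re-query at the top of the loop is what makes this work at all: after click() and back(), the original element references are stale, which is exactly the extra bookkeeping that makes the plain requests approach simpler here.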