BeautifulSoup to scrape street address
I am using the code at the bottom to get the web link and the Masjid name. However, I would also like to get the denomination and the street address. Please help, I am stuck.
Currently I am getting the following:
Web link:
<div class="subtitleLink"><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah">
and Masjid name
<b>Masjid Al-Hijrah</b>
But I would like to get the following as well:
Denomination
<b>Denomination:</b> Sunni (Traditional)
and street address
<br>45 Station Street (Sydney)
The code below scrapes the following HTML:
<td width=25><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah"><img src='http://www.halalfire.com/images/en/photo_small.jpg' alt='Masjid Al-Hijrah' title='Masjid Al-Hijrah' border=0 width=48 height=36></a></a></td><td width=10><img src="http://www.salatomatic.com/images/spacer.gif" width=10 border=0></td><td nowrap><div class="subtitleLink"><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah"><b>Masjid Al-Hijrah</b></a> </div><div class="tinyLink"><b>Denomination:</b> Sunni (Traditional)<br>45 Station Street (Sydney) </div></td><td align=right valign=center><div class="tinyLink"></div></td>
Code:
from bs4 import BeautifulSoup
import urllib2

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)

results = soup.findAll("div", {"class": "subtitleLink"})
for result in results:
    br = result.find('b')
    a = result.find('a')
    currenturl = a.get('href')
    if not currenturl.startswith("http"):
        currenturl = "http://www.salatomatic.com" + currenturl
        print currenturl
    elif currenturl.startswith("http"):
        print a.get('href')
    pos = br.get_text()
    print pos
You can check for the next <div> element whose class attribute is tinyLink and that contains both a <b> and a <br> tag, then extract the strings that follow those tags:
    ...
    print pos
    div = result.find_next_sibling('div', attrs={"class": "tinyLink"})
    if div and div.b and div.br:
        print(div.b.next_sibling.string)
        print(div.br.next_sibling.string)
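For anyone who wants to try the approach without hitting the live site, here is a minimal self-contained sketch in Python 3. It runs against the HTML row quoted in the question rather than the real page, so it assumes the page still has that structure. The helper names extract_listings and text_after are made up for illustration, the explicit "html.parser" argument is my choice, and urljoin is swapped in for the startswith("http") check from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.salatomatic.com"

# One table cell copied from the question's HTML (not fetched from the site).
SAMPLE_HTML = (
    '<td nowrap><div class="subtitleLink">'
    '<a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah">'
    '<b>Masjid Al-Hijrah</b></a> </div>'
    '<div class="tinyLink"><b>Denomination:</b> Sunni (Traditional)'
    '<br>45 Station Street (Sydney) </div></td>'
)


def text_after(tag):
    """Return the stripped text node right after `tag`, or None (hypothetical helper)."""
    sibling = tag.next_sibling
    return str(sibling).strip() if sibling is not None else None


def extract_listings(html):
    """Return one dict per subtitleLink div (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for result in soup.find_all("div", attrs={"class": "subtitleLink"}):
        link = result.find("a")
        name = result.find("b")
        entry = {
            # urljoin leaves absolute URLs alone and prefixes relative ones.
            "url": urljoin(BASE_URL, link.get("href")) if link else None,
            "name": name.get_text(strip=True) if name else None,
            "denomination": None,
            "address": None,
        }
        # The details sit in the tinyLink div that follows the title div.
        details = result.find_next_sibling("div", attrs={"class": "tinyLink"})
        if details and details.b and details.br:
            entry["denomination"] = text_after(details.b)  # text after <b>
            entry["address"] = text_after(details.br)      # text after <br>
        listings.append(entry)
    return listings


listings = extract_listings(SAMPLE_HTML)
for entry in listings:
    print(entry)
```

Collecting the values into dicts instead of printing them one by one makes it easier to write them to a CSV file or filter them later. To run it against the real page, pass the downloaded HTML to extract_listings instead of SAMPLE_HTML.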