BeautifulSoup to scrape street address


Problem description

I am using the code at the bottom of this post to get the web link and the masjid name. However, I would also like to get the denomination and the street address. Please help, I am stuck.

Currently I am getting the following:

Weblink:

<div class="subtitleLink"><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah">

and Masjid name

<b>Masjid Al-Hijrah</b>

But I would also like to get the following:

Denomination

<b>Denomination:</b> Sunni (Traditional)

and street address

<br>45 Station Street (Sydney)&nbsp;&nbsp;

The code below scrapes the following markup:

<td width=25><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah"><img src='http://www.halalfire.com/images/en/photo_small.jpg' alt='Masjid Al-Hijrah' title='Masjid Al-Hijrah' border=0 width=48 height=36></a></a></td><td width=10><img src="http://www.salatomatic.com/images/spacer.gif" width=10 border=0></td><td nowrap><div class="subtitleLink"><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah"><b>Masjid Al-Hijrah</b></a>&nbsp;&nbsp; </div><div class="tinyLink"><b>Denomination:</b> Sunni (Traditional)<br>45 Station Street (Sydney)&nbsp;&nbsp;</div></td><td align=right valign=center><div class="tinyLink"></div></td>

CODE:

from bs4 import BeautifulSoup
import urllib2

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1) 

results = soup.findAll("div", {"class" : "subtitleLink"})
for result in results :
    br = result.find('b')
    a = result.find('a')
    currenturl =  a.get('href')
    if not currenturl.startswith("http"):
        currenturl = "http://www.salatomatic.com" + currenturl
        print currenturl
    elif currenturl.startswith("http"):
        print a.get('href')
    pos = br.get_text()
    print pos

Solution

You can look up the next sibling <div> element whose class attribute has the value tinyLink, check that it contains both a <b> and a <br> tag, and then extract the string that follows each of them:

...
print pos 
div = result.find_next_sibling('div', attrs={"class": "tinyLink"})
if div and div.b and div.br:
    print(div.b.next_sibling.string)
    print(div.br.next_sibling.string)
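
For reference, here is a minimal Python 3 sketch of the same idea that runs against the sample markup from the question instead of the live site. The names extract_listings, text_after, SAMPLE_HTML and BASE_URL are hypothetical and only serve the illustration. The sketch assumes the live page has the same structure as the sample, and that the node right after each <b> and <br> is plain text.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Site root taken from the question; used to resolve relative links.
BASE_URL = "http://www.salatomatic.com"

# Sample markup copied from the question (one cell of the listing table).
SAMPLE_HTML = (
    '<td nowrap><div class="subtitleLink">'
    '<a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah">'
    '<b>Masjid Al-Hijrah</b></a>&nbsp;&nbsp; </div>'
    '<div class="tinyLink"><b>Denomination:</b> Sunni (Traditional)'
    '<br>45 Station Street (Sydney)&nbsp;&nbsp;</div></td>'
)


def text_after(tag):
    """Return the stripped text node that follows `tag`, or '' if none."""
    sibling = tag.next_sibling
    return str(sibling).strip() if sibling is not None else ""


def extract_listings(html):
    """Collect name, URL, denomination and address per subtitleLink div."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for result in soup.find_all("div", attrs={"class": "subtitleLink"}):
        link = result.find("a")
        details = result.find_next_sibling("div", attrs={"class": "tinyLink"})
        # Skip entries that do not match the expected structure.
        if link is None or details is None:
            continue
        if details.b is None or details.br is None:
            continue
        listings.append({
            "name": link.get_text(strip=True),
            # urljoin leaves absolute URLs unchanged and resolves relative ones.
            "url": urljoin(BASE_URL, link.get("href", "")),
            "denomination": text_after(details.b),
            "address": text_after(details.br),
        })
    return listings


for listing in extract_listings(SAMPLE_HTML):
    print(listing)
```

For the sample markup this should give the denomination "Sunni (Traditional)" and the address "45 Station Street (Sydney)". Stripping the text also removes the trailing non-breaking spaces that come from the &nbsp; entities.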
