Accessing untagged text using beautifulsoup


Question

I am using Python and beautifulsoup4 to extract some address information. More specifically, I need help retrieving non-US postal codes.

Consider the following HTML for a US-based company (already a soup object):

<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">999 State St Ste 100</span><br/>
<span class="locality">Salt Lake City,</span>
<span class="region">UT</span>
<span class="zip">84114-0002,</span>
<br/><span class="country-name">United States</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+1-000-000-000
                            </span><br/>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com" target="_blank">http://www.website.com</a></p>
</div>

</ul>
</div>

I can extract the zip code (84114-0002) with the following Python code:

class CompanyDescription:
    def __init__(self, page):
        self.data = page.find('div', attrs={'id': 'companyDescription'})


    def address(self):
        #TODO: Also retrieve the Zipcode for UK and German based addresses - tricky!
        address = {'street-address': '', 'locality': '', 'region': '', 'zip': '', 'country-name': ''}
        for key in address:
            try:
                adr = self.data.find('p', attrs={'id': 'adr'})
                if adr.find('span', attrs={'class': key}) is None:
                    address[key] = ''
                else:
                    address[key] = adr.find('span', attrs={'class': key}).text.split(',')[0]

                # Attempting to grab another zip code value
                if address['zip'] == '':
                    pass

            except:
                # We should return a dictionary with "" as key adr
                return address

        return address

You can see the placeholder I left at the line if address['zip'] == '':

The following two soup objects are giving me trouble. In the first, I would like to retrieve EC4N 4SA:

<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
                                    EC4N 4SA
                                    <span class="region">London</span>
<br/><span class="country-name">England</span>
</p>
<p>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com.com" target="_blank">http://www.website.com.com</a></p>
</div>
<p><strong>Line of Business</strong> <br/>Management services, nsk</p> 
</div>

In the second, I am interested in getting 71364:

<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
                                                71364
                                    <span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+00-1234567
                            </span><br/>
<span class="tel"><strong class="type">Fax: </strong>+00-1234567</span>
</p>
</div>
</div>

I am running this program over approximately 68,000 accounts, of which 28,000 are non-US based. I have pulled out only two examples for which I know the current method is not bulletproof. There may be other address formats where this script does not work as expected, but I believe figuring out the UK and German accounts will help tremendously.

Thanks in advance.

Recommended answer

Because the value is plain text sitting directly inside <p>, without a tag of its own, you can use

find_all(text=True, recursive=False) 

to get only the text nodes (no tags) that are direct children of <p>, skipping text inside nested tags (<span>). This returns a list containing your text plus some \n and spaces, so you can use join() to build one string and strip() to remove the surrounding \n and spaces.

data = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
                                    EC4N 4SA
                                    <span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''

from bs4 import BeautifulSoup as BS

soup = BS(data, 'html.parser').find('p')

print(''.join(soup.find_all(text=True, recursive=False)).strip())

Result: EC4N 4SA
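To see why join() and strip() are needed, it may help to look at the list before it is joined. The sketch below uses a shortened copy of the same markup. It spells the argument `string=True`, which is the name that Beautiful Soup 4.4.0 and later document for `text`, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''<p id="adr">
<span class="locality">London</span>
    EC4N 4SA
    <span class="region">London</span>
</p>'''

adr = BeautifulSoup(html, 'html.parser').find('p')

# Direct text nodes of <p> only; text inside the <span> tags is skipped.
pieces = adr.find_all(string=True, recursive=False)
print([str(piece) for piece in pieces])

# Most pieces are just the line breaks between tags; dropping the
# whitespace-only ones leaves the untagged text.
non_empty = [piece.strip() for piece in pieces if piece.strip()]
print(non_empty)  # ['EC4N 4SA']
```

If a page ever had more than one untagged fragment inside the same <p>, the filtered list would keep them separate, whereas ''.join(...) would glue them into one string.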

The same approach works for the second HTML snippet:

data = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
                                                71364
                                    <span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''

from bs4 import BeautifulSoup as BS

soup = BS(data, 'html.parser').find('p')

print(''.join(soup.find_all(text=True, recursive=False)).strip())

Result: 71364
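One way to fold this into the address() method from the question is sketched below. The names `untagged_text` and `parse_address` are hypothetical, and the sketch assumes that any text sitting directly inside <p id="adr"> is the postal code. That holds for the two examples above but may not hold for every address format in a large data set, so the result is probably worth spot-checking.

```python
from bs4 import BeautifulSoup

FIELDS = ('street-address', 'locality', 'region', 'zip', 'country-name')


def untagged_text(tag):
    """Return the text directly inside `tag`, ignoring text in child tags."""
    pieces = tag.find_all(string=True, recursive=False)
    # split() + join() collapses newlines and runs of spaces.
    return ' '.join(''.join(pieces).split())


def parse_address(adr):
    """Build the address dict from a <p id="adr"> tag (hypothetical helper)."""
    address = {key: '' for key in FIELDS}
    for key in FIELDS:
        span = adr.find('span', class_=key)
        if span is not None:
            # Same clean-up as in the question: drop a trailing comma.
            address[key] = span.get_text().split(',')[0].strip()
    # Assumption: when there is no <span class="zip">, the untagged text
    # inside the paragraph is the postal code.
    if not address['zip']:
        address['zip'] = untagged_text(adr)
    return address


german_html = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
    71364
    <span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''

adr = BeautifulSoup(german_html, 'html.parser').find('p', id='adr')
result = parse_address(adr)
print(result['zip'])  # 71364
```

US addresses still take the <span class="zip"> path, so the fallback only runs when that span is missing.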
