BeautifulSoup: extract text from anchor tag


Question

I want to extract:

  • the value of the src attribute of the image tag, and
  • the text of the anchor tag that is inside the div with class data

I managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (link) and the title inside the div class=data. So, for example:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

should yield:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

Recommended answer

All of the answers above helped me construct my own, so I upvoted every one of them. In the end, though, I put together my own answer to the exact problem I was dealing with.

As the question states, I had to access some siblings and their children in the DOM structure. This solution iterates over the images in the DOM, constructs an image name from the product title, and saves the image to a local directory.

import os
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests

def getImages(url):
    # Download the results page and parse it
    r = requests.get(url)
    html = r.text
    soup = bs(html)
    # expand '~' explicitly; os.path.join does not do it
    output_folder = os.path.expanduser('~/amazon')
    if not os.path.isdir(output_folder):
        os.makedirs(output_folder)
    # each product image sits in a <div class="image">
    for div in soup.findAll('div', attrs={'class':'image'}):
        try:
            # getting the data div using findNext
            nextDiv = div.findNext('div', attrs={'class':'data'})
            # use findNext again on that div to get to the anchor tag
            fileName = nextDiv.findNext('a').text
            imageUrl = div.find('img')['src']
        except (AttributeError, TypeError, KeyError):
            # no data div, anchor, img tag or src attribute was found
            print 'skip'
            continue
        # build the file name from the product title
        modified_file_name = fileName.replace(' ','-') + '.jpg'
        outputPath = os.path.join(output_folder, modified_file_name)
        urlretrieve(imageUrl, outputPath)

if __name__=='__main__':
    url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
    getImages(url)
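
The code above targets Python 2 and BeautifulSoup 3. For readers on Python 3, here is a rough sketch of the same sibling traversal, assuming the bs4 package (Beautiful Soup 4) is installed. The sample markup, the example.com URLs, and the names extract_products and title_to_filename are made up for illustration and are not part of the original answer. It also uses find_next_sibling rather than findNext, so it only pairs an image div with a data div that shares the same parent.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question: an image div followed by a
# data div. The last image div has no data div after it and is skipped.
SAMPLE_HTML = """
<div class="image"><img src="http://example.com/nikon.jpg"></div>
<div class="data">
  <a class="title" href="http://example.com/nikon">Nikon COOLPIX L26 (Red)</a>
</div>
<div class="image"><img src="http://example.com/orphan.jpg"></div>
"""


def extract_products(html):
    """Return (title, image_src) pairs for every image div that has a title."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for image_div in soup.find_all("div", class_="image"):
        img = image_div.find("img")
        # The next sibling <div class="data"> holds the title anchor.
        data_div = image_div.find_next_sibling("div", class_="data")
        anchor = data_div.find("a", class_="title") if data_div else None
        if img is None or anchor is None or not img.get("src"):
            continue  # incomplete entry, nothing to pair up
        products.append((anchor.get_text(strip=True), img["src"]))
    return products


def title_to_filename(title, extension=".jpg"):
    """Collapse runs of characters that are awkward in file names into '-'."""
    stem = re.sub(r"[^A-Za-z0-9._]+", "-", title).strip("-")
    return stem + extension


if __name__ == "__main__":
    for title, src in extract_products(SAMPLE_HTML):
        print(title_to_filename(title), src)
```

Checking for None before indexing avoids the try/except, and the regular expression replaces parentheses and slashes as well as spaces, which a plain replace(' ', '-') would leave in the file name.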
