BeautifulSoup: extract text from anchor tag

Question

I want to extract:

  • the src of the image tag, and
  • the text of the anchor tag inside the div with class data

I managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

Here is the entire HTML page.

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (link) and the title inside the div class=data, so for example:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

should extract:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

Answer

All of the answers above really helped me construct my own, so I upvoted every answer that other users posted. In the end, though, I put together my own answer to the exact problem I was dealing with:

As the question makes clear, I had to access some siblings and their children in the DOM structure. This solution iterates over the images in the DOM, constructs an image name from the product title, and saves each image to a local directory.

import os
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests

def getImages(url):
    # Download the page and parse it
    r = requests.get(url)
    html = r.text
    soup = bs(html)
    # Expand '~' so the images land in the user's home directory
    output_folder = os.path.expanduser('~/amazon')
    if not os.path.isdir(output_folder):
        os.makedirs(output_folder)
    # Each product image sits in a div with class "image"
    for div in soup.findAll('div', attrs={'class':'image'}):
        try:
            # getting the data div using findNext
            nextDiv = div.findNext('div', attrs={'class':'data'})
            # use findNext again on the previous object to get to the anchor tag
            fileName = nextDiv.findNext('a').text
            modified_file_name = fileName.replace(' ','-') + '.jpg'
            imageUrl = div.find('img')['src']
        except (AttributeError, TypeError, KeyError):
            # The data div, the anchor, or the img src is missing: skip this div
            print 'skip'
            continue
        outputPath = os.path.join(output_folder, modified_file_name)
        urlretrieve(imageUrl, outputPath)

if __name__=='__main__':
    url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
    getImages(url)
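
The script above targets Python 2 and BeautifulSoup 3. For readers on Python 3, the same idea can be sketched with the bs4 package. This is only an illustration under stated assumptions, not the answer author's code: the sample markup, the helper names extract_products and build_output_path, and the example.com URLs are all hypothetical, and it uses find_next_sibling (as the question's code did) rather than findNext. It parses an inline HTML string instead of fetching the Amazon page, and it only builds the output path rather than downloading anything.

```python
import os
import re

from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the question:
# an "image" div followed by a sibling "data" div holding the title anchor.
SAMPLE_HTML = """
<div class="product">
  <div class="image"><a href="#"><img src="http://example.com/l26.jpg"></a></div>
  <div class="data">
    <h3><a class="title" href="#">Nikon COOLPIX L26 (Red)</a></h3>
  </div>
</div>
<div class="product">
  <div class="image"><img src="http://example.com/orphan.jpg"></div>
</div>
"""


def extract_products(html):
    """Return a list of (image_src, title) pairs found in html."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for image_div in soup.find_all("div", class_="image"):
        img = image_div.find("img")
        # Look only among siblings that share the image div's parent.
        data_div = image_div.find_next_sibling("div", class_="data")
        anchor = data_div.find("a", class_="title") if data_div else None
        # Skip entries that lack an image, a src attribute, or a title anchor.
        if img is None or anchor is None or not img.get("src"):
            continue
        products.append((img["src"], anchor.get_text(strip=True)))
    return products


def build_output_path(title, output_folder="~/amazon"):
    """Turn a product title into a .jpg path inside output_folder."""
    # Replace every run of characters outside [A-Za-z0-9._-] with one hyphen.
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "-", title).strip("-")
    return os.path.join(os.path.expanduser(output_folder), safe_name + ".jpg")


if __name__ == "__main__":
    for src, title in extract_products(SAMPLE_HTML):
        print(src, "->", build_output_path(title))
```

Restricting the lookup to siblings means an image div without its own data div is skipped instead of borrowing the title of a later product, which a document-order search such as findNext could do. To actually save the files, each (src, path) pair could be passed to urllib.request.urlretrieve after creating the folder with os.makedirs.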
