Python: BeautifulSoup extract text from anchor tag


Question


I want to extract the src of the image tag and the text of the anchor tag that sits inside the div with class data.

I managed to extract the img src, but I am having trouble extracting the text from the anchor tag.

<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

Here is the link for the entire HTML page

Here is my code

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (the link) and the title inside the div with class=data.

For example, given:

 <a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

I want to extract: Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

Solution

All of the answers above helped me build my own, so I upvoted each of them. In the end I put together an answer for the exact problem I was dealing with:

As the question states, I had to access some siblings and their children in the DOM structure. This solution iterates over the images in the DOM, builds each image's file name from the product title, and saves the image to a local directory.

import os
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests

def getImages(url):
    # Download the search results page
    r = requests.get(url)
    html = r.text
    soup = bs(html)
    # expand '~' so the path points at the user's home directory
    output_folder = os.path.expanduser('~/amazon')
    if not os.path.isdir(output_folder):
        os.makedirs(output_folder)
    # iterate over the divs that hold the images
    for div in soup.findAll('div', attrs={'class':'image'}):
        try:
            # getting the data div using findNext
            nextDiv = div.findNext('div', attrs={'class':'data'})
            # use findNext again on that div to get to the anchor tag
            fileName = nextDiv.findNext('a').text
            imageUrl = div.find('img')['src']
        except (AttributeError, TypeError):
            # the data div, anchor or img is missing: nothing to save here
            print 'skip'
            continue
        modified_file_name = fileName.replace(' ', '-') + '.jpg'
        outputPath = os.path.join(output_folder, modified_file_name)
        urlretrieve(imageUrl, outputPath)

if __name__=='__main__':
    url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
    getImages(url)
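
The code above targets Python 2 and BeautifulSoup 3. For readers on Python 3, the same traversal could look roughly like the sketch below, which assumes the `bs4` package (BeautifulSoup 4) is installed. It is not the original author's code: `SAMPLE_HTML`, the `example.com` image URLs, and the helpers `extract_products` and `title_to_filename` are hypothetical names made up for illustration, and the markup is only a guess at the structure described in the question (a div with class image followed by a sibling div with class data). The sketch parses an inline string, so it needs no network access, and it leaves the download step out.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure described in the question:
# each div.image is followed by a sibling div.data that holds a.title.
SAMPLE_HTML = """
<div class="result">
  <div class="image"><img src="http://example.com/img/nikon.jpg"></div>
  <div class="data">
    <a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
  </div>
</div>
<div class="result">
  <div class="image"><img src="http://example.com/img/orphan.jpg"></div>
</div>
"""


def extract_products(html):
    """Return (image_src, title) pairs for image divs that have a data sibling."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for image_div in soup.find_all("div", class_="image"):
        # find_next_sibling returns a single Tag or None, not a list
        data_div = image_div.find_next_sibling("div", class_="data")
        img = image_div.find("img")
        anchor = data_div.find("a", class_="title") if data_div else None
        if img is None or anchor is None or not img.get("src"):
            continue  # incomplete entry: skip it instead of raising
        products.append((img["src"], anchor.get_text(strip=True)))
    return products


def title_to_filename(title):
    """Build a simple .jpg file name from a product title (hypothetical helper)."""
    # collapse every run of characters other than letters, digits and dots
    safe = re.sub(r"[^A-Za-z0-9.]+", "-", title).strip("-")
    return safe + ".jpg"


if __name__ == "__main__":
    for src, title in extract_products(SAMPLE_HTML):
        print(src, "->", title_to_filename(title))
```

Checking for `None` before reading the anchor or the `src` attribute replaces the try/except in the original, and replacing every non-alphanumeric run (rather than only spaces) keeps characters such as `/` or parentheses out of the file name.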
