Iterating html through tag classes with BeautifulSoup


Question

I'm saving some specific tags from a webpage to an Excel file, so I have this code:

`import requests
from bs4 import BeautifulSoup
import openpyxl

url = "http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")

wb = openpyxl.Workbook()
ws = wb.active

tagiterator = soup.h2

row, col = 1, 1
ws.cell(row=row, column=col, value=tagiterator.getText())
tagiterator = tagiterator.find_next()

while tagiterator.find_next():
    if tagiterator.name == 'h2':
        row += 1
        col = 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    elif tagiterator.name == 'span':
        col += 1
        ws.cell(row=row, column=col, value=tagiterator.getText(strip=True))
    tagiterator = tagiterator.find_next()

wb.save('DG3test.xlsx')`

It works, but I want to exclude some tags. I only want the h2 tags that have the 'product-name' class and the span tags that have the 'attribute-value' class. I tried to do this with:

tagiterator['class'] == 'product-name'

tagiterator.hasClass('product-name')

tagiterator.get

And a few more attempts, which didn't work either.

The values I want are visible in this rough image I created: https://ibb.co/eWLsoQ and the url is in the code.

Recommended answer

What I did not include is writing the data to an Excel file; hopefully that's something you can do. If not, just leave a comment and I'll include the code for it. The same logic applies: write the product information, increment the row with row += 1, and reset the column (why? so that each product stays within a single row :). That's something you've already done.

from bs4 import BeautifulSoup

import requests

header = {'User-agent' : 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}


url = requests.get("http://www.euro.com.pl/telewizory-led-lcd-plazmowe,strona-1.bhtml", headers=header).text
soup = BeautifulSoup(url, 'lxml')

find_products = soup.findAll('div',{'class':'product-row'})

for item in find_products:
    title_text = item.find('div',{'class':'product-header'}).h2.a.text.strip() # Title / name of the product
    # print(title_text)
    display = item.find('span',{'class':'attribute-value'}).text.strip() # First attribute value, e.g. the text "49 cali, Full HD, 1920 x 1080"
    # print(display)
    functions_item = item.findAll('span',{'class':'attribute-value'})[1] # Second attribute value: the functions ('Funkcje')
    list_of_funcs = functions_item.findAll('a') # Links listing the individual functions, e.g. wifi
    # Now you can store them or process them further...

    for funcs in list_of_funcs:
        print(funcs.text.strip())
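
Writing the results to Excel was left out of the code above. Here is a minimal sketch of that step, assuming the scraped data has already been collected into a list of rows. The `rows` values and the `write_rows` helper are made up for illustration. Each product gets its own row, and the column restarts at 1 for every product.

```python
import openpyxl


def write_rows(ws, rows):
    """Write each inner list to its own worksheet row, starting at column 1."""
    for row_idx, values in enumerate(rows, start=1):
        for col_idx, value in enumerate(values, start=1):
            ws.cell(row=row_idx, column=col_idx, value=value)


# Hypothetical data standing in for the scraped product information.
rows = [
    ["TV One", "49 cali, Full HD", "Wi-Fi"],
    ["TV Two", "32 cale, HD Ready"],
]

wb = openpyxl.Workbook()
ws = wb.active
write_rows(ws, rows)
# wb.save("products.xlsx")  # uncomment to write the file; the name is arbitrary
```

Using `enumerate(..., start=1)` replaces the manual row and col counters, since openpyxl rows and columns are 1-based.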

Algorithm:

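  1. We find each product
  2. Within each product, we find the tags and extract the relevant information
  3. We use .text to extract only the text portion
  4. We use a for loop to iterate over each product, and then over its functions, i.e. the tags containing the product's functions.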
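
The steps above can be sketched on a small inline HTML snippet, so the example runs without a network request. The markup is a made-up stand-in for the shop page (the real page structure may differ), and `extract_products` is a hypothetical helper name. It uses `find_all` with the `class_` keyword, which matches a tag when the given class is one of its classes.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real page structure is an assumption.
SAMPLE_HTML = """
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a href="#">TV One</a></h2></div>
  <span class="attribute-name">Ekran:</span>
  <span class="attribute-value">49 cali, Full HD</span>
  <span class="attribute-value">Wi-Fi, USB</span>
</div>
<div class="product-row">
  <div class="product-header"><h2 class="product-name"><a href="#">TV Two</a></h2></div>
  <span class="attribute-value">32 cale, HD Ready</span>
</div>
"""


def extract_products(html):
    """Return one list per product: [name, attribute value, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for product in soup.find_all("div", class_="product-row"):
        # Only h2 tags carrying the 'product-name' class are considered.
        name = product.find("h2", class_="product-name")
        if name is None:
            continue
        # Only span tags carrying the 'attribute-value' class are kept.
        values = product.find_all("span", class_="attribute-value")
        rows.append([name.get_text(strip=True)]
                    + [v.get_text(strip=True) for v in values])
    return rows


products = extract_products(SAMPLE_HTML)
for row in products:
    print(row)
```

When walking tags one at a time instead, note that `tag['class']` is a list such as `['product-name']`, so comparing it to a string with `==` is always false. A membership test like `'product-name' in tag.get('class', [])` handles both that and tags with no class at all.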
