Python: How to access and iterate over a list of div class elements using BeautifulSoup


Question


I'm parsing data about car production with BeautifulSoup (see also my first question):

from bs4 import BeautifulSoup
import string

html = """
<h4>Production Capacity (year)</h4>
    <div class="profile-area">
      Vehicle 1,140,000 units /year
    </div>
<h4>Output</h4>
    <div class="profile-area">
      Vehicle 809,000 units ( 2016 ) 
    </div>
    <div class="profile-area">
      Vehicle 815,000 units ( 2015 ) 
    </div>
    <div class="profile-area">
      Vehicle 836,000 units ( 2014 ) 
    </div>
    <div class="profile-area">
      Vehicle 807,000 units ( 2013 ) 
    </div>
    <div class="profile-area">
      Vehicle 760,000 units ( 2012 ) 
    </div>
    <div class="profile-area">
      Vehicle 805,000 units ( 2011 ) 
    </div>
"""
soup = BeautifulSoup(html, 'lxml')

for item in soup.select("div.profile-area"):
  produkz = item.text.strip()
  produkz = produkz.replace('\n',':')

  prev_h4 = str(item.find_previous_sibling('h4'))
  if "Models" in prev_h4:
    models=produkz
  else:
    models=""

  if "Capacity" in prev_h4:
    capacity=produkz
  else:
    capacity=""

  if "( 2015 )" in produkz:
    prod15=produkz
  else:
    prod15=""
  if "( 2016 )" in produkz:
    prod16=produkz
  else:
    prod16=""
  if "( 2017 )" in produkz:
    prod17=produkz
  else:
    prod17=""

  print(models+';'+capacity+';'+prod15+';'+prod16+';'+prod17)

My problem is that each pass of the loop over the matching elements ("div.profile-area") overwrites the result of the previous one:

;Vehicle 1,140,000 units /year;;;;;;
;;;;;;Vehicle 809,000 units ( 2016 );
;;;;;Vehicle 815,000 units ( 2015 );;
;;;;Vehicle 836,000 units ( 2014 );;;
;;;Vehicle 807,000 units ( 2013 );;;;
;;Vehicle 760,000 units ( 2012 );;;;;
;;;;;;;

The result I want is:

;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 );


I would be glad if you could show me a better way to structure my code. Thanks in advance.

Recommended answer

I would suggest storing each entry in a dictionary; you can then easily extract the fields you want at the end (you don't seem to want 2011?):

from bs4 import BeautifulSoup
import re

html = """
<h4>Production Capacity (year)</h4>
    <div class="profile-area">
      Vehicle 1,140,000 units /year
    </div>
<h4>Output</h4>
    <div class="profile-area">
      Vehicle 809,000 units ( 2016 ) 
    </div>
    <div class="profile-area">
      Vehicle 815,000 units ( 2015 ) 
    </div>
    <div class="profile-area">
      Vehicle 836,000 units ( 2014 ) 
    </div>
    <div class="profile-area">
      Vehicle 807,000 units ( 2013 ) 
    </div>
    <div class="profile-area">
      Vehicle 760,000 units ( 2012 ) 
    </div>
    <div class="profile-area">
      Vehicle 805,000 units ( 2011 ) 
    </div>
"""

soup = BeautifulSoup(html, 'lxml')
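units = {}  # maps a section keyword or a year to the text of its entry

# Walk the headings and divs in document order, so that each div is
# handled under the most recently seen <h4>.
for item in soup.find_all(['h4', 'div']):
    if item.name == 'h4':
        # After the break, h4 holds the keyword found in this heading.
        # If no keyword matches, h4 is left as the last candidate ('models').
        for h4 in ['capacity', 'output', 'models']:
            if h4 in item.text.lower():
                break
    elif item.get('class', [''])[0] == 'profile-area':
        vehicle = item.get_text(strip=True)

        if h4 == 'output':
            # Output entries are keyed by the year written as "( YYYY )".
            re_year = re.search(r'\( (\d+) \)', vehicle)

            if re_year:
                year = re_year.group(1)
            else:
                year = 'unknown'

            units[year] = vehicle
        else:
            # Any other section is keyed by its heading keyword.
            units[h4] = vehicle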

req_fields = ['models', 'capacity', '2012', '2013', '2014', '2015', '2016']            
print(';'.join([units.get(field, '') for field in req_fields]))

This displays:

;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 )


A regular expression is used to extract the year from the vehicle entry. This is then used as the key in the dictionary.
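
As a minimal sketch of that step in isolation, the snippet below pulls the year out of a few sample entries and uses it as a dictionary key. The helper name `extract_year` and the slightly more tolerant pattern (optional whitespace, exactly four digits) are assumptions made for illustration; they are not part of the answer's code.

```python
import re


def extract_year(entry):
    """Return the year written as '( YYYY )' in entry, or 'unknown'.

    Hypothetical helper for illustration only.
    """
    match = re.search(r'\(\s*(\d{4})\s*\)', entry)
    return match.group(1) if match else 'unknown'


entries = [
    'Vehicle 809,000 units ( 2016 )',
    'Vehicle 815,000 units ( 2015 )',
    'Vehicle 1,140,000 units /year',  # no year, so it falls back to 'unknown'
]

# Key each entry by its year so it can be looked up later.
by_year = {extract_year(entry): entry for entry in entries}
print(by_year['2016'])
```

Entries without a recognisable year all share the key 'unknown', so only the last of them would be kept; that matches the fallback behaviour of the answer's code.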


For the HTML in pastebin it gives:

Volkswagen Golf, Golf Variant(Estate), Golf Plus, CrossGolf (2006-), e-Golf (2014-)Volkswagen Touran, CrossTouran (2007-), Tiguan (2007-);I.D. electric vehicles based on MEB (planning);SEAT new SUV MQB-A2 platform (2018- planning);Components:press shop, chassis, plastics technology;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 )
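
The empty leading field in the output for the sample HTML comes from how the final row is assembled: `units.get(field, '')` returns an empty string for any requested field that was never stored. The sketch below shows this on a small hand-written dictionary; the sample values and the extra '2017' field are assumptions chosen to illustrate missing keys.

```python
# Hand-written sample data standing in for the parsed result.
units = {
    'capacity': 'Vehicle 1,140,000 units /year',
    '2015': 'Vehicle 815,000 units ( 2015 )',
    '2016': 'Vehicle 809,000 units ( 2016 )',
}

# 'models' and '2017' are deliberately absent from units.
req_fields = ['models', 'capacity', '2015', '2016', '2017']

# dict.get() substitutes '' for missing keys, so every requested
# field still occupies its own column in the joined row.
row = ';'.join(units.get(field, '') for field in req_fields)
print(row)
```

Because the column positions are fixed by `req_fields` rather than by the order in which entries were parsed, the row layout stays the same when a page lacks one of the sections.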
