Parse HTML table data with BeautifulSoup into a dict

Question

I am trying to use BeautifulSoup to parse the information stored in an HTML table and store it in a dict. I've been able to get to the table and iterate through the values, but there is still a lot of junk in the result that I'm not sure how to handle.

import requests
from bs4 import BeautifulSoup

# load the HTML file
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, "html.parser")

# navigate to the item attributes table
table = soup.find('div', 'itemAttr')

# iterate through the attribute information
attr = []
for i in table.find_all("tr"):
    attr.append(i.text.strip().replace('\t', ''))

With this method, this is what the data looks like. As you can see, there is a lot of junk in there, and some lines contain multiple items, such as Year and VIN.

[u'Condition:\nUsed',
 u'Seller Notes:\n\u201cExcellent Condition\u201d',
 u'Year: \n\n2015\n\n VIN (Vehicle Identification Number): \n\n2G1FJ1EW2F9192023',
 u'Mileage: \n\n29,000\n\n Transmission: \n\nManual',
 u'Make: \n\nChevrolet\n\n Body Type: \n\nCoupe',
 u'Model: \n\nCamaro\n\n Warranty: \n\nVehicle has an existing warranty',
 u'Trim: \n\nSS Coupe 2-Door\n\n Vehicle Title: \n\nClear',
 u'Engine: \n\n6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated\n\n Options: \n\nLeather Seats',
 u'Drive Type: \n\nRWD\n\n Safety Features: \n\nAnti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
 u'Power Options: \n\nAir Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats\n\n Sub Model: \n\n1LE',
 u'Fuel Type: \n\nGasoline\n\n Color: \n\nWhite',
 u'For Sale By: \n\nPrivate Seller\n\n Interior Color: \n\nBlack',
 u'Disability Equipped: \n\nNo\n\n Number of Cylinders: \n\n8',
 u'']

Ultimately, I want the data stored in a dictionary like the one below. I know how to create a dictionary, but I don't know how to clean up the data that goes into it without brute-force find-and-replace.

{'Condition' : 'Used',
 'Seller Notes' : 'Excellent Condition',
 'Year': '2015',
 'VIN (Vehicle Identification Number)': '2G1FJ1EW2F9192023',
 'Mileage': '29,000', 
 'Transmission': 'Manual',
 'Make': 'Chevrolet', 
 'Body Type': 'Coupe',
 'Model': 'Camaro', 
 'Warranty': 'Vehicle has an existing warranty',
 'Trim': 'SS Coupe 2-Door',
 'Vehicle Title' : 'Clear',
 'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated', 
 'Options': 'Leather Seats',
 'Drive Type': 'RWD', 
 'Safety Features' : 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
 'Power Options' : 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats',
 'Sub Model' : '1LE',
 'Fuel Type' : 'Gasoline', 
 'Exterior Color' : 'White',
 'For Sale By' : 'Private Seller', 
 'Interior Color' : 'Black',
 'Disability Equipped' : 'No', 
 'Number of Cylinders': '8'}

Recommended answer

Rather than trying to parse the data out of the tr elements, a better approach is to iterate over the td.attrLabels elements. You can use each label as a key and its adjacent sibling element as the value.

In the example below, the CSS selector div.itemAttr td.attrLabels selects all td elements with the attrLabels class that are descendants of div.itemAttr. From there, the .find_next_sibling() method finds the adjacent sibling element.

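import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')  # the 'lxml' parser requires the lxml package

# each label cell becomes a key; the cell right after it becomes the value
data = []
for label in soup.select('div.itemAttr td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })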

Output:

> [{'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
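
Since the question asks for one dictionary rather than a list of one-item dictionaries, the same loop can fill a single dict directly. The following is only a minimal sketch under stated assumptions: SAMPLE_HTML is a hypothetical stand-in for the item attributes markup (the live page may look different), labels_to_dict is an illustrative helper name, and the built-in html.parser backend is used so the example runs without network access or lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the item attributes markup; the live page may differ.
SAMPLE_HTML = """
<div class="itemAttr">
  <table>
    <tr>
      <td class="attrLabels"> Year: </td><td><span>2015</span></td>
      <td class="attrLabels"> Mileage: </td><td><span>29,000</span></td>
    </tr>
    <tr>
      <td class="attrLabels"> Make: </td><td><span>Chevrolet</span></td>
      <td class="attrLabels"> Body Type: </td><td><span>Coupe</span></td>
    </tr>
  </table>
</div>
"""

def labels_to_dict(html):
    """Map each td.attrLabels cell to the text of the td that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for label in soup.select("div.itemAttr td.attrLabels"):
        value_cell = label.find_next_sibling("td")
        if value_cell is None:  # a label without a value cell is skipped
            continue
        # Trim surrounding whitespace, then drop the trailing colon from the key.
        key = label.get_text(strip=True).rstrip(":")
        result[key] = value_cell.get_text(strip=True)
    return result

attributes = labels_to_dict(SAMPLE_HTML)
print(attributes)
```

Passing "td" to find_next_sibling() and checking for None is a defensive choice for rows where a label has no value cell; if two labels share the same text, the later one overwrites the earlier one in the dict.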


If you also want to retrieve the table header th elements, you could select the table element and then use the CSS selector th, td.attrLabels to retrieve both kinds of label:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

# th cells hold the header labels, td.attrLabels cells hold the rest
data = []
for label in table.select('th, td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })

Output:

> [{'Condition:': 'Used'}, {'Seller Notes:': '"Excellent Condition"'}, {'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]


If you want to strip non-word characters from the keys (anything other than letters, digits, and underscores, so spaces are removed too), you could use:

import re

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

data = []
for label in table.select('th, td.attrLabels'):
    # \W+ matches runs of non-word characters, including spaces and the colon
    key = re.sub(r'\W+', '', label.text.strip())
    value = label.find_next_sibling().text.strip()

    data.append({ key: value })

Output:

> [{'Condition': 'Used'}, {'SellerNotes': '"Excellent Condition"'}, {'Year': '2015'}, {'VINVehicleIdentificationNumber': '2G1FJ1EW2F9192023'}, {'Mileage': '29,000'}, {'Transmission': 'Manual'}, {'Make': 'Chevrolet'}, {'BodyType': 'Coupe'}, {'Model': 'Camaro'}, {'Warranty': 'Vehicle has an existing warranty'}, {'Trim': 'SS Coupe 2-Door'}, {'VehicleTitle': 'Clear'}, {'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options': 'Leather Seats'}, {'DriveType': 'RWD'}, {'SafetyFeatures': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'PowerOptions': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'SubModel': '1LE'}, {'FuelType': 'Gasoline'}, {'ExteriorColor': 'White'}, {'ForSaleBy': 'Private Seller'}, {'InteriorColor': 'Black'}, {'DisabilityEquipped': 'No'}, {'NumberofCylinders': '8'}]
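
Because \W+ also removes the spaces inside a label, 'Seller Notes:' becomes 'SellerNotes', whereas the dictionary in the question keeps the spaces and has no quotes around the Seller Notes value. One possible alternative, sketched here with hypothetical helper names (clean_key, merge_rows) and a few rows copied from the output above, is to trim only the trailing colon, strip wrapping quotes from the values, and fold the one-item dictionaries into a single dict.

```python
import re

def clean_key(label):
    """Drop a trailing colon and collapse internal whitespace, keeping spaces."""
    return re.sub(r"\s+", " ", label).strip().rstrip(":").strip()

def merge_rows(rows):
    """Fold a list of one-item dicts into a single dict with cleaned keys."""
    merged = {}
    for row in rows:
        for label, value in row.items():
            # Strip straight or curly quotes wrapping values such as Seller Notes.
            merged[clean_key(label)] = value.strip('"\u201c\u201d')
    return merged

rows = [
    {"Condition:": "Used"},
    {"Seller Notes:": '"Excellent Condition"'},
    {"VIN (Vehicle Identification Number):": "2G1FJ1EW2F9192023"},
]
result = merge_rows(rows)
print(result)
```

This keeps parentheses and spaces in keys such as 'VIN (Vehicle Identification Number)'. Whether that is preferable to the compact keys produced by \W+ depends on how the dictionary will be used afterwards.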
