使用BeautifulSoup将HTML表数据解析为字典 [英] Parse HTML table data with BeautifulSoup into a dict
问题描述
我正在尝试使用BeautifulSoup
解析存储在HTML表中的信息,并将其存储到字典中.我已经能够到达表,并遍历值,但是表中仍然有很多垃圾,我不确定该如何处理.
I am trying to use BeautifulSoup
to parse the information stored in an HTML table and store it into a dict. I've been able to get to the table, and iterate through the values, but there is still a lot of junk in the table that I'm not sure how to take care of.
# load the HTML file
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, "html.parser")
# navigate to the item attributes table
table = soup.find('div', 'itemAttr')
# iterate through the attribute information
attr = []
for i in table.findAll("tr"):
attr.append(i.text.strip().replace('\t', ''))
使用这种方法,数据就是这样.如您所见,其中有很多垃圾,有些行包含Year和VIN之类的多个项目.
With this method, this is what the data looks like. As you you see, there is a lot of junk in there, and some lines contain multiple items like Year and VIN.
[u'Condition:\nUsed',
u'Seller Notes:\n\u201cExcellent Condition\u201d',
u'Year: \n\n2015\n\n VIN (Vehicle Identification Number): \n\n2G1FJ1EW2F9192023',
u'Mileage: \n\n29,000\n\n Transmission: \n\nManual',
u'Make: \n\nChevrolet\n\n Body Type: \n\nCoupe',
u'Model: \n\nCamaro\n\n Warranty: \n\nVehicle has an existing warranty',
u'Trim: \n\nSS Coupe 2-Door\n\n Vehicle Title: \n\nClear',
u'Engine: \n\n6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated\n\n Options: \n\nLeather Seats',
u'Drive Type: \n\nRWD\n\n Safety Features: \n\nAnti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
u'Power Options: \n\nAir Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats\n\n Sub Model: \n\n1LE',
u'Fuel Type: \n\nGasoline\n\n Color: \n\nWhite',
u'For Sale By: \n\nPrivate Seller\n\n Interior Color: \n\nBlack',
u'Disability Equipped: \n\nNo\n\n Number of Cylinders: \n\n8',
u'']
最终,我希望将数据存储在如下所示的字典中.我知道如何创建字典,但不知道如何在不进行暴力查找和替换的情况下清理需要放入字典中的数据.
Ultimately, I want the data to be stored in a dictionary like below. I know how to create a dictionary, but don't know how to clean up the data that needs to go into the dictionary without brute force find-and-replace.
{'Condition' : 'Used',
'Seller Notes' : 'Excellent Condition',
'Year': '2015',
'VIN (Vehicle Identification Number)': '2G1FJ1EW2F9192023',
'Mileage': '29,000',
'Transmission': 'Manual',
'Make': 'Chevrolet',
'Body Type': 'Coupe',
'Model': 'Camaro',
'Warranty': 'Vehicle has an existing warranty',
'Trim': 'SS Coupe 2-Door',
'Vehicle Title' : 'Clear',
'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated',
'Options': 'Leather Seats',
'Drive Type': 'RWD',
'Safety Features' : 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
'Power Options' : 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats',
'Sub Model' : '1LE',
'Fuel Type' : 'Gasoline',
'Exterior Color' : 'White',
'For Sale By' : 'Private Seller',
'Interior Color' : 'Black',
'Disability Equipped' : 'No',
'Number of Cylinders': '8'}
推荐答案
与其尝试从tr
元素中解析出数据,不如尝试对td.attrLabels
数据元素进行迭代.您可以将这些标签用作键,然后将相邻的同级元素用作值.
Rather than trying to parse out the data from the tr
elements, a better approach would be to iterate over the td.attrLabels
data elements. You can use these labels as the key, and then use the adjacent sibling elements as the value.
在下面的示例中,CSS选择器div.itemAttr td.attrLabels
用于选择具有.attrLabels
类的所有td
元素,这些元素是div.itemAttr
的后代.从那里开始,方法 .find_next_sibling()
用于查找相邻的同级元素.
In the example below, the CSS selector div.itemAttr td.attrLabels
is used to select all td
elements with .attrLabels
classes that are descendants of the div.itemAttr
. From there, the method .find_next_sibling()
is used to find the adjacent sibling element.
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
data = []
for label in soup.select('div.itemAttr td.attrLabels'):
data.append({ label.text.strip(): label.find_next_sibling().text.strip() })
输出:
> [{'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
如果您还想检索表头th
元素,则可以选择表元素,然后使用CSS选择器th, td.attrLabels
来检索两个标签:
If you also want to retrieve the table header th
elements, then you could select the table element and then use the CSS selector th, td.attrLabels
in order to retrieve both labels:
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')
data = []
for label in table.select('th, td.attrLabels'):
data.append({ label.text.strip(): label.find_next_sibling().text.strip() })
输出:
> [{'Condition:': 'Used'}, {'Seller Notes:': '"Excellent Condition"'}, {'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
如果要删除键的非字母数字字符,则可以使用:
If you want to strip out non-alphanumeric character(s) for the keys, then you could use:
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')
data = []
for label in table.select('th, td.attrLabels'):
key = re.sub(r'\W+', '', label.text.strip())
value = label.find_next_sibling().text.strip()
data.append({ key: value })
输出:
> [{'Condition': 'Used'}, {'SellerNotes': '"Excellent Condition"'}, {'Year': '2015'}, {'VINVehicleIdentificationNumber': '2G1FJ1EW2F9192023'}, {'Mileage': '29,000'}, {'Transmission': 'Manual'}, {'Make': 'Chevrolet'}, {'BodyType': 'Coupe'}, {'Model': 'Camaro'}, {'Warranty': 'Vehicle has an existing warranty'}, {'Trim': 'SS Coupe 2-Door'}, {'VehicleTitle': 'Clear'}, {'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options': 'Leather Seats'}, {'DriveType': 'RWD'}, {'SafetyFeatures': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'PowerOptions': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'SubModel': '1LE'}, {'FuelType': 'Gasoline'}, {'ExteriorColor': 'White'}, {'ForSaleBy': 'Private Seller'}, {'InteriorColor': 'Black'}, {'DisabilityEquipped': 'No'}, {'NumberofCylinders': '8'}]
这篇关于使用BeautifulSoup将HTML表数据解析为字典的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!