Using BeautifulSoup to extract specific dl and dd list elements

Views: 1580 Published: 2018/6/19 14:09:19 Tags: python html beautifulsoup extract


Problem description


This is my first time posting. I am using BeautifulSoup 4 and Python 2.7 (in PyCharm). I have a webpage containing several dl lists, and I need to extract the specific dd elements whose dt labels are either 'Salary:' or 'Date:'.

The problem: I cannot seem to identify and extract specific text. I have searched this site and tried without success.

Example HTML:

    <dl><dt>Date:</dt><dd>13 September 2015</dd><dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>
    <dl><dt>Date:</dt><dd>15 December 2015</dd><dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl>
    <dl><dt>Date:</dt><dd>10 January 2014</dd><dt>Salary:</dt><dd>Starting at £18,160 per annum.</dd></dl>

Code which I have tried without success:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://www.mywebsite.com/test.html")
    soup = BeautifulSoup(r.content, "html.parser")
    dl_data = soup.find_all("dl")
    for dlitem in dl_data:
        print dlitem.find("dt",text="Date:").parent.findNext("dd").contents[0]
        print dlitem.find("dt",text="Salary:").parent.findNext("dd").contents[0]

Expected Result:

    13 September 2015
    15 December 2015
    10 January 2014
    Starting at £40,130 per annum.
    Starting at £22,460 per annum.
    Starting at £18,160 per annum.

Actual Result:

    print dlitem.find("dt",text="Date:").parent.findNext("dd").contents[0]
    AttributeError: 'NoneType' object has no attribute 'parent'

I have tried numerous variations of this code and gone round in circles. I figured out how to print all dd elements to the screen, just not specific dd elements!

Thanks

Solution

If the order is not important, just make a small change:

    ...
    dl_data = soup.find_all("dd")
    for dlitem in dl_data:
        print dlitem.string

Result:

    13 September 2015
    Starting at £40,130 per annum.
    15 December 2015
    Starting at £22,460 per annum.
    10 January 2014
    Starting at £18,160 per annum.

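To pull out only the 'Date:' or 'Salary:' values instead of every dd, one option is to find each dt by its label and take the dd that follows it as a sibling. The sketch below is a minimal illustration against the example HTML from the question, written for Python 3. The helper name `value_for` is hypothetical, and the whitespace-tolerant label comparison is only an assumption about why `find` might return `None` on the real page (the example markup itself has no stray whitespace). Note also that `.parent.findNext("dd")` in the question starts from the enclosing dl, so it would return the first dd of the list whichever label was matched.

```python
from bs4 import BeautifulSoup

# Example HTML from the question, inlined so the sketch runs without a network call.
HTML = (
    "<dl><dt>Date:</dt><dd>13 September 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>15 December 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>10 January 2014</dd>"
    "<dt>Salary:</dt><dd>Starting at £18,160 per annum.</dd></dl>"
)


def value_for(dl, label):
    """Return the text of the dd following the dt labelled `label`, or None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    for dt in dl.find_all("dt"):
        # Compare stripped text so surrounding whitespace cannot break the match.
        if dt.get_text(strip=True) == label:
            dd = dt.find_next_sibling("dd")
            return dd.get_text(strip=True) if dd is not None else None
    # No dt with this label: return None instead of raising AttributeError.
    return None


soup = BeautifulSoup(HTML, "html.parser")
lists = soup.find_all("dl")

dates = [value_for(dl, "Date:") for dl in lists]
salaries = [value_for(dl, "Salary:") for dl in lists]

# All dates first, then all salaries, as in the expected result.
for line in dates + salaries:
    print(line)
```

Because the lookup goes through the label rather than the position, this keeps working if a list has extra dt/dd pairs or lists them in a different order; a missing label simply yields `None`.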
For your latest request:

    for item in list(zip(soup.find_all("dd")[0::3],soup.find_all("dd")[2::3])):
        date, salary = item
        print ', '.join([date.string, salary.string])

Output:

    13 September 2015, 100
    14 September 2015, 200
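
The `[0::3]` and `[2::3]` slices pair up dd elements purely by position, so the stride has to match the number of dd elements per record. As a hedged illustration, the same idea applied to the example HTML in the question, where each dl holds two dd elements, would use a stride of 2. This is a sketch for Python 3; the variable names `dds` and `pairs` are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Example HTML from the question: each dl has exactly two dd (date, then salary).
HTML = (
    "<dl><dt>Date:</dt><dd>13 September 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>15 December 2015</dd>"
    "<dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl>"
    "<dl><dt>Date:</dt><dd>10 January 2014</dd>"
    "<dt>Salary:</dt><dd>Starting at £18,160 per annum.</dd></dl>"
)

soup = BeautifulSoup(HTML, "html.parser")

# Query the document once and slice the same list twice.
dds = soup.find_all("dd")

# Even positions hold dates and odd positions hold salaries in this markup.
pairs = [
    (date.get_text(strip=True), salary.get_text(strip=True))
    for date, salary in zip(dds[0::2], dds[1::2])
]

for date, salary in pairs:
    print(", ".join([date, salary]))
```

Positional slicing is compact, but it relies on every list having the same number of dd elements in the same order; if one record is missing a field, every pair after it shifts.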
