BeautifulSoup getText from between <p>, not picking up subsequent paragraphs


Question


Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article. This is what I have so far:

from BeautifulSoup import BeautifulSoup
import feedparser
import re
import urllib

# Dictionaries
links = {}
titles = {}

# Variables
n = 0

rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

# Parse the RSS feed
feed = feedparser.parse(rss_url)

# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link

for k,v in links.items():
    # Open RSS feed
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

    # Strip HTML entity codes (e.g. &amp;) and the WATCH: prefix
    page = re.sub(r'&\w+;','',page)
    page = re.sub('WATCH:','',page)

    # Print Page
    print(page)
    print(" ")

    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This produces the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.

The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

>>> 

The problem is that this is only the first paragraph of each article; however, I need to show the entire article. Any help would be gratefully received.

Solution

You are getting close!

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

Using find (as you've noticed) stops after the first match. You need find_all (spelled findAll in the BeautifulSoup 3 release your code imports) if you want all the paragraphs. If the pages are formatted consistently (I've only looked over one), you could also use something like

soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.
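Putting the two suggestions together, the loop body could look something like the sketch below. It uses the modern bs4 package and parses an inline HTML snippet standing in for a fetched page; the div id is the one quoted above, and the markup itself is a hypothetical stand-in for the site's real structure:

```python
from bs4 import BeautifulSoup  # bs4 is the current successor to the BeautifulSoup 3 import in the question

# Stand-in for the HTML returned by urllib; structure is assumed, not taken from the real site
html = """
<div id="ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow to the article body first, then collect text from every paragraph inside it
body = soup.find("div", {"id": "ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField"})
text = "\n\n".join(p.get_text() for p in body.find_all("p"))
print(text)
```

Because find_all is scoped to the tag you call it on, searching inside the article div also skips navigation and footer paragraphs elsewhere on the page.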
