BeautifulSoup getText from between &lt;p&gt;, not picking up subsequent paragraphs


Question


Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article. This is what I have so far:

from BeautifulSoup import BeautifulSoup
import feedparser
import re
import urllib

# Dictionaries
links = {}
titles = {}

# Variables
n = 0

rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

# Parse the RSS feed
feed = feedparser.parse(rss_url)

# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link

for k,v in links.items():
    # Open RSS feed
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

    # Strip ampersand codes and WATCH:
    page = re.sub('&\w+;','',page)
    page = re.sub('WATCH:','',page)

    # Print Page
    print(page)
    print(" ")

    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This produces the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
​Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.

The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

>>> 

The problem is that this is the first paragraph of each article, however I need to show the entire article. Any help would be gratefully received.

Solution

You are getting close!

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

Using find (as you've noticed) stops after the first result. You need find_all if you want all the paragraphs. If the pages are formatted consistently (I only looked over one), you could also use something like

soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.
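Putting the two suggestions together, here is a minimal sketch of the difference between find and find_all, and of scoping the search to the wrapper div first. It uses the modern bs4 package rather than the BeautifulSoup 3 import in the question, and the HTML snippet is illustrative, not taken from the actual site:

```python
from bs4 import BeautifulSoup  # bs4 is the current package; the question imports the older BeautifulSoup 3

# Illustrative page: an article body inside the wrapper div, plus an unrelated paragraph outside it
html = """
<div id="ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<p>Footer text outside the article body.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first <p>, which is why the question's code printed one paragraph
first = soup.find('p').getText()

# find_all() returns every <p>; join their text to rebuild the full article
all_text = "\n".join(p.getText() for p in soup.find_all('p'))

# Scoping to the wrapper div first keeps unrelated paragraphs out of the result
body = soup.find('div', {'id': 'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
article = "\n".join(p.getText() for p in body.find_all('p'))

print(first)    # First paragraph.
print(article)
```

In the question's loop, this would mean replacing `soup.find('p').getText()` with a join over `find_all('p')`, ideally called on the wrapper div rather than on the whole soup.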
