Extracting the text between two header tags using BeautifulSoup in Python


Problem description

I am trying to extract the plot of a movie from its Wikipedia page, in Python using BeautifulSoup. I am new to Python and BeautifulSoup, so I am not sure how to approach it.

This is the input HTML:

<h2><span class="mw-headline" id="Plot">Plot</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=1" title="Edit section: Plot">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>A small <a href="/wiki/Pounamu" title="Pounamu">pounamu</a> stone that is the mystical heart of the island <a href="/wiki/Goddess" title="Goddess">goddess</a> Te Fiti is stolen by the <a href="/wiki/Demigod" title="Demigod">demigod</a> <a href="/wiki/M%C4%81ui_(mythology)" title="Māui (mythology)">Maui</a>, who was planning to give it to humanity as a gift. As Maui makes his escape, he is attacked by the lava <a href="/wiki/Demon" title="Demon">demon</a> Te Kā, causing the heart of Te Fiti as well as his power-granting magical fish hook to be lost in the ocean.</p>
<p>A millennium later, young Moana Waialiki, daughter and heir of the chief on the small <a href="/wiki/Polynesia" title="Polynesia">Polynesian</a> island of Motunui, is chosen by the ocean to receive the heart, but drops it when her father, Chief Tui, comes to get her. He insists the island provides everything the villagers need. But years later, fish become scarce and the island's vegetation begins dying. Moana proposes going beyond the reef to find more fish. Tui rejects her request, as sailing past the reef is forbidden.</p>
<p>Moana's grandmother Tala shows Moana a secret cave behind a waterfall, where she finds boats inside and discovers her ancestors were voyagers, sailing and discovering new islands across the world. Tala explains that they stopped voyaging because Maui stole the heart of Te Fiti, causing Te Kā and monsters to appear in the ocean. Tala then says Te Kā's darkness has been spreading from island to island, slowly killing them. Tala gives Moana the heart of Te Fiti, which she has kept safe for her granddaughter.</p>
<p>Tala falls ill and with her dying breaths tells Moana to set sail. Moana and her pet <a href="/wiki/Rooster" title="Rooster">rooster</a> Heihei depart in a <a href="/wiki/Drua" title="Drua">drua</a> to find Maui. A <a href="/wiki/Manta_ray" title="Manta ray">manta ray</a>, Tala's reincarnation, follows. After a <a href="/wiki/Typhoon" title="Typhoon">typhoon</a> wave flips her sailboat and knocks her unconscious, she awakens the next morning on an island inhabited by Maui, who traps her in a cave and takes her sailboat to search for his fishhook. After escaping and catching up to Maui, Moana tries to convince him to return the heart, but Maui refuses, fearing its power will attract dark creatures.</p>
<p>Sentient coconut pirates called Kakamora surround the boat and steal the heart, but Maui and Moana retrieve it. Maui agrees to help return the heart, but only after he reclaims his hook, which is hidden in Lalotai, the Realm of Monsters. At Lalotai, they retrieve it by tricking Tamatoa, a giant <a href="/wiki/Coconut_crab" title="Coconut crab">coconut crab</a>. Maui teaches Moana how to properly sail and navigate. They arrive at Te Fiti, where Te Kā attacks. Maui is overpowered and Te Kā severely damages his hook and repels their boat far out to sea. Fearful that returning to fight Te Kā will destroy his hook, Maui abandons Moana.</p>
<p>Distraught, Moana begs the ocean to take the heart and choose another person to return it to Te Fiti. The spirit of Tala comes to her and encourages to find her true calling within herself. Inspired, Moana retrieves the heart from the ocean and returns to Te Fiti alone. Maui, having had a change of heart, returns to distract the lava demon, and his hook is destroyed in the battle. Realizing that Te Kā is actually Te Fiti without her heart, Moana asks the ocean to clear a path for Te Kā to approach her. She sings a song, asking Te Kā to remember who she truly is, allowing Moana to restore her heart. Te Fiti returns and gives a new canoe to Moana and a new magical hook to Maui before returning to her island form.</p>
<p>In a <a href="/wiki/Post-credits_scene" title="Post-credits scene">post-credits scene</a>, Tamatoa, who has been stranded on his back during Moana and Maui's escape, grumbles to the audience that they would help him if he was a <a href="/wiki/Sebastian_(Disney)" title="Sebastian (Disney)">Jamaican crab named Sebastian</a>.</p>
<h2><span class="mw-headline" id="Cast">Cast</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=2" title="Edit section: Cast">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<div class="thumb tright">

I want to extract only the text between the two h2 tags, which is the plot. How can I do that with BeautifulSoup?

EDIT 1: This is the code I have right now.

import urllib
from BeautifulSoup import *

movie = raw_input('Enter:')
main = "http://www.wikipedia.org"
url = "http://www.wikipedia.org/wiki/"+movie+"_(disambiguation)"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve a list of the anchor tags
# Each tag is like a dictionary of HTML attributes
tags = soup('a')
for tag in tags:
    chk = tag.get('href', None)
    chk = str(chk)
    if "film" in chk:
        final = chk

html = urllib.urlopen(main+final).read()
soup = BeautifulSoup(html)
new = []
spa = soup.findAll("span",id = "Plot")
spa_1 = soup.findAllNext("p")
for i in spa_1:
    print i

I tried to locate the element with id=Plot and then print all the p tags that follow it.

Recommended answer

The document is structured like this:

[h2] / [span id=Plot]
...
[h2]

We can search for the span with an id of "Plot", then walk through the siblings of its parent (the h2 node), collecting their text, until we reach the next h2 header.

# collect plot in this list
plot = []

# find the node with id of "Plot"
mark = soup.find(id="Plot")

# walk through the siblings of the parent (H2) node 
# until we reach the next H2 node
for elt in mark.parent.nextSiblingGenerator():
    if elt.name == "h2":
        break
    if hasattr(elt, "text"):
        plot.append(elt.text)

# enjoy
print("".join(plot))
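
For readers on the newer bs4 package, the same sibling walk might look like the sketch below. It is only an illustration under some assumptions: the `SAMPLE_HTML` string is a shortened stand-in for the Wikipedia markup in the question, `text_between_headers` is a hypothetical helper name, and the built-in "html.parser" backend is used so no extra parser is needed. It uses the `next_siblings` attribute, which is the bs4 spelling of the sibling generator, and skips bare strings (such as the newlines between tags) with an `isinstance` check.

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the Wikipedia markup shown in the question.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="Plot">Plot</span></h2>
<p>A small <a href="/wiki/Pounamu">pounamu</a> stone is stolen by Maui.</p>
<p>A millennium later, Moana is chosen by the ocean.</p>
<h2><span class="mw-headline" id="Cast">Cast</span></h2>
<p>Cast list goes here.</p>
"""


def text_between_headers(soup, start_id, stop_tag="h2"):
    """Collect the text of each tag that follows the header containing
    the element with id=start_id, stopping at the next stop_tag."""
    mark = soup.find(id=start_id)
    if mark is None:
        return []  # no such section on this page

    paragraphs = []
    # mark is the <span>; its parent is the <h2> whose siblings we walk.
    for elt in mark.parent.next_siblings:
        if not isinstance(elt, Tag):
            continue  # skip whitespace strings and comments between tags
        if elt.name == stop_tag:
            break  # reached the next section header
        paragraphs.append(elt.get_text())
    return paragraphs


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
plot = text_between_headers(soup, "Plot")
print("\n\n".join(plot))
```

Returning a list instead of one joined string leaves the choice of separator to the caller; joining with a blank line keeps the paragraph breaks that "".join(plot) would lose.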
