Beautiful Soup has extra </body> before actual end


Question

I am trying to scrape poems from PoetryFoundation.org. In one of my test cases, the HTML I pull for a specific poem includes an extra </body> before the actual end of the poem. When I view the page source for the poem online, there is no </body> in the middle of the poem (as expected). Here is an example using the URL of that specific case, so that others can try to reproduce the problem:

from bs4 import BeautifulSoup
from urllib.request import urlopen

poem_page = urlopen("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.read(), "html5lib")
print(poem_soup)

I'm running Python 3.5.1. I've tried this with the default parser, html.parser, as well as with html5lib and lxml.

In the printout, if you search for 'in the poem' you'll find the snippet of HTML below. It makes no sense: it ends the entire HTML document midway through the poem with </body></html> and then continues with the rest of the document:

in the poem</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></body></html>. But when we met,<br/><div style="text-indent: -1em; padding-left: 1em;"><br/>

I've looked at the source code online, and this is what it should be:

in the poem</em>. But when we met,<br></div><div style="text-indent: -1em; padding-left: 1em;">

I have no idea why scraping the page closes the entire HTML document partway through.
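
One thing worth checking, judging only from the source snippet quoted above: the </em> after "in the poem" does not appear to have a matching <em> open at that point, and lenient parsers differ in how they recover from such a stray closing tag. The following is a minimal sketch, not Beautiful Soup's own logic, that uses only the standard library's html.parser to list closing tags with no open counterpart. The class name StrayEndTagFinder and the SAMPLE string are hypothetical, made up for illustration from the markup quoted in this thread.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StrayEndTagFinder(HTMLParser):
    """Hypothetical helper: records closing tags with no open counterpart."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.stray = []      # (tag, (line, column)) for unmatched closers

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            # Pop up to and including the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass
        else:
            # Nothing open with this name: report it and leave the stack alone.
            self.stray.append((tag, self.getpos()))


# Illustrative sample modelled on the two poem lines quoted in the thread.
SAMPLE = (
    '<div>I can turn toward you and say <em>yes, <br></em></div>'
    '<div>it was you in the poem</em>. But when we met,<br></div>'
)

finder = StrayEndTagFinder()
finder.feed(SAMPLE)
finder.close()

for tag, (line, col) in finder.stray:
    print(f"unmatched </{tag}> at line {line}, column {col}")
```

Running the same finder over the downloaded page text would show whether the real document contains such a tag near the truncation point. The matching rule here is deliberately simple and ignores the implied-end-tag rules of real HTML, so treat its report as a hint rather than a verdict.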

Recommended answer

When I fetched the poem from your URL with html.parser, I got the same problem as you: the HTML was truncated at the "in the poem" position.

import requests
from bs4 import BeautifulSoup

poem_page = requests.get("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.text, "html.parser")
poem_div = poem_soup.find('div', class_='poem')
print(poem_div)

Output:

<div class="poem" data-view="ContentView">
<div style="text-indent: -1em; padding-left: 1em;">It seems a certain fear underlies everything. <br/></div><div style="text-indent: -1em; padding-left: 1em;">If I were to tell you something profound<br/></div><div style="text-indent: -1em; padding-left: 1em;"> it would be useless, as every single thing I know<br/></div><div style="text-indent: -1em; padding-left: 1em;"> is not timeless. I am particularly risk-averse.<br/></div><div style="text-indent: -1em; padding-left: 1em;"><br/></div><div style="text-indent: -1em; padding-left: 1em;">I choose someone else over me every time, <br/></div><div style="text-indent: -1em; padding-left: 1em;">as I'm sure they'll finish the task at hand, <br/></div><div style="text-indent: -1em; padding-left: 1em;">which is to say that whatever is in front of us<br/></div><div style="text-indent: -1em; padding-left: 1em;"> will get done if I'm not in charge of it.<br/></div><div style="text-indent: -1em; padding-left: 1em;"><br/></div><div style="text-indent: -1em; padding-left: 1em;">There is a limit to the number of times <br/></div><div style="text-indent: -1em; padding-left: 1em;">I can practice every single kind of mortification <br/></div><div style="text-indent: -1em; padding-left: 1em;">(of the flesh?). I can turn toward you and say <em>yes, <br/></em></div><div style="text-indent: -1em; padding-left: 1em;">it was you in the poem</div></div>

But after changing the parser to lxml, everything works.

import requests
from bs4 import BeautifulSoup

poem_page = requests.get("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956")
poem_soup = BeautifulSoup(poem_page.text, "lxml")
poem_div = poem_soup.find('div', class_='poem')
# print(poem_div)
# Each line of the poem is its own <div>; print the first child of each one.
for s in poem_div.find_all('div'):
    print(list(s.children)[0])

Output:

It seems a certain fear underlies everything. 
If I were to tell you something profound
 it would be useless, as every single thing I know
 is not timeless. I am particularly risk-averse.
<br/>
I choose someone else over me every time, 
as I'm sure they'll finish the task at hand, 
which is to say that whatever is in front of us
 will get done if I'm not in charge of it.
<br/>
There is a limit to the number of times 
I can practice every single kind of mortification 
(of the flesh?). I can turn toward you and say 
it was you in the poem. But when we met,
<br/>
you were actually wearing a shirt, and the poem 
wasn't about you or your indecipherable tattoo. 
The poem is always about me, but that one time 
I was in love with the memory of my twenties
<br/>
so I was, for a moment, in love with you 
because you remind me of an approaching
 subway brushing hair off my face with 
its hot breath. Darkness. And then light,
<br/>
the exact goldness of dawn fingering
 that brick wall out my bedroom window 
on Smith Street mornings when I'd wake
 next to godknowswho but always someone
<br/>
who wasn't a mistake, because what kind 
of mistakes are that twitchy and joyful 
even if they're woven with a particular 
thread of regret: the guy who used
<br/>
my toothbrush without asking,
I walked to the end of a pier with him,
would have walked off anywhere with him
until one day we both landed in California
<br/>
when I was still young, and going West
meant taking a laptop and some clothes
in a hatchback and learning about produce.
I can turn toward you, whoever you are,
<br/>
and say you are my lover simply because
I say you are, and that is, I realize,
a tautology, but this is my poem. I claim
nothing other than what I write, and even that,
<br/>
I'd leave by the wayside, since the only thing
to pack would be the candlesticks, and 
even those are burned through, thoroughly
replaceable. Who am I kidding? I don't
<br/>
own anything worth packing into anything.
We are cardboard boxes, you and I, stacked
nowhere near each other and humming
different tunes. It is too late to be writing this.
<br/>
I am writing this to tell you something less
than neutral, which is to say I'm sorry.
It was never you. It was always you:
your unutterable name, this growl in my throat.
<br/>
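
Note that list(s.children)[0] keeps only the first child of each line's div. In the output above, text inside a nested tag is therefore dropped (the italic "yes," is missing after "say"), and stanza breaks print as <br/>. A minimal sketch of an alternative that uses get_text() is shown below. It runs on a small inline sample rather than the live page, and the SAMPLE string and the poem_lines helper are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the markup shown in the answer's output.
SAMPLE = (
    '<div class="poem">'
    '<div>(of the flesh?). I can turn toward you and say <em>yes, <br/></em></div>'
    '<div><br/></div>'
    '<div>I choose someone else over me every time, <br/></div>'
    '</div>'
)


def poem_lines(html, parser="html.parser"):
    """Return one string per direct child <div> of the poem container."""
    soup = BeautifulSoup(html, parser)
    poem_div = soup.find("div", class_="poem")
    # get_text() joins the text of all descendants, so nested <em> text is
    # kept; a div holding only <br/> becomes an empty string.
    return [d.get_text().strip()
            for d in poem_div.find_all("div", recursive=False)]


lines = poem_lines(SAMPLE)
for line in lines:
    print(line)
```

For the real page, the same helper could be called with the response text and parser="lxml", as in the answer. Empty strings in the result mark the stanza breaks.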
