BeautifulSoup: how to extract text after a <br> tag


Question

I don't know how to reach the paragraph below using BeautifulSoup, or how to extract the particular text I want from it. I am new to Python and BS4.

My HTML is as follows:

<div class="inner-content">
  <div class="bred"></div>
  <div class="clrbth"></div>
  <h1></h1>
  <h4></h4>
  ...
  ...
  ...
  <p></p>
  <p></p>
  <p>

<!--This text I don't want -->

    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
    <br></br>


<!-- The text I want to extract using BeautifulSoup-->

    It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).

  </p>
  <p></p>
  <p></p>
  ...
  ...
  ...
  <div class="bred"></div>
  <div class="clrbth"></div>
  <h1></h1>
 </div>

Please tell me how to extract the above mentioned text from my HTML. Thanks.

Recommended answer

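You can use the find_all() method with the limit argument to get the third p tag in your HTML. Next, use .find(), which returns the first br tag in that third paragraph. From there, use .next_siblings, which returns a generator over the nodes that follow the br tag, and concatenate them with str.join(). The join works as long as every following sibling is a string; it raises a TypeError if one of them is a Tag.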

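>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')  # html: a string holding the markup from the question
>>> third_p = soup.find_all('p', limit=3)[-1]
>>> ''.join(third_p.find('br').next_siblings)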
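
The snippet above can be expanded into a self-contained sketch. Everything here beyond find_all(), find() and .next_siblings is an assumption made for illustration: the html string is a trimmed-down copy of the question's markup with shortened text, the helper name text_after_first_br is hypothetical, and html.parser is just one possible parser choice. Because HTML comments are also string nodes in BeautifulSoup, a plain join would include the text of the second comment, so this version skips Comment nodes and falls back to get_text() for any Tag it meets.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Assumption: a shortened copy of the question's markup, for illustration only.
html = """
<div class="inner-content">
  <p></p>
  <p></p>
  <p>
    <!--This text I don't want -->
    Lorem Ipsum is simply dummy text.
    <br></br>
    <!-- The text I want to extract using BeautifulSoup-->
    It is a long established fact.
  </p>
  <p></p>
</div>
"""


def text_after_first_br(paragraph):
    """Return the plain text that follows the first <br> inside `paragraph`.

    Hypothetical helper: skips HTML comments and flattens any nested tags.
    """
    br = paragraph.find("br")
    if br is None:
        return ""
    parts = []
    for node in br.next_siblings:
        if isinstance(node, Comment):
            continue  # comments are string nodes too, so filter them out
        if isinstance(node, NavigableString):
            parts.append(str(node))
        else:
            parts.append(node.get_text())  # a Tag: keep only its text
    return "".join(parts).strip()


soup = BeautifulSoup(html, "html.parser")
third_p = soup.find_all("p", limit=3)[-1]  # third <p> in document order
result = text_after_first_br(third_p)
print(result)
```

With this sample markup, the result should contain only the sentence after the break, without the leading text or either comment. Other parsers (lxml, html5lib) may treat the stray </br> differently, which is why the helper tolerates Tag siblings instead of assuming strings only.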
