Python BeautifulSoup Paragraph Text Only

Question

I am very new to anything related to web scraping, and as I understand it, Requests and BeautifulSoup are the way to go. I want to write a program that emails me just one paragraph from a given link every couple of hours (I'm trying a new way to read blogs through the day). Say this particular link, 'https://fs.blog/mental-models/', has a paragraph on each of several different models.

from bs4 import BeautifulSoup
import re
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

Now soup has a wall of bits before the paragraph text begins: <p> this is what I want to read </p>

soup.title.string works perfectly fine, but I don't know how to move ahead from here. Any directions?

Thanks

Recommended answer

Loop over soup.find_all('p') to find all the p tags, then use .text to get their text:

Furthermore, do all of that under the div with the class rte, since you don't want the footer paragraphs.

from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'    
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Restrict the search to the content div(s) so footer paragraphs are skipped
div_tags = soup.find_all("div", {"class": "rte"})
for div in div_tags:
    p_tags = div.find_all('p')
    for p in p_tags[:-2]:  # trim the last two irrelevant-looking paragraphs
        print(p.text)

Output:

Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the "tails" of the distribution).
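
Since the goal in the question is to read one paragraph at a time, here is a minimal sketch of how the extracted paragraphs might be collected into a list and handed out one per time slot. It runs on a small inline HTML sample rather than the live page, so the markup, the helper names extract_paragraphs and pick_paragraph, and the slot counter are all assumptions made for illustration. Sending the email itself is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><head><title>Mental Models</title></head><body>
<div class="rte">
  <p>First model paragraph.</p>
  <p>  Second model paragraph.  </p>
  <p></p>
  <p>Third model paragraph.</p>
</div>
<footer><p>Footer text that should be skipped.</p></footer>
</body></html>
"""


def extract_paragraphs(html, container_class="rte"):
    """Return the non-empty paragraph texts found inside divs of the given class."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for div in soup.find_all("div", class_=container_class):
        for p in div.find_all("p"):
            text = p.get_text(strip=True)  # drop surrounding whitespace
            if text:  # skip empty <p> tags
                paragraphs.append(text)
    return paragraphs


def pick_paragraph(paragraphs, slot):
    """Pick one paragraph for a time slot, wrapping around at the end of the list."""
    if not paragraphs:
        raise ValueError("no paragraphs to choose from")
    return paragraphs[slot % len(paragraphs)]


paragraphs = extract_paragraphs(SAMPLE_HTML)
print(pick_paragraph(paragraphs, 0))
print(pick_paragraph(paragraphs, 4))
```

In a real script the slot number could be a counter saved between runs, with a scheduler such as cron calling the script every few hours. That part depends on your setup and is not shown here.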
