Extract highlighted text on a webpage


Question

I would like to know if there is any way to extract highlighted text from a paragraph on a webpage.

After a long search, I came across the python-docx module (https://python-docx.readthedocs.io/en/latest/), but it is for documents, not webpages.

For example, let's say we have the following paragraph:

"Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog. It features questions and answers on a wide range of topics in computer programming."

Now, in the paragraph above, let's say the bold strings of words are the ones I highlighted, and I want to extract and output them. Is there a way I can do this on a webpage?

So the output should be: privately held website ; Experts-Exchange ; wide range of topics.

Recommended answer

You can do this with bs4 (Beautiful Soup), provided the highlighted text is marked up as bold in the page's HTML. First make sure bs4 and requests are installed; if they are not, run these two commands:

pip install requests
pip install bs4

Then write a Python script like this:

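from bs4 import BeautifulSoup
import requests

# Placeholder URL from the original answer; replace it with the page you want to read.
page_url = 'http://127.0.0.1:1234'
response = requests.get(page_url)
response.raise_for_status()  # stop early on an HTTP error
# html.parser ships with Python; features="lxml" also works if lxml is installed.
soup = BeautifulSoup(response.text, features="html.parser")
# Bold text is usually marked up with <b> or <strong>.
for bold in soup.find_all(['b', 'strong']):
    print(bold.get_text(strip=True))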
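
To try the idea without a running web server or third-party packages, here is a minimal sketch that swaps bs4 for the standard library's html.parser module and reads the HTML from a string. The names HighlightExtractor, extract_highlights, HIGHLIGHT_TAGS and sample_html are made up for this illustration, and treating b, strong and mark as "highlight" tags is an assumption; a page that highlights with a CSS class or inline style would need a different check.

```python
from html.parser import HTMLParser

# Tags treated as "highlighted". This set is an assumption; adjust it to the page.
HIGHLIGHT_TAGS = {"b", "strong", "mark"}


class HighlightExtractor(HTMLParser):
    """Collects the text found inside highlight tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of highlight tags currently open
        self.current = []     # text pieces of the highlight being read
        self.highlights = []  # finished highlighted strings

    def handle_starttag(self, tag, attrs):
        if tag in HIGHLIGHT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in HIGHLIGHT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self.current).split())
                if text:
                    self.highlights.append(text)
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def extract_highlights(html):
    """Return the highlighted strings in document order."""
    parser = HighlightExtractor()
    parser.feed(html)
    parser.close()
    return parser.highlights


# Shortened version of the paragraph from the question, with <b> markup assumed.
sample_html = (
    "<p>Stack Overflow is a <b>privately held website</b>, created to be an "
    "alternative to sites such as <b>Experts-Exchange</b>. It features "
    "questions on a <b>wide range of topics</b> in computer programming.</p>"
)

print(" ; ".join(extract_highlights(sample_html)))
```

Nested highlight tags are merged into the outermost one, so a bold phrase that contains a mark element is reported once rather than twice.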
