Strip HTML tags to get strings in Python

Views: 368 | Published: 2018/6/19 14:27:08 | Tags: python, html, html-parsing, beautifulsoup, strip


Problem description

I am trying to get some strings from an HTML file with BeautifulSoup, and every time I work with it I get only partial results.

I want to get the strings in every li element/tag. So far I've been able to get everything in ul like this.
    #!/usr/bin/python
    from bs4 import BeautifulSoup

    page = open("page.html")
    soup = BeautifulSoup(page)
    source = soup.select(".sidebar li")
And what I get is this:
    [<li class="first"> Def Leppard - Make Love Like A Man<span>Live</span> </li>,
     <li> Inxs - Never Tear Us Apart </li>,
     <li> Gary Moore - Over The Hills And Far Away </li>,
     <li> Linkin Park - Numb </li>,
     <li> Vita De Vie - Basul Si Cu Toba Mare </li>,
     <li> Nazareth - Love Hurts </li>,
     <li> U2 - I Still Haven't Found What I'm L </li>,
     <li> Blink 182 - All The Small Things </li>,
     <li> Scorpions - Wind Of Change </li>,
     <li> Iggy Pop - The Passenger </li>]
I want to get only the strings from this.
Solution
Use Beautiful Soup's .strings generator, or .stripped_strings to also drop the surrounding whitespace:
    for string in soup.stripped_strings:
        print(repr(string))
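Since the question asks for one string per li, the same generator can also be read from each selected element rather than from the whole document. The following is only a sketch under assumptions: the inline `html` string is a made-up stand-in for page.html (two entries copied from the question's output, with the sidebar class placed on the ul), and the parser name "html.parser" is a choice made here, not something the question specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.html, trimmed to two entries from the question.
html = """
<ul class="sidebar">
  <li class="first"> Def Leppard - Make Love Like A Man<span>Live</span> </li>
  <li> Inxs - Never Tear Us Apart </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# For each <li>, join its stripped text fragments into a single string.
titles = [" ".join(li.stripped_strings) for li in soup.select(".sidebar li")]

for title in titles:
    print(title)
```

Joining with a space keeps the text of the nested span ("Live") separate from the song title; a different separator may suit other pages better.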
From the docs:

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
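The documentation's own example is not reproduced on this page, so here is a small illustration instead. The HTML snippet is invented for the purpose (the span is added to give the tag more than one child), and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Made-up snippet: a tag that contains text, a nested tag, and trailing whitespace.
soup = BeautifulSoup("<li> Linkin Park - Numb <span>Live</span> </li>", "html.parser")

# .strings yields every text node under the tag, whitespace included.
raw = list(soup.li.strings)

for s in raw:
    print(repr(s))
```

Note that the whitespace-only fragment after the span is yielded as well, which is what motivates the stripped variant.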

or

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:
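As a rough illustration of that behaviour (not the documentation's example), the sketch below runs .stripped_strings on an invented snippet and, for comparison, shows get_text with a separator and strip=True, which returns a single string. The snippet and the "html.parser" choice are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up snippet with padding around the text and a whitespace-only tail.
soup = BeautifulSoup("<li> Linkin Park - Numb <span>Live</span> </li>", "html.parser")

# .stripped_strings trims each fragment and skips whitespace-only ones.
clean = list(soup.li.stripped_strings)
print(clean)

# get_text can produce one joined string directly.
joined = soup.li.get_text(" ", strip=True)
print(joined)
```

Which form is preferable depends on whether the pieces are needed separately (the generator) or as one string per element (get_text).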
