Strip HTML tags to get strings in Python

Problem description

I tried to get some strings from an HTML file with BeautifulSoup, and every time I work with it I get only partial results.

I want to get the string inside every li element. So far I've been able to get everything in the ul like this:

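#!/usr/bin/python
from bs4 import BeautifulSoup

# Name the parser explicitly and close the file once it has been parsed
with open("page.html") as page:
    soup = BeautifulSoup(page, "html.parser")
# CSS selector: every <li> under an element with class "sidebar"
source = soup.select(".sidebar li")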

And what I get is this:

[<li class="first">
        Def Leppard -  Make Love Like A Man<span>Live</span> </li>, <li>
        Inxs - Never Tear Us Apart        </li>, <li>
        Gary Moore - Over The Hills And Far Away        </li>, <li>
        Linkin Park -  Numb        </li>, <li>
        Vita De Vie -  Basul Si Cu Toba Mare        </li>, <li>
        Nazareth - Love Hurts        </li>, <li>
        U2 - I Still Haven't Found What I'm L        </li>, <li>
        Blink 182 -  All The Small Things        </li>, <li>
        Scorpions -  Wind Of Change        </li>, <li>
        Iggy Pop - The Passenger        </li>]

I want to get only the strings from this.

Solution

Use Beautiful Soup's .strings generator, or .stripped_strings if you also want the surrounding whitespace removed.

for string in soup.stripped_strings:
    print(repr(string))

From the docs:

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
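As a rough illustration of what .strings yields, here is a small sketch on an inline snippet modelled on the first li in the question. The `html` sample and the variable names are made up for this example, and `html.parser` is only one possible parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the first <li> in the question.
html = (
    '<ul class="sidebar"><li class="first">\n'
    "    Def Leppard - Make Love Like A Man<span>Live</span> </li></ul>"
)
soup = BeautifulSoup(html, "html.parser")

li = soup.select_one(".sidebar li")

# .strings yields every text node under the tag, whitespace included.
raw_strings = list(li.strings)
for s in raw_strings:
    print(repr(s))
```

The nested span contributes its own string, and the pieces still carry the newlines and indentation from the markup.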

or

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:
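Applied to the question, one possible approach is to loop over the selected li elements and join each element's stripped_strings, so that every li gives exactly one string. This is only a sketch: the inline `html` below is a shortened, hypothetical stand-in for page.html, and `titles` is a name chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.html, shortened to three entries.
html = """
<ul class="sidebar">
  <li class="first">
        Def Leppard -  Make Love Like A Man<span>Live</span> </li>
  <li>
        Inxs - Never Tear Us Apart        </li>
  <li>
        Iggy Pop - The Passenger        </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for li in soup.select(".sidebar li"):
    # Join the pieces so a nested tag such as <span> does not split an entry.
    titles.append(" ".join(li.stripped_strings))

for title in titles:
    print(title)
```

If the pieces do not need a custom separator, li.get_text(" ", strip=True) should give the same result in a single call.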
