Extract text from an HTML file with BeautifulSoup/Python

Question

I am trying to extract the text from an HTML file. The HTML file looks like this:

<li class="toclevel-1 tocsection-1">
    <a href="#Baden-Württemberg"><span class="tocnumber">1</span>
        <span class="toctext">Baden-Württemberg</span>
    </a>
</li>
<li class="toclevel-1 tocsection-2">
    <a href="#Bayern">
        <span class="tocnumber">2</span>
        <span class="toctext">Bayern</span>
    </a>
</li>
<li class="toclevel-1 tocsection-3">
    <a href="#Berlin">
        <span class="tocnumber">3</span>
        <span class="toctext">Berlin</span>
    </a>
</li>

I want to extract the text from the last span tag of each item. For the first item that would be "Baden-Württemberg", which comes after class="toctext". I then want to put these texts into a Python list.

In Python I tried the following:

names = soup.find_all("span",{"class":"toctext"})

My output is this list:

[<span class="toctext">Baden-Württemberg</span>, <span class="toctext">Bayern</span>, <span class="toctext">Berlin</span>]

So how can I extract only the text between the tags?

Thanks, everyone.

Recommended answer

The find_all method returns a ResultSet, which is a subclass of list. Iterate over it to get the text of each tag.

for name in names:
    print(name.text)

Output:

Baden-Württemberg
Bayern
Berlin
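
Since the question asks for the texts in a Python list rather than printed output, a list comprehension over the same result is one way to get there. This is a minimal sketch: the html string is copied from the question, while the "html.parser" backend and the variable name name_list are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<li class="toclevel-1 tocsection-1">
    <a href="#Baden-Württemberg"><span class="tocnumber">1</span>
        <span class="toctext">Baden-Württemberg</span>
    </a>
</li>
<li class="toclevel-1 tocsection-2">
    <a href="#Bayern">
        <span class="tocnumber">2</span>
        <span class="toctext">Bayern</span>
    </a>
</li>
<li class="toclevel-1 tocsection-3">
    <a href="#Berlin">
        <span class="tocnumber">3</span>
        <span class="toctext">Berlin</span>
    </a>
</li>
"""

# "html.parser" is Python's built-in parser; it is an assumption here,
# since the question does not say how soup was created.
soup = BeautifulSoup(html, "html.parser")
names = soup.find_all("span", {"class": "toctext"})

# Collect the text of every matching tag into a plain list of strings.
name_list = [name.get_text(strip=True) for name in names]
print(name_list)  # ['Baden-Württemberg', 'Bayern', 'Berlin']
```

get_text(strip=True) behaves like .text here but also trims surrounding whitespace, which helps when the markup is indented.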

The built-in Python functions dir() and type() are always handy for inspecting an object.

print(dir(names))

[...,
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort',
 'source']
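
The dir() output above covers only half of that advice, so here is a small sketch of the type() side. The one-line markup and the variable names result_type, is_list and first_type are made up for illustration.

```python
from bs4 import BeautifulSoup

# Tiny hypothetical document, just enough to get one match.
soup = BeautifulSoup('<span class="toctext">Bayern</span>', "html.parser")
names = soup.find_all("span", {"class": "toctext"})

# type() shows what find_all actually hands back.
result_type = type(names).__name__    # class name of the container
is_list = isinstance(names, list)     # ResultSet subclasses list
first_type = type(names[0]).__name__  # class name of each element

print(result_type, is_list, first_type)  # ResultSet True Tag
```

This explains why the list methods (append, pop, sort, ...) show up in the dir() listing, and why each element offers .text: the container is a list subclass and the elements are Tag objects.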
