Efficient way to extract text from between tags
Question
Suppose I have something like this:
var = '<li> <a href="/...html">Energy</a>
<ul>
<li> <a href="/...html">Coal</a> </li>
<li> <a href="/...html">Oil </a> </li>
<li> <a href="/...html">Carbon</a> </li>
<li> <a href="/...html">Oxygen</a> </li'
What is the best (most efficient) way to extract the text between the tags? Should I use a regex for this? My current technique relies on splitting the string on li tags and using a for loop; I am just wondering whether there is a faster way to do this.
Recommended answer
You can use Beautiful Soup, which is very good for this kind of task. It is straightforward, easy to install, and well documented.
Some of the li tags in your example are not closed. I have corrected them, and this is how to get the text of the link inside each li tag:
from bs4 import BeautifulSoup
var = '''<li> <a href="/...html">Energy</a></li>
<ul>
<li><a href="/...html">Coal</a></li>
<li><a href="/...html">Oil </a></li>
<li><a href="/...html">Carbon</a></li>
<li><a href="/...html">Oxygen</a></li>'''
# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(var, "html.parser")
for a in soup.find_all('a'):
    print(a.string)
It will print:
Energy
Coal
Oil
Carbon
Oxygen
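If you want the values in a list rather than printed, and without stray whitespace such as the trailing space in "Oil ", a minimal sketch along the same lines might look like this. The shortened `html` string and the `labels` name are illustrative, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed version of the sample markup (illustrative only)
html = '''<li><a href="/...html">Energy</a></li>
<ul>
<li><a href="/...html">Coal</a></li>
<li><a href="/...html">Oil </a></li>
</ul>'''

# "html.parser" is the parser bundled with Python, so nothing extra is installed
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the text inside each tag with
# leading and trailing whitespace removed
labels = [a.get_text(strip=True) for a in soup.find_all("a")]
print(labels)
```

Collecting into a list keeps the extraction separate from whatever you do with the text afterwards.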
For documentation and more examples, see the BeautifulSoup documentation.
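If installing a third-party package is not an option, the same idea can be sketched with the standard library's `html.parser` module instead of a regex or manual splitting. This is a different technique from the one in the answer above, shown only as an assumption-laden alternative; `LinkTextParser`, `extract_link_text`, and `sample` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text found inside each <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.texts.append("")  # start a new entry for this link

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears while inside an <a> element
        if self.in_link:
            self.texts[-1] += data


def extract_link_text(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


sample = '<li><a href="/...html">Coal</a></li><li><a href="/...html">Oil </a></li>'
print(extract_link_text(sample))
```

This handles only the simple case of plain text inside links; for anything messier, Beautiful Soup is likely the more comfortable choice.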