Extract HTML tags from a text file through iteration, append them to a list, and ignore all other characters in Python
Problem description
I want to read an HTML file and extract only the tags from it.
- Read one character at a time from the file, ignoring everything until a "<" is reached (ignore the "<" as well).
- Read one character at a time, appending each to a string, until a ">" or whitespace is reached (ignore the ">" as well).
<html>
<body>
<h1>This is test</h1>
<h2> This is test 2<h2>
</body>
</html>
with open('doc.txt', 'r') as f:
    all_lines = []
    # loop through all lines using the f.readlines() method
    for line in f.readlines():
        new_line = []
        # this is how you would loop through each character
        for chars in line:
            new_line.append(chars)
        all_lines.append(new_line)
print(all_lines)
I can iterate through the text file and get a list like the one below:
[['<', 'h', 't', 'm', 'l', '>', '\n'], ['<', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'b', 'o', 'd', 'y', '>', '\n'], ['<', '/', 'h', 't', 'm', 'l', '>']]
but the expected output should be: [html,body,h1,/h1,/h2,/body,/html]
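The two reading steps described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the function name `extract_tags` and the `sample` string are hypothetical, and the sample assumes the last line of the file is the closing `</html>` tag, as the printed list suggests.

```python
def extract_tags(chars):
    """Collect tag names from an iterable of single characters."""
    tags = []
    name = []          # characters of the tag name being collected
    state = "outside"  # "outside", "name" (collecting) or "rest" (skipping)
    for ch in chars:
        if state == "outside":
            # Ignore everything up to and including "<".
            if ch == "<":
                name = []
                state = "name"
        elif state == "name":
            # Stop collecting at ">" or at the first whitespace character.
            if ch == ">" or ch.isspace():
                if name:
                    tags.append("".join(name))
                state = "outside" if ch == ">" else "rest"
            else:
                name.append(ch)
        else:
            # state == "rest": skip attributes until the tag is closed.
            if ch == ">":
                state = "outside"
    return tags


# Hypothetical sample mirroring the file shown in the question.
sample = """<html>
<body>
<h1>This is test</h1>
<h2> This is test 2<h2>
</body>
</html>"""

print(extract_tags(sample))
```

Because the function accepts any iterable of characters, it could also be fed directly from the file one character at a time, for example with `extract_tags(iter(lambda: f.read(1), ""))` inside the `with open('doc.txt', 'r') as f:` block.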
Recommended answer
In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']
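The session above relies on `re` being imported and on `html` already holding the file contents. A self-contained version might look like the sketch below; the `html` string is an assumption reconstructed from the sample in the question (with a closing `</html>` on the last line), and the second pattern is an added variant, not part of the original answer.

```python
import re

# Assumed contents of the file, reconstructed from the question's sample.
html = """<html>
<body>
<h1>This is test</h1>
<h2> This is test 2<h2>
</body>
</html>"""

# Non-greedy match of everything between "<" and the next ">".
tags = re.findall(r'<(.*?)>', html)
print(tags)

# Variant (an addition): capture only the tag name, so attributes
# such as href="..." are left out of the result.
tag_names = re.findall(r'<(/?\w+)', html)
print(tag_names)
```

For this sample both patterns give the same list. They differ once tags carry attributes: the first pattern returns the attributes along with the name, while the second stops at the end of the name.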
Simply use a regex or an HTMLParser.
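For the HTMLParser route, a minimal sketch using Python 3's standard `html.parser` module could look like this. The class name `TagCollector` and the `sample` string are hypothetical, and the sample again assumes the file ends with `</html>`.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start and end tag in the order it is encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored; only the tag name is kept.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Prefix end tags with "/" to match the format in the question.
        self.tags.append('/' + tag)


# Assumed contents of the file, reconstructed from the question's sample.
sample = """<html>
<body>
<h1>This is test</h1>
<h2> This is test 2<h2>
</body>
</html>"""

collector = TagCollector()
collector.feed(sample)
collector.close()
print(collector.tags)
```

Unlike the regex, the parser reports tag names in lowercase and skips comments and attribute text, which tends to make it the safer choice for real-world HTML.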