Parsing HTML files in the same directory in Python
Problem description
I have written code that parses HTML files:
from bs4 import BeautifulSoup
import re
import os
from os.path import join

# Walk the current directory tree and process every .html file
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            with open(thefile, 'r') as f:
                contents = f.read()
            soup = BeautifulSoup(contents, 'lxml')
            Initialtext = soup.get_text()
            MediumText = Initialtext.lower().split()
            # Keep tokens made only of letters or underscores
            clean_tokens = [t for t in MediumText
                            if re.match(r'[^\W\d]*$', t)]
            removementWords = ['here', 'than']
            FinalResult = set()
            for somewords in range(len(clean_tokens)):
                if clean_tokens[somewords] not in removementWords:
                    FinalResult.add(clean_tokens[somewords])
I am struggling with two issues:
1) It stores the results in a separate collection for each file, while I need a single one containing the results from all files;
2) As a result, I cannot remove duplicates that occur across different files.
How can I handle this?
Recommended answer
I think I found where you went wrong. Here is your code with a small change.
from bs4 import BeautifulSoup
import re
import os
from os.path import join

# Define the set here, before the loops, so it collects the results of all files.
FinalResult = set()
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            with open(thefile, 'r') as f:
                contents = f.read()
            soup = BeautifulSoup(contents, 'lxml')
            Initialtext = soup.get_text()
            MediumText = Initialtext.lower().split()
            # Keep tokens made only of letters or underscores
            clean_tokens = [t for t in MediumText
                            if re.match(r'[^\W\d]*$', t)]
            removementWords = ['here', 'than']
            # FinalResult = set()  - defining it here was wrong: it was reset for every file
            for somewords in range(len(clean_tokens)):
                if clean_tokens[somewords] not in removementWords:
                    FinalResult.add(clean_tokens[somewords])
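To see why the position of the set matters, the accumulation step can be isolated from the file handling. The following is only a minimal sketch, not part of the original answer: the function name `collect_tokens`, the constant `STOP_WORDS`, and the sample strings in `pages` are hypothetical, and the strings merely stand in for what `soup.get_text()` would return for two files.

```python
import re

# Same words as removementWords in the code above
STOP_WORDS = {'here', 'than'}


def collect_tokens(texts, stop_words=STOP_WORDS):
    """Return the unique cleaned tokens found across all texts (hypothetical helper)."""
    result = set()  # created once, before the loop, so it spans every text
    for text in texts:
        tokens = text.lower().split()
        # Keep tokens made only of letters or underscores (no digits, no punctuation)
        clean = (t for t in tokens if re.match(r'[^\W\d]*$', t))
        # A set ignores values it already holds, so duplicates across texts vanish
        result.update(t for t in clean if t not in stop_words)
    return result


# Hypothetical strings standing in for the text of two HTML files
pages = ["Here is more text than before", "More text here and 42 numbers"]
unique_tokens = collect_tokens(pages)
print(sorted(unique_tokens))
```

Because `result` is created before the loop, "more" and "text", which appear in both sample strings, end up in the set only once. If a list is needed at the end, `sorted(unique_tokens)` or `list(unique_tokens)` converts the set.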