Parsing HTML files in the same directory in Python

Views: 61 Published: 2021/4/15 19:18:22 Tags: python html python-3.x parsing beautifulsoup


Problem description

I have written the following code to parse HTML files:

from bs4 import BeautifulSoup
import re
import os
from os.path import join

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                clean_tokens = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']

                FinalResult = set()
                for token in clean_tokens:
                    if token not in removementWords:
                        FinalResult.add(token)

I am struggling with the following:

1) It saves the results in a separate set for each file, while I need one collection with all results from all files;

2) As a result, I cannot remove duplicates that occur across different files.

How should I handle this?

Recommended answer

I think I found where you went wrong. Here is your code with a small change.

from bs4 import BeautifulSoup
import re
import os
from os.path import join

# Define the set here, before the loops, so that it collects the results from all files.
FinalResult = set()

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                clean_tokens = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']

                # FinalResult = set() - wrong place: this created a new set for every file
                for token in clean_tokens:
                    if token not in removementWords:
                        FinalResult.add(token)
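
The same idea can be packaged as a function and tried on two small files. What follows is only a minimal sketch, not the answerer's code: `TextExtractor` and `collect_tokens` are hypothetical names, the standard-library `html.parser.HTMLParser` is swapped in for BeautifulSoup so that the example has no third-party dependencies (its text extraction only roughly matches `soup.get_text()`), and the `encoding="utf-8"` argument is an assumption about the input files.

```python
import os
import re
import tempfile
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (rough stand-in for soup.get_text())."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_tokens(root, stop_words=("here", "than")):
    """Return ONE set of tokens gathered from every .html file under root."""
    result = set()  # created once, before the loops, so it accumulates across files
    for dirname, _dirs, files in os.walk(root):
        for filename in files:
            if not filename.endswith(".html"):
                continue
            path = os.path.join(dirname, filename)
            with open(path, "r", encoding="utf-8") as f:
                parser = TextExtractor()
                parser.feed(f.read())
                parser.close()
            tokens = " ".join(parser.chunks).lower().split()
            # Same regex as above: keep tokens made only of letters/underscores.
            result.update(t for t in tokens if re.match(r"[^\W\d]*$", t))
    # A set drops duplicates by itself; set difference removes the stop words.
    return result - set(stop_words)


# Demonstration with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.html"), "w", encoding="utf-8") as f:
        f.write("<html><body><p>Hello world here</p></body></html>")
    with open(os.path.join(tmp, "b.html"), "w", encoding="utf-8") as f:
        f.write("<html><body><p>Hello again than 42</p></body></html>")
    demo_result = collect_tokens(tmp)
    print(sorted(demo_result))  # prints ['again', 'hello', 'world']
```

"hello" appears in both files but only once in the result, because a single set is shared by all files. If a list is needed at the end, `sorted(demo_result)` or `list(demo_result)` converts it.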
