BeautifulSoup on multiple .html files


Problem description

I'm trying to extract information between fixed tags with BeautifulSoup, using the model suggested in a linked answer.

I have a lot of .html files in a folder, and I want to save the results of a BeautifulSoup script into another folder as individual .txt files. Each .txt file should have the same name as its original .html file but contain only the extracted content. The script I wrote (see below) processes the files successfully but does not write the extracted bits out to individual files.

import os
import glob
from bs4 import BeautifulSoup

dir_path = "C:My_folder\\tmp\\"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    my_data = (file_name)
    soup = BeautifulSoup(open(my_data, "r").read())
    for i in soup.select('font[color="#FF0000"]'):
        print(i.text)
        file_path = os.path.join(dir_path, file_name)
        text = open(file_path, mode='r').read()
        results = i.text
        results_dir = "C:\\My_folder\\tmp\\working"
        results_file = file_name[:-4] + 'txt'
        file_path = os.path.join(results_dir, results_file)
        open(file_path, mode='w', encoding='UTF-8').write(results)

Recommended answer

Glob returns full paths, so file_name already includes the directory and does not need to be joined with dir_path again. You are also re-opening the output file in 'w' mode for each font element you find, which replaces the contents of the file every time, so only the last match survives. Move the opening of the output file outside the inner loop. You should also use files as context managers (with the with statement) to ensure they are closed properly:

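import glob
import os.path
from bs4 import BeautifulSoup

dir_path = r"C:\My_folder\tmp"
results_dir = r"C:\My_folder\tmp\working"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    # Parse each input file once; the with block closes it afterwards.
    with open(file_name) as html_file:
        soup = BeautifulSoup(html_file, "html.parser")

    # glob returns full paths, so keep only the base name before joining;
    # joining results_dir with an absolute path would discard results_dir.
    base_name = os.path.basename(file_name)
    results_file = os.path.splitext(base_name)[0] + '.txt'

    # Open the output once per input file, then write every match to it.
    with open(os.path.join(results_dir, results_file), 'w', encoding='utf-8') as outfile:
        for i in soup.select('font[color="#FF0000"]'):
            print(i.text)
            outfile.write(i.text + '\n')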
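Because glob returns full paths, the output name has to be reduced to its base name before it is joined with the results folder: os.path.join drops all earlier components when a later one is an absolute path. The sketch below illustrates this under a few assumptions: it uses ntpath (the standard-library module that implements Windows path rules) so the behaviour can be checked on any operating system, and the helper results_path and the file name page1.html are hypothetical names introduced only for this illustration.

```python
import ntpath  # Windows path rules; importable on any OS (used here for illustration)


def results_path(html_path, results_dir):
    """Hypothetical helper: map an input .html path to a same-named .txt path in results_dir."""
    base_name = ntpath.basename(html_path)      # strip the directory part
    stem, _ext = ntpath.splitext(base_name)     # strip the .html extension
    return ntpath.join(results_dir, stem + ".txt")


html_path = r"C:\My_folder\tmp\page1.html"      # hypothetical file returned by glob
results_dir = r"C:\My_folder\tmp\working"

# Joining with a full path: results_dir is silently discarded.
naive = ntpath.join(results_dir, ntpath.splitext(html_path)[0] + ".txt")
print(naive)                                    # C:\My_folder\tmp\page1.txt

# Joining with the base name: the file lands in the results folder.
print(results_path(html_path, results_dir))     # C:\My_folder\tmp\working\page1.txt
```

In a real script running on Windows, os.path behaves the same way as ntpath, so os.path.basename and os.path.splitext can be used directly.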
