BeautifulSoup MemoryError When Opening Several Files in Directory

Problem description

Context: Every week, I receive a list of lab results in the form of an html file. Each week, there are about 3,000 results with each set of results having between two and four tables associated with them. For each result/trial, I only care about some standard information that is stored in one of these tables. That table can be uniquely identified because the first cell, first column always has the text "Lab Results".

Problem: The following code works great when I process one file at a time. That is, instead of running a for loop over the directory, I point get_data = open() to a specific file. However, I want to grab the data from the past few years and would rather not handle each file individually. Therefore, I used the glob module and a for loop to cycle through all the files in the directory. The issue is that I get a MemoryError by the time I reach the third file in the directory.

The Question: Is there a way to clear/reset the memory between each file? That way, I could cycle through all the files in the directory and not paste in each file name individually. As you can see in the code below, I tried clearing the variables with del, but that did not work.

Thanks.

from bs4 import BeautifulSoup
import glob
import gc

for FileName in glob.glob("\\Research Results\\*"):

    get_data = open(FileName,'r').read()

    soup = BeautifulSoup(get_data)

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    del get_data
    del soup
    del tables
    gc.collect()

print ("done")

Recommended answer

I'm a beginner programmer and I faced the same problem. I did three things that seemed to solve it:

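  1. Also call garbage collection ('gc.collect()') at the start of each iteration.
  2. Move the parsing into a function that is called on each iteration, so all the global variables become local variables and are deleted at the end of the function.
  3. Use soup.decompose().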

I think the second change probably solved it, but I didn't have time to check, and I don't want to change working code.
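The effect of the second change can be illustrated with a minimal sketch that uses only the standard library. The `Tree` class and the `parse_locally` function are hypothetical stand-ins, not part of bs4, and the sketch assumes an object without reference cycles: a name bound at module level keeps its object alive, while an object bound only to a local name becomes unreachable once the function returns.

```python
import gc
import weakref


class Tree:
    """Hypothetical stand-in for a large parsed document (not a bs4 class)."""

    def __init__(self, name):
        self.name = name


def parse_locally(name):
    # The tree is bound only to a local name, so the last strong
    # reference disappears when the function returns.
    tree = Tree(name)
    return weakref.ref(tree)


# Module-level binding: the object stays alive as long as the name does.
global_tree = Tree("global")
global_ref = weakref.ref(global_tree)

local_ref = parse_locally("local")
gc.collect()  # harmless here; it only matters when reference cycles exist

print(global_ref() is not None)  # is the module-level object still reachable?
print(local_ref() is None)       # has the function-local object been released?
```

A weak reference is used as the probe because it does not keep its target alive, so it reports whether anything else still does.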

For this code, the solution would be something like this:

from bs4 import BeautifulSoup
import glob
import gc

def parser(file):
    gc.collect()

    # read() returns a str, which has no close(); the with block closes the file.
    with open(file,'r') as in_file:
        get_data = in_file.read()

    soup = BeautifulSoup(get_data)
    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            # The with block closes the file, so no explicit close() is needed.
            with open("Results_File.txt","a") as out_file:
                out_file.write(complete_row + "\n")

    # Destroy the parsed tree, then collect whatever is left before returning.
    soup.decompose()
    gc.collect()
    return None


for filename in glob.glob("\\Research Results\\*"):
    parser(filename)

print ("done")
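One plausible reason that gc.collect() and soup.decompose() help: nodes in a parsed tree refer to their parents as well as their children, so the tree contains reference cycles, and deleting a name alone does not free it. The sketch below shows that mechanism with a hypothetical `Node` class. It is an illustration under that assumption, not bs4's actual implementation, and the `decompose` method here only mimics the idea of tearing the tree apart.

```python
import gc
import weakref


class Node:
    """Hypothetical tree node; parent and child refer to each other."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def decompose(self):
        # Break the links in both directions so no cycle remains.
        for child in list(self.children):
            child.decompose()
        self.children.clear()
        self.parent = None


def build_tree():
    root = Node()
    Node(parent=root)  # root -> child -> root forms a cycle
    return root


gc.disable()  # keep the automatic cycle collector out of the demonstration
try:
    root = build_tree()
    probe = weakref.ref(root)
    del root
    alive_after_del = probe() is not None        # the cycle keeps the tree alive
    gc.collect()
    alive_after_collect = probe() is not None    # the collector reclaims it

    root = build_tree()
    probe = weakref.ref(root)
    root.decompose()
    del root
    alive_after_decompose = probe() is not None  # no cycle left to hold it
finally:
    gc.enable()

print(alive_after_del, alive_after_collect, alive_after_decompose)
```

In normal operation the cycle collector runs automatically, so this does not by itself explain the MemoryError; it only shows why an explicit collection or an explicit teardown can release a tree sooner.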
