Open every file/subfolder in directory and print results to .txt file
Question
At the moment I am working with this code:
from bs4 import BeautifulSoup
import contextlib
import glob
import os
import re
import sys


@contextlib.contextmanager
def stdout2file(fname):
    # Temporarily send everything printed to stdout into the file fname.
    f = open(fname, 'w')
    sys.stdout = f
    try:
        yield
    finally:
        # Restore stdout and close the file even if the body raises.
        sys.stdout = sys.__stdout__
        f.close()


def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    with stdout2file("output.txt"):
        # '**' together with recursive=True also descends into subfolders (Python 3.5+).
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                if re.match(".*AuditFeesExpenses", item['name']):
                    # Print only the file name, not the whole path.
                    print(file.split(os.path.sep)[-1], end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())


trade_spider()
So far this works perfectly. But now I am stuck on another issue. If I search within a folder that has no subfolders but only files, this works without problems. However, if I try to run this code on a folder that has subfolders, it doesn't work (it prints nothing!). Furthermore, I would like to get my results printed into a .txt file without the whole path in it. The result should look like:
Filename.html| RegEX Match| HTML text
I do get this result already, but only in PyCharm and not in a separate .txt file.
To sum up, I do have 2 questions:
- How can I go through the subfolders of the directory I defined? -> Would os.walk() be an option?
- How can I print the results into a .txt file? -> Would sys.stdout work for this?
Any help appreciated on this issue!
UPDATE: It only prints the first result of the first file into my "output.txt" file (at least I think it is the first, as it is the last file in my only subfolder and recursive=True is activated). Any idea why it is not looping through all the other files?
UPDATE_2: Question resolved! Final Code can be seen above!
Recommended answer
For walking in subdirectories, there are two options:
- Use ** with glob and the argument recursive=True (glob.glob('**/*.html', recursive=True)). This only works in Python 3.5+. I would also recommend using glob.iglob instead of glob.glob if the directory tree is large.
- Use os.walk and check the filenames (whether they end in ".html") manually or with fnmatch.filter.
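As a rough illustration of the two options, here is a minimal sketch that collects .html files both ways. The function names find_html_with_walk and find_html_with_glob are made up for this example, and the root directory is whatever you pass in.

```python
import fnmatch
import glob
import os


def find_html_with_walk(root):
    """Collect all .html files below root using os.walk and fnmatch.filter."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # fnmatch.filter keeps only the names that match the pattern.
        for name in fnmatch.filter(filenames, "*.html"):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def find_html_with_glob(root):
    """Collect the same files with a recursive glob (Python 3.5+)."""
    pattern = os.path.join(root, "**", "*.html")
    # iglob yields paths lazily, which helps on large directory trees.
    return sorted(glob.iglob(pattern, recursive=True))
```

Calling os.path.basename() on each returned path gives the bare file name for the output. One difference to keep in mind: glob skips names starting with a dot by default, while os.walk reports them.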
Regarding the printing into a file, there are again several ways:
- Just execute the script and redirect stdout, i.e. python3 myscript.py >myfile.txt
- Replace calls to print with a call to the .write() method of a file object in write mode.
- Keep using print, but give it the argument file=myfile where myfile is again a writable file object.
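To make the last two options concrete, here is a small sketch. The helper names write_with_print and write_with_write are invented for this example, and rows stands for (filename, name, text) tuples you have already collected; both helpers produce lines in the "Filename.html| RegEX Match| HTML text" layout from the question.

```python
def write_with_print(rows, out_path):
    """Keep using print, but send the output to a file via file=."""
    with open(out_path, "w", encoding="utf8") as myfile:
        for filename, name, text in rows:
            print(filename, name, text, sep="| ", file=myfile)


def write_with_write(rows, out_path):
    """Write the same lines with the file object's .write() method."""
    with open(out_path, "w", encoding="utf8") as myfile:
        for row in rows:
            # Unlike print, .write() does not add the newline for you.
            myfile.write("| ".join(row) + "\n")
```

Either variant leaves sys.stdout untouched, so anything else the script prints still shows up in the console.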
edit: Maybe the most unobtrusive method would be the following. First, include this somewhere:
import contextlib

@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    try:
        yield
    finally:
        # Restore stdout and close the file even if the body raises.
        sys.stdout = sys.__stdout__
        f.close()
And then, in front of the line in which you loop over the files, add this line (and indent the loop accordingly):
with stdout2file("output.txt"):
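As a possible alternative to a hand-written context manager, the standard library has offered contextlib.redirect_stdout since Python 3.4. The sketch below assumes rows holds (filename, name, text) tuples; the function name print_results_to_file is made up for illustration.

```python
import contextlib


def print_results_to_file(rows, out_path):
    """Print each row into out_path by temporarily redirecting stdout."""
    with open(out_path, "w", encoding="utf8") as f:
        # Inside this block, plain print() calls end up in the file.
        with contextlib.redirect_stdout(f):
            for filename, name, text in rows:
                print(filename, name, text, sep="| ")
```

redirect_stdout puts back whatever sys.stdout was before the block, rather than sys.__stdout__, which may matter in environments that install their own stdout object.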