Convert multiple .txt files into single .csv file (python)
Problem description
I need to convert a folder with around 4,000 .txt files into a single .csv with two columns: (1) Column 1: 'File Name' (as specified in the original folder); (2) Column 2: 'Content' (which should contain all text present in the corresponding .txt file).
Here you can see some of the files I am working with.
The most similar question to mine here is this one (Combine a folder of text files into a CSV with each content in a cell) but I could not implement any of the solutions presented there.
The last one I tried was the Python code proposed in the aforementioned question by Nathaniel Verhaaren but I got the exact same error as the question's author (even after implementing some suggestions):
import os
import csv

dirpath = 'path_of_directory'
output = 'output_file.csv'
with open(output, 'w') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(['FileName', 'Content'])
    files = os.listdir(dirpath)
    for filename in files:
        with open(dirpath + '/' + filename) as afile:
            csvout.writerow([filename, afile.read()])
            afile.close()
    outfile.close()
Other questions which seemed similar to mine (for example, Python: Parsing Multiple .txt Files into a Single .csv File?, Merging multiple .txt files into a csv, and Converting 1000 text files into a single csv file) do not solve this exact problem I presented (and I could not adapt the solutions presented to my case).
Recommended answer
I had a similar requirement and so I wrote the following class
import os
import pathlib
import glob
import csv
from collections import defaultdict

class FileCsvExport:
    """Generate a CSV file containing the name and contents of all files found"""

    def __init__(self, directory: str, output: str, header = None, file_mask = None, walk_sub_dirs = True, remove_file_extension = True):
        self.directory = directory
        self.output = output
        self.header = header
        self.pattern = '**/*' if walk_sub_dirs else '*'
        if isinstance(file_mask, str):
            self.pattern = self.pattern + file_mask
        self.remove_file_extension = remove_file_extension
        self.rows = 0

    def export(self) -> bool:
        """Return True if the CSV was created"""
        return self.__make(self.__generate_dict())

    def __generate_dict(self) -> defaultdict:
        """Finds all files recursively based on the specified parameters and returns a defaultdict"""
        csv_data = defaultdict(list)
        for file_path in glob.glob(os.path.join(self.directory, self.pattern), recursive = True):
            path = pathlib.Path(file_path)
            if not path.is_file():
                continue
            content = self.__get_content(path)
            name = path.stem if self.remove_file_extension else path.name
            csv_data[name].append(content)
        return csv_data

    @staticmethod
    def __get_content(file_path: str) -> str:
        with open(file_path) as file_object:
            return file_object.read()

    def __make(self, csv_data: defaultdict) -> bool:
        """
        Takes a defaultdict of {k, [v]} where k is the file name and v is a list of file contents.
        Writes out these values to a CSV and returns True when complete.
        """
        with open(self.output, 'w', newline = '') as csv_file:
            writer = csv.writer(csv_file, quoting = csv.QUOTE_ALL)
            if isinstance(self.header, list):
                writer.writerow(self.header)
            for key, values in csv_data.items():
                for duplicate in values:
                    writer.writerow([key, duplicate])
                    self.rows = self.rows + 1
        return True
It can be used like this:
...
myFiles = r'path/to/files/'
outputFile = r'path/to/output.csv'
exporter = FileCsvExport(directory = myFiles, output = outputFile, header = ['File Name', 'Content'], file_mask = '.txt')
if exporter.export():
    print(f"Export complete. Total rows: {exporter.rows}.")
In my example directory, this returns
Export complete. Total rows: 6.
Note: rows does not count the header, if present.
This generated the following CSV file:
"File Name","Content"
"Test1","This is from Test1"
"Test2","This is from Test2"
"Test3","This is from Test3"
"Test4","This is from Test4"
"Test5","This is from Test5"
"Test5","This is in a sub-directory"
Optional parameters:
- header: Takes a list of strings that will be written as the first line in the CSV. Default: None.
- file_mask: Takes a string that can be used to specify the file type; for example, .txt will cause it to match only .txt files. Default: None.
- walk_sub_dirs: If set to False, it will not search in sub-directories. Default: True.
- remove_file_extension: If set to False, the file name is written with its extension included; for example, File.txt instead of just File. Default: True.
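To make the effect of file_mask and walk_sub_dirs concrete, here is a minimal sketch of how those two options turn into a glob pattern and which files that pattern selects. The helpers build_pattern and matching_files are hypothetical names introduced only for this illustration (they are not part of the class), and the temporary directory with its three files is made-up test data.

```python
import glob
import os
import tempfile
from pathlib import Path


def build_pattern(file_mask=None, walk_sub_dirs=True):
    """Hypothetical helper mirroring the pattern logic in FileCsvExport.__init__."""
    pattern = "**/*" if walk_sub_dirs else "*"
    if isinstance(file_mask, str):
        pattern = pattern + file_mask
    return pattern


def matching_files(directory, pattern):
    """Return matched files relative to `directory`, with forward slashes, sorted."""
    found = glob.glob(os.path.join(directory, pattern), recursive=True)
    return sorted(
        Path(p).relative_to(directory).as_posix() for p in found if Path(p).is_file()
    )


with tempfile.TemporaryDirectory() as root:
    # A throwaway tree: two files at the top level, one in a sub-directory.
    Path(root, "sub").mkdir()
    Path(root, "Test1.txt").write_text("top level")
    Path(root, "notes.md").write_text("not a .txt file")
    Path(root, "sub", "Test5.txt").write_text("in a sub-directory")

    recursive_txt = matching_files(root, build_pattern(".txt", walk_sub_dirs=True))
    flat_txt = matching_files(root, build_pattern(".txt", walk_sub_dirs=False))
    everything = matching_files(root, build_pattern(None, walk_sub_dirs=True))

print(recursive_txt)  # .txt files at any depth, including sub/Test5.txt
print(flat_txt)       # only the top-level .txt file
print(everything)     # every file, because no mask was given
```

With walk_sub_dirs left at True the pattern is '**/*.txt', which is why a file with the same stem in a sub-directory shows up as a second row with the same name in the CSV.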
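For a single flat folder like the one described in the question, the same two-column output can also be sketched as one short function. This is only an illustration under stated assumptions: the name folder_to_csv is hypothetical, the files are assumed to be UTF-8 text, and errors="replace" is a choice made here so that undecodable bytes do not stop the run. It is not a diagnosis of the error mentioned in the question, which the post does not show.

```python
import csv
from pathlib import Path


def folder_to_csv(directory, output, encoding="utf-8"):
    """Write one CSV row per .txt file in `directory`; return the row count."""
    rows = 0
    # newline="" lets the csv module control line endings, as in the class above.
    with open(output, "w", newline="", encoding=encoding) as csv_file:
        writer = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
        writer.writerow(["File Name", "Content"])
        # sorted() gives a stable row order; only the top-level folder is scanned.
        for path in sorted(Path(directory).glob("*.txt")):
            # Assumption: replace undecodable bytes with U+FFFD instead of
            # raising UnicodeDecodeError.
            text = path.read_text(encoding=encoding, errors="replace")
            writer.writerow([path.name, text])
            rows += 1
    return rows
```

A call such as folder_to_csv('path/to/files', 'path/to/output.csv') returns the number of data rows written, not counting the header. Unlike the class, this sketch keeps the file extension in the first column and does not descend into sub-directories.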