Python: read chunks of complete rows from a large text file (column values split across multiple rows)
Question
I want to read a large .txt file (about 2.5 GB) in chunks and perform some operations on each chunk before loading it into a database.
The file has only two columns (the column delimiter is ¬) and the values are qualified with double quotes. Values in the second column can span multiple lines (see the sample below). I considered using this answer, but it might process incomplete rows because it depends on a preset chunk size. Can someone please help? I've included the sample data and code below.
Sample data (Sample_load_file.txt)
"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
*******************************************************************
This line also includes the column delimiter within text qualifier
*******************************************************************
# !¬!¬!¬|
"
Code
import pandas as pd
import os
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
db = DBConnection(server='XXXX', database='XXXX')

def read_in_chunks(file_object, chunk_size=1024):
    # Lazily read a file piece by piece (avoiding memory issues).
    # Default chunk size: 1k.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

def process_chunk(data):
    # Build a list of lines based on ' "\n" ' as custom separator
    data = data.split('"\n"')
    # Split each line based on ' "¬" ' as custom separator
    data = [line.split('"¬"') for line in data]
    # Clean up remaining double quotes
    data = [[e.replace('"', '') for e in line] for line in data]
    # Check the number of columns
    number_of_cols = len(str(data[0]).split('¬'))
    # Load data into a dataframe
    df = pd.DataFrame(data)
    # Reformat dataframe
    df.columns = df.iloc[0]  # Set first row as column index
    df = df.iloc[1:].reset_index(drop=True)  # Drop first line and reset index
    # Split the first column into two
    try:
        df[['LINE_ID', 'LINE_TEXT']] = df['LINE_ID¬LINE_TEXT'].str.split('¬', expand=True)
    except Exception:
        print('Error')
    del df['LINE_ID¬LINE_TEXT']
    # Add metadata
    df['loaded_by'] = 'XXXX'
    df['file_line_number'] = range(2, len(df) + 2)
    df['load_date'] = pd.Timestamp.now()  # pd.datetime was removed from pandas
    df['source_file'] = path
    df['loading_script'] = r'Load_Extracts.ipynb'
    # Load into the SQL db
    df.to_sql('SQL_table_name', db.engine, schema='dbo', index=False, if_exists='append')

# Load text file
with open(path) as f:
    for piece in read_in_chunks(f):
        process_chunk(piece)
Recommended answer
If LINE_ID fits on one line, you could try a generator that relies on the fact that the first line of a multiline record contains "¬":
def make_records(file):
    current = []
    for line in file:
        line = line.rstrip()
        if '"¬"' in line:
            if current:
                yield " ".join(current)
            current = [line]
        else:
            current.append(line)
    yield " ".join(current)
With the sample input:
>>> import io
>>>
>>> s = '''"LINE_ID"¬"LINE_TEXT"
... "C1111-G00-BC222"¬"this line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
... "
... "C22-f0-333"¬"2nd row. This line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
... *******************************************************************
... This line also includes the column delimiter within text qualifier
... *******************************************************************
... # !¬!¬!¬|
... "'''
>>> f = io.StringIO(s)
>>> for record in make_records(f):
...     print(record)
...
"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening "
"C22-f0-333"¬"2nd row. This line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening ******************************************************************* This line also includes the column delimiter within text qualifier ******************************************************************* # !¬!¬!¬| "
Notes: You may want to change what the generator yields (e.g. a list or tuple instead of a str), remove the double quotes, or skip the first row, depending on your needs. I used io.StringIO for illustration purposes only; in practice you would read from a "normal" file.
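The adjustments mentioned in the notes could look roughly like the sketch below. It is only one possible variant, not part of the original answer: the names make_rows, parse_record, SEP and skip_header are made up for illustration, and it assumes, like the generator above, that "¬" only appears on the first line of a record.

```python
import io

SEP = '"¬"'  # closing quote, column delimiter, opening quote (assumed layout)

def parse_record(lines):
    """Turn the physical lines of one record into a (line_id, line_text) tuple."""
    record = " ".join(lines)
    # Split only at the first separator; the text may contain further ¬ characters.
    line_id, _, line_text = record.partition(SEP)
    # Drop the text qualifiers that surround the whole record.
    if line_id.startswith('"'):
        line_id = line_id[1:]
    if line_text.endswith('"'):
        line_text = line_text[:-1]
    return line_id, line_text.strip()

def make_rows(file, skip_header=True):
    """Yield one (line_id, line_text) tuple per complete record."""
    current = []
    header_pending = skip_header
    for line in file:
        line = line.rstrip()
        if SEP in line and current:
            # A new record starts here, so the previous one is complete.
            if header_pending:
                header_pending = False  # the first complete record is the header
            else:
                yield parse_record(current)
            current = []
        current.append(line)
    if current and not header_pending:
        yield parse_record(current)

sample = io.StringIO('"LINE_ID"¬"LINE_TEXT"\n"A1"¬"first\nsecond\n"\n"B2"¬"x ¬ y"\n')
for row in make_rows(sample):
    print(row)
```

Joining the lines with a space mirrors the answer's generator; joining with "\n" instead would preserve the original line breaks in LINE_TEXT if that is what the database should store.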
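Since the question asks for chunked loading, the complete records produced by make_records could then be grouped into fixed-size batches, so that every batch contains only whole records. The helper below is a hypothetical sketch (batched_records and batch_size are illustrative names, not from the answer) using itertools.islice from the standard library.

```python
from itertools import islice

def batched_records(records, batch_size=1000):
    """Group an iterable of complete records into lists of up to batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the underlying iterable is exhausted
        yield batch

# Stand-in records for illustration; in practice this would be make_records(f).
records = (f"record {i}" for i in range(5))
for batch in batched_records(records, batch_size=2):
    print(len(batch), batch)
```

Each batch could then be turned into a DataFrame and appended with to_sql as in the question's process_chunk, keeping memory use bounded by batch_size records rather than by the file size.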