Python: Read chunks of complete rows from a large text file (column values split across multiple rows)


Question

I want to read a large .txt file (c. 2.5 GB) in chunks and then perform some operations on each chunk before loading it into a database.

The file has only 2 columns (the column delimiter is ¬), and values are qualified with double quotes. Values in the second column can span multiple lines (sample below). I thought of using this answer, but the issue is that it might process incomplete rows, because where each chunk ends depends on a preset chunk size rather than on row boundaries. Can someone please help? I've included the sample data and code below.

Sample data (Sample_load_file.txt)

"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"

Code

import pandas as pd
import os
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
db = DBConnection(server ='XXXX', database='XXXX')

def read_in_chunks(file_object, chunk_size=1024):
    #Lazy load to read a file piece by piece (avoiding memory issues)
    #Default chunk size: 1k.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
        
def process_chunk(data):
    #Build a list of lines based on ' "\n" ' as custom separator
    data = data.split('"\n"')
    
    #Split each line based on ' "¬" ' as custom separator
    data = [line.split('"¬"') for line in data]
    
    #Cleanup remaining double quotes
    data = [[e.replace('"', '') for e in line] for line in data]
    
    #Check the number of columns
    number_of_cols = len(str(data[0]).split('¬'))
    number_of_cols
    
    #Load data into a dataframe
    df = pd.DataFrame(data)
    
    #Reformat dataframe
    df.columns = df.iloc[0] # Set first row as column index
    df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
    
    #Split the first column into two
    try:
        df[['LINE_ID', 'LINE_TEXT']] = df['LINE_ID¬LINE_TEXT'].str.split('¬',expand=True)
    except:
        print('Error')
    del df['LINE_ID¬LINE_TEXT']
    
    #Add metadata
    df['loaded_by'] = 'XXXX'
    df['file_line_number'] = range(2,len(df)+2)
    df['load_date'] = pd.Timestamp.now()
    df['source_file'] = path
    df['loading_script'] = r'Load_Extracts.ipynb'    
    
    #Load in SQL db
    df.to_sql('SQL_table_name', db.engine, schema='dbo', index=False, if_exists='append')
    
#Load text file
with open(path) as f:
    for piece in read_in_chunks(f):
        process_chunk(piece)

Recommended answer

If LINE_ID always fits on one line, you could try a generator that relies on the fact that the first line of a multiline record contains the sequence "¬" (closing quote, delimiter, opening quote), while continuation lines do not:

def make_records(file):
    current = []
    for line in file:
        line = line.rstrip()
        if '"¬"' in line:
            if current:
                yield " ".join(current)
            current = [line]
        else:
            current.append(line)
    yield " ".join(current)

With the sample input:

>>> import io
>>> 
>>> s = '''"LINE_ID"¬"LINE_TEXT"
... "C1111-G00-BC222"¬"this line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
... "
... "C22-f0-333"¬"2nd row. This line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
...   *******************************************************************
...   This line also includes the column delimiter within text qualifier
...   *******************************************************************
...   # !¬!¬!¬|
... "'''
>>> f = io.StringIO(s)
>>> for record in make_records(f):
...    print(record)
... 
"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening "
"C22-f0-333"¬"2nd row. This line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening   *******************************************************************   This line also includes the column delimiter within text qualifier   *******************************************************************   # !¬!¬!¬| "

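Since the original goal was to process the file in chunks, the records from a generator like the one above can be grouped into batches that only ever contain complete records. The following is a minimal sketch, not part of the original answer: batched_records and its batch_size parameter are hypothetical names, make_records is repeated so the block runs on its own (with an added guard so an empty file yields nothing), and the short sample string is made up for illustration.

```python
import io
from itertools import islice


def make_records(file):
    """Yield one logical record per iteration.

    Same idea as the generator above; the final `if current` guard is an
    addition so that an empty file yields no records.
    """
    current = []
    for line in file:
        line = line.rstrip()
        if '"¬"' in line:
            # A new record starts here, so emit the one collected so far.
            if current:
                yield " ".join(current)
            current = [line]
        else:
            current.append(line)
    if current:
        yield " ".join(current)


def batched_records(file, batch_size=1000):
    """Yield lists of up to batch_size complete records (hypothetical helper)."""
    records = make_records(file)
    while True:
        # islice pulls at most batch_size records without reading the whole file.
        batch = list(islice(records, batch_size))
        if not batch:
            break
        yield batch


# Made-up sample: a header plus three records, two of them multiline.
sample = '"ID"¬"TEXT"\n"a"¬"one\ntwo\n"\n"b"¬"three"\n"c"¬"four\nfive"\n'
batches = list(batched_records(io.StringIO(sample), batch_size=2))
for batch in batches:
    print(batch)
```

Each batch could then be turned into a DataFrame and passed to to_sql as in the question's code; the batch size is counted in records rather than bytes, so a record is never cut in half.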
Notes: Depending on your needs, you may want to change what the generator yields (e.g., a list or tuple instead of a str), remove the double quotes, or skip the first row. I used io.StringIO for illustration purposes only; in practice you would read from a "normal" file.
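As one possible take on those notes, the sketch below yields (LINE_ID, LINE_TEXT) tuples, strips the surrounding double quotes, and skips the header row. It is an illustration under assumptions rather than the answer's own code: parse_record, make_rows and skip_header are hypothetical names, and it still assumes that the sequence "¬" appears only on the first line of each record.

```python
import io


def make_records(file):
    """Yield one logical record per iteration (same idea as the generator above)."""
    current = []
    for line in file:
        line = line.rstrip()
        if '"¬"' in line:
            if current:
                yield " ".join(current)
            current = [line]
        else:
            current.append(line)
    if current:
        yield " ".join(current)


def parse_record(record):
    """Split a record into (line_id, line_text) without the text qualifiers."""
    # Split only at the first "¬" so a bare ¬ inside LINE_TEXT is preserved.
    line_id, _, line_text = record.partition('"¬"')
    if line_id.startswith('"'):
        line_id = line_id[1:]
    line_text = line_text.rstrip()
    if line_text.endswith('"'):
        line_text = line_text[:-1].rstrip()
    return line_id, line_text


def make_rows(file, skip_header=True):
    """Yield (line_id, line_text) tuples, optionally skipping the header row."""
    records = make_records(file)
    if skip_header:
        next(records, None)  # discard the header record if there is one
    for record in records:
        yield parse_record(record)


# Made-up sample based on the question's data.
sample = '"LINE_ID"¬"LINE_TEXT"\n"C22-f0-333"¬"2nd row.\n  # !¬!¬!¬|\n"\n'
rows = list(make_rows(io.StringIO(sample)))
print(rows)
```

Joining with a space flattens each record onto one line, as in the answer; joining with "\n" instead would keep the original line breaks inside LINE_TEXT.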
