Airflow: S3FileTransformOperator: Python script runs fine locally but NOT in S3FileTransformOperator (tried different approaches)


Problem description

Requirement: Edit the last row of an S3 file to remove the double quotes and the trailing pipe delimiters, then upload the result back to the same S3 path.

Operator

    cleanup = S3FileTransformOperator(
        task_id='cleanup',
        source_s3_key='s3://path/outbound/incoming.txt',
        dest_s3_key='s3://path/outbound/incoming.txt',
        replace=True,
        transform_script='/usr/local/airflow/dags/scripts/clean_up.py'
    )
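
For context: S3FileTransformOperator downloads source_s3_key to a temporary file, runs the transform script as a subprocess with the source and destination temp paths as its two arguments, and afterwards uploads the destination temp file to dest_s3_key. Roughly, the call it makes looks like this (a simplified sketch with hypothetical temp paths, not the actual Airflow source):

    import subprocess

    # Hypothetical paths standing in for the operator's two NamedTemporaryFiles.
    source_tmp = '/tmp/tmp_source'  # the operator dumps the S3 object here
    dest_tmp = '/tmp/tmp_dest'      # the script must write its output here; it starts out empty

    # Effectively: transform_script <source> <dest>
    subprocess.check_call(
        ['/usr/local/airflow/dags/scripts/clean_up.py', source_tmp, dest_tmp]
    )

The scripts below therefore receive the downloaded file as sys.argv[1] and an empty destination file as sys.argv[2], which matters for the errors that follow.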

> Approach 1

Issue: The Python script runs fine locally, but when run through Airflow it throws the error below.

Error: cannot mmap an empty file
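
This error is easy to reproduce outside Airflow: CPython's mmap refuses to map a file that contains no bytes yet (a minimal repro, not the poster's code):

    import mmap
    import tempfile

    with tempfile.NamedTemporaryFile() as f:
        # Length 0 means "map the whole file", but the file is still empty.
        mmap.mmap(f.fileno(), 0)  # raises ValueError: cannot mmap an empty file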

Note the output readline : b'' in the log below: the destination temp file the operator hands to the script is still empty when the script tries to mmap it.

Log:

[2020-07-07 19:21:20,706] {s3_file_transform_operator.py:115} INFO - Downloading source S3 file s3://path/outbound/incoming.txt
[2020-07-07 19:21:24,224] {s3_file_transform_operator.py:124} INFO - Dumping S3 file s3://path/outbound/incoming.txt contents to local file /tmp/tmp9ihtv1up
[2020-07-07 19:21:59,988] {s3_file_transform_operator.py:145} INFO - Output:
[2020-07-07 19:22:00,183] {s3_file_transform_operator.py:147} INFO - Error in updating the file. Message: cannot mmap an empty file
[2020-07-07 19:22:00,183] {s3_file_transform_operator.py:147} INFO - Starting data cleaning...
[2020-07-07 19:22:00,183] {s3_file_transform_operator.py:147} INFO - input readline : b'"4405348400"|""|""|0|"R"|""|""|""|""|""|""|"23 Main"|"St"|""|""|""|"Holmdel"|"NJ"|"07733"|"N"\n'
[2020-07-07 19:22:00,183] {s3_file_transform_operator.py:147} INFO - b'TR|4826301'
[2020-07-07 19:22:00,183] {s3_file_transform_operator.py:147} INFO - output readline : b''
[2020-07-07 19:22:00,187] {s3_file_transform_operator.py:147} INFO - Traceback (most recent call last):
[2020-07-07 19:22:00,187] {s3_file_transform_operator.py:147} INFO -   File "/usr/local/airflow/dags/scripts/neustar_sid_clean_up.py", line 41, in <module>
[2020-07-07 19:22:00,187] {s3_file_transform_operator.py:147} INFO -     perform_cleanup(input, output)
[2020-07-07 19:22:00,187] {s3_file_transform_operator.py:147} INFO -   File "/usr/local/airflow/dags/scripts/neustar_sid_clean_up.py", line 27, in perform_cleanup
[2020-07-07 19:22:00,187] {s3_file_transform_operator.py:147} INFO -     with closing(mmap.mmap(output.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
[2020-07-07 19:22:00,188] {s3_file_transform_operator.py:147} INFO - ValueError: cannot mmap an empty file
[2020-07-07 19:22:00,497] {__init__.py:1580} ERROR - Transform script failed: 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models/__init__.py", line 1436, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.6/site-packages/airflow/operators/s3_file_transform_operator.py", line 153, in execute
    "Transform script failed: {0}".format(process.returncode)
airflow.exceptions.AirflowException: Transform script failed: 1

Code:

#!/usr/bin/env python3
import re
from contextlib import closing
import mmap
import sys
import logging

logger = logging.getLogger(__name__)


def perform_cleanup(input, output):
    try:
        with open(input, 'r+b') as input, open(output, 'r+b') as output:
            print(f'input readline : {input.readline()}')
            with closing(mmap.mmap(input.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
                start_of_line = mm.rfind(b'\n', 0, len(mm) - 1) + 1
                line = mm[start_of_line:].rstrip(b'\r\n')
                last_line = line.decode('utf-8').replace("\"", "")
                last_line = re.sub('[|]*$', '', last_line).encode('utf-8')
                print(last_line)
            print(f'output readline : {output.readline()}')
            # This is where it fails: the destination temp file is empty,
            # so mmap raises "cannot mmap an empty file".
            with closing(mmap.mmap(output.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
                print(output.readline())
                start_of_line = mm.rfind(b'\n', 0, len(mm) - 1) + 1
            output.seek(start_of_line)  # Move to where old line began
            output.write(last_line)     # Overwrite existing line with new line
            output.truncate()
    except Exception as ex:
        logger.error(f'Error in updating the file. Message: {ex}')
        raise


input = sys.argv[1]
output = sys.argv[2]

print("Starting data cleaning...")
perform_cleanup(input, output)
print("Completed data cleaning!")

> Approach 2

Issue: The code below runs fine locally, but when run through Airflow it does not work for a big file and replaces the file with an empty file.

Log for a small file:

[2020-07-07 20:35:37,892] {s3_file_transform_operator.py:115} INFO - Downloading source S3 file s3://path/incoming.2020-07-07.txt
[2020-07-07 20:35:41,981] {s3_file_transform_operator.py:124} INFO - Dumping S3 file s3://path/incoming.2020-07-07.txt contents to local file /tmp/tmp3v_6i1go
[2020-07-07 20:35:42,115] {s3_file_transform_operator.py:145} INFO - Output:
[2020-07-07 20:35:42,293] {s3_file_transform_operator.py:147} INFO - Starting data cleaning...
[2020-07-07 20:35:42,293] {s3_file_transform_operator.py:147} INFO - Completed data cleaning!
[2020-07-07 20:35:42,298] {s3_file_transform_operator.py:158} INFO - Transform script successful. Output temporarily located at /tmp/tmp8uo9t2lk
[2020-07-07 20:35:42,298] {s3_file_transform_operator.py:161} INFO - Uploading transformed file to S3
[2020-07-07 20:35:43,983] {s3_file_transform_operator.py:168} INFO - Upload successful

Log for a big file:

[2020-07-07 20:25:37,892] {s3_file_transform_operator.py:115} INFO - Downloading source S3 file s3://path/incoming.2020-07-07.txt
[2020-07-07 20:25:52,027] {s3_file_transform_operator.py:124} INFO - Dumping S3 file s3://path/incoming.2020-07-07.txt contents to local file /tmp/tmpgayy9hg9
[2020-07-07 20:26:26,256] {s3_file_transform_operator.py:145} INFO - Output:
[2020-07-07 20:26:29,137] {s3_file_transform_operator.py:158} INFO - Transform script successful. Output temporarily located at /tmp/tmpui1i28r6
[2020-07-07 20:26:29,137] {s3_file_transform_operator.py:161} INFO - Uploading transformed file to S3

Code 2:

#!/usr/bin/env python3
import re
from contextlib import closing
import mmap
import os
import sys
import logging

logger = logging.getLogger(__name__)


def clnup(input, output):
    """Read the last line of the file, remove the double quotes and extra
    delimiters, and write it back to the file."""
    try:
        # Fix the last line of the input file in place via mmap.
        with open(input, 'r+b') as myfile:
            with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
                start_of_line = mm.rfind(b'\n', 0, len(mm) - 1) + 1
                line = mm[start_of_line:].rstrip(b'\r\n')
                last_line = line.decode('utf-8').replace("\"", "")
                last_line = re.sub('[|]*$', '', last_line).encode('utf-8')
            myfile.seek(start_of_line)  # Move to where old line began
            myfile.write(last_line)     # Overwrite existing line with new line
            myfile.truncate()
        # Copy input -> temp.txt -> output (each read loads the whole file into memory).
        with open(input, 'r+b') as myfile, open('temp.txt', 'w+b') as f:
            f.write(myfile.read())
        with open('temp.txt', 'r+b') as myfile, open(output, 'w+b') as f:
            f.write(myfile.read())
        os.remove('temp.txt')
    except Exception as ex:
        logger.error(f'Error in updating the file. Message: {ex}')
        raise


input = sys.argv[1]
output = sys.argv[2]

print("Starting data cleaning...")
clnup(input, output)
print("Completed data cleaning!")

(edited)
If you check the log for the big file, the two lines below are missing:

[2020-07-07 20:35:42,293] {s3_file_transform_operator.py:147} INFO - Starting data cleaning...
[2020-07-07 20:35:42,293] {s3_file_transform_operator.py:147} INFO - Completed data cleaning!

> Approach 3

Issue: The code below runs fine locally, but when run through Airflow it replaces the file with an empty file. (This version edits the source temp file in place and never writes anything to the destination path in sys.argv[2], which would explain the empty upload.)

Code:

#!/usr/bin/env python3
import re
from contextlib import closing
import mmap
import sys
import logging

logger = logging.getLogger(__name__)

input = sys.argv[1]


def clnup(input):
    try:
        with open(input, 'r+b') as myfile:
            with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
                start_of_line = mm.rfind(b'\n', 0, len(mm) - 1) + 1
                line = mm[start_of_line:].rstrip(b'\r\n')
                last_line = line.decode('utf-8').replace("\"", "")
                last_line = re.sub('[|]*$', '', last_line).encode('utf-8')
            myfile.seek(start_of_line)  # Move to where old line began
            myfile.write(last_line)     # Overwrite existing line with new line
            myfile.truncate()
    except Exception as ex:
        logger.error(f'Error in updating the file. Message: {ex}')
        raise


print("Starting data cleaning...")
clnup(input)
print("Completed data cleaning!")

Recommended answer

You are filling all the memory by reading the whole file into a single string. Read one line at a time instead: call readline (without the s) in a loop, or simply iterate over the file object, which yields lines lazily, and write the output one line at a time.
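
A minimal sketch of that approach as a transform script (my own illustration of the answer, assuming the same pipe-delimited format as above): stream the input line by line, holding back one line so the final line can be cleaned before it is written.

    #!/usr/bin/env python3
    import re
    import sys


    def clean_last_line(line):
        # Strip double quotes and trailing pipe delimiters from the trailer row.
        text = line.rstrip(b'\r\n').decode('utf-8').replace('"', '')
        return re.sub('[|]*$', '', text).encode('utf-8')


    def perform_cleanup(input_path, output_path):
        with open(input_path, 'rb') as src, open(output_path, 'wb') as dst:
            prev = None
            for line in src:         # the file object yields one line at a time
                if prev is not None:
                    dst.write(prev)  # prev is not the last line; write it unchanged
                prev = line
            if prev is not None:
                dst.write(clean_last_line(prev))  # clean only the final line


    if __name__ == '__main__':
        perform_cleanup(sys.argv[1], sys.argv[2])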
