Unable to run tasks on Azure Batch: Nodes go into unusable state after starting up


Problem description


I am trying to parallelize a Python app using Azure Batch. The workflow that I have followed in the Python client-side script is:

1) Upload local files to an Azure Blob container using the blobxfer utility (input-container).

2) Start the Batch service to process the files in the input container after logging in with the service principal account via azure-cli.

3) Upload the files to the output container through the Python app distributed across the nodes by Azure Batch.

I am experiencing a problem very similar to the one described in this post, but unfortunately no solution was given there: Nodes go into Unusable State.

I will now give the relevant information so that one can reproduce this error:

The image that was used for Azure Batch is custom.

1) Ubuntu Server 18.04 LTS was chosen as the OS for the VM, and the following ports were opened: ssh, http, https. The rest of the settings were kept at their defaults in the Azure portal.

2) The following script was run once the server was available.

sudo apt-get install build-essential checkinstall -y
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev \
    libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev -y
cd /usr/src
sudo wget https://www.python.org/ftp/python/3.6.6/Python-3.6.6.tgz
sudo tar xzf Python-3.6.6.tgz
cd Python-3.6.6
sudo ./configure --enable-optimizations
sudo make altinstall
sudo pip3.6 install --upgrade pip
sudo pip3.6 install pymupdf==1.13.20
sudo pip3.6 install tqdm==4.19.9
sudo pip3.6 install sentry-sdk==0.4.1
sudo pip3.6 install blobxfer==1.5.0
sudo pip3.6 install azure-cli==2.0.47

3) An image of this server was created using the process outlined in this link: Creating VM Image in Azure Linux. Also, during deprovisioning the user was not deleted: sudo waagent -deprovision

4) The resource ID of the image was noted from the Azure Portal. This is supplied as one of the parameters in the Python client-side script.
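
For reference, the same resource ID can also be read back with the CLI instead of copying it from the Portal. A minimal sketch, assuming the image is named <IMAGE-NAME> and lives in the same resource group:

$ az image show --resource-group <RESOURCE-GROUP-NAME> --name <IMAGE-NAME> --query id --output tsv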

The following packages were installed on the client-side server where the Python script for Batch would run:

sudo pip3.6 install tqdm==4.19.9
sudo pip3.6 install sentry-sdk==0.4.1
sudo pip3.6 install blobxfer==1.5.0
sudo pip3.6 install azure-cli==2.0.47
sudo pip3.6 install pandas==0.22.0

The resources used by Azure Batch were created in the following way:

1) A service principal account with contributor privileges was created using this command:

$ az ad sp create-for-rbac --name <SERVICE-PRINCIPAL-ACCOUNT>

2) The resource group, the Batch account, and the storage account associated with the Batch account were created in the following way:

$ az group create --name <RESOURCE-GROUP-NAME> --location eastus2
$ az storage account create --resource-group <RESOURCE-GROUP-NAME> --name <STORAGE-ACCOUNT-NAME> --location eastus2 --sku Standard_LRS
$ az batch account create --name <BATCH-ACCOUNT-NAME> --storage-account <STORAGE-ACCOUNT-NAME> --resource-group <RESOURCE-GROUP-NAME> --location eastus2
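
The Batch account endpoint and key expected by the client-side script below can then be looked up with the CLI as well. A minimal sketch; the --query path accountEndpoint is an assumption about the usual output shape, so verify it against your CLI version:

$ az batch account show --name <BATCH-ACCOUNT-NAME> --resource-group <RESOURCE-GROUP-NAME> --query accountEndpoint --output tsv
$ az batch account keys list --name <BATCH-ACCOUNT-NAME> --resource-group <RESOURCE-GROUP-NAME>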

The client-side Python script that initiates the upload and processing (Update 3):

import subprocess
import os
import time
import datetime
import tqdm
import pandas
import sys
import fitz
import parmap
import numpy as np
import sentry_sdk
import multiprocessing as mp


def batch_upload_local_to_azure_blob(azure_username,azure_password,azure_tenant,azure_storage_account,azure_storage_account_key,log_dir_path):
    try:
        subprocess.check_output(["az","login","--service-principal","--username",azure_username,"--password",azure_password,"--tenant",azure_tenant])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Invalid Azure Login Credentials")
        sys.exit("Invalid Azure Login Credentials")
    dir_flag=False
    while dir_flag==False:
        try:
            no_of_dir=input("Enter the number of directories to upload:")
            no_of_dir=int(no_of_dir)
            if no_of_dir<0:
                print("\nRetry:Enter an integer value")   
            else: 
                dir_flag=True
        except ValueError:
            print("\nRetry:Enter an integer value")
    dir_path_list=[]
    for dir in range(no_of_dir):
        path_exists=False
        while path_exists==False:
            dir_path=input("\nEnter the local absolute path of the directory no.{}:".format(dir+1))
            print("\n")
            dir_path=dir_path.replace('"',"")
            path_exists=os.path.isdir(dir_path)
            if path_exists==True:
                dir_path_list.append(dir_path)
            else:
                print("\nRetry:Enter a valid directory path")
    timestamp = time.time()
    timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
    input_azure_container="pdf-processing-input"+"-"+timestamp_humanreadable
    try:
        subprocess.check_output(["az","storage","container","create","--name",input_azure_container,"--account-name",azure_storage_account,"--auth-mode","login","--fail-on-exist"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Invalid Azure Storage Credentials.")
        sys.exit("Invalid Azure Storage Credentials.")
    log_file_path=os.path.join(log_dir_path,"upload-logs"+"-"+timestamp_humanreadable+".txt")
    dir_upload_success=[]
    dir_upload_failure=[]
    for dir in tqdm.tqdm(dir_path_list,desc="Uploading Directories"):
        try:
            subprocess.check_output(["blobxfer","upload","--remote-path",input_azure_container,"--storage-account",azure_storage_account,\
            "--enable-azure-storage-logger","--log-file",\
            log_file_path,"--storage-account-key",azure_storage_account_key,"--local-path",dir]) 
            dir_upload_success.append(dir)
        except subprocess.CalledProcessError:
            sentry_sdk.capture_message("Failed to upload directory: {}".format(dir))
            dir_upload_failure.append(dir)
    return(input_azure_container)

def query_azure_storage(azure_storage_container,azure_storage_account,azure_storage_account_key,blob_file_path):
    try:
        blob_list=subprocess.check_output(["az","storage","blob","list","--container-name",azure_storage_container,\
        "--account-key",azure_storage_account_key,"--account-name",azure_storage_account,"--auth-mode","login","--output","tsv"])
        blob_list=blob_list.decode("utf-8")
        with open(blob_file_path,"w") as f:
            f.write(blob_list)
        blob_df=pandas.read_csv(blob_file_path,sep="\t",header=None)
        blob_df=blob_df.iloc[:,3]
        blob_df=blob_df.to_frame(name="container_files")
        blob_df=blob_df.assign(container=azure_storage_container)
        return(blob_df)
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Invalid Azure Storage Credentials")
        sys.exit("Invalid Azure Storage Credentials.")

def analyze_files_for_tasks(data_split,azure_storage_container,azure_storage_account,azure_storage_account_key,download_folder):
    try:
        blob_df=data_split
        some_calculation_factor=2
        analyzed_azure_blob_df=pandas.DataFrame()
        analyzed_azure_blob_df=analyzed_azure_blob_df.assign(container="empty",container_files="empty",pages="empty",max_time="empty")
        for index,row in blob_df.iterrows():
            file_to_analyze=os.path.join(download_folder,row["container_files"])
            subprocess.check_output(["az","storage","blob","download","--container-name",azure_storage_container,"--file",file_to_analyze,"--name",row["container_files"],\
            "--account-name",azure_storage_account,"--auth-mode","key"])        #Why does login auth not work for this while we are multiprocessing
            doc=fitz.open(file_to_analyze)
            page_count=doc.pageCount
            analyzed_azure_blob_df=analyzed_azure_blob_df.append([{"container":azure_storage_container,"container_files":row["container_files"],"pages":page_count,"max_time":some_calculation_factor*page_count}])
            doc.close()
            os.remove(file_to_analyze)
        return(analyzed_azure_blob_df)
    except Exception as e:
        sentry_sdk.capture_exception(e)


def estimate_task_completion_time(azure_storage_container,azure_storage_account,azure_storage_account_key,azure_blob_df,azure_blob_downloads_file_path):
    try: 
        cores=mp.cpu_count()                                           #Number of CPU cores on your system
        partitions = cores-2  
        timestamp = time.time()
        timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
        file_download_location=os.path.join(azure_blob_downloads_file_path,"Blob_Download"+"-"+timestamp_humanreadable)
        os.mkdir(file_download_location)
        data_split = np.array_split(azure_blob_df,indices_or_sections=partitions,axis=0)
        analyzed_azure_blob_df=pandas.concat(parmap.map(analyze_files_for_tasks,data_split,azure_storage_container,azure_storage_account,azure_storage_account_key,file_download_location,\
        pm_pbar=True,pm_processes=partitions))
        analyzed_azure_blob_df=analyzed_azure_blob_df.reset_index(drop=True)
        return(analyzed_azure_blob_df)
    except Exception as e:
        sentry_sdk.capture_exception(e)
        sys.exit("Unable to Estimate Job Completion Status")

def azure_batch_create_pool(azure_storage_container,azure_resource_group,azure_batch_account,azure_batch_account_endpoint,azure_batch_account_key,vm_image_name,no_nodes,vm_compute_size,analyzed_azure_blob_df):
    timestamp = time.time()
    timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
    pool_id="pdf-processing"+"-"+timestamp_humanreadable
    try:
        subprocess.check_output(["az","batch","account","login","--name", azure_batch_account,"--resource-group",azure_resource_group])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to log into the Batch account")
        sys.exit("Unable to log into the Batch account")
    #Pool autoscaling formula would go in here
    try:
        subprocess.check_output(["az","batch","pool","create","--account-endpoint",azure_batch_account_endpoint, \
        "--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--id",pool_id,\
        "--node-agent-sku-id","batch.node.ubuntu 18.04",\
        "--image",vm_image_name,"--target-low-priority-nodes",str(no_nodes),"--vm-size",vm_compute_size])
        return(pool_id)
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to create a Pool corresponding to Container:{}".format(azure_storage_container))
        sys.exit("Unable to create a Pool corresponding to Container:{}".format(azure_storage_container))

def azure_batch_create_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info):
    timestamp = time.time()
    timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
    job_id="pdf-processing-job"+"-"+timestamp_humanreadable
    try:
        subprocess.check_output(["az","batch","job","create","--account-endpoint",azure_batch_account_endpoint,"--account-key",\
        azure_batch_account_key,"--account-name",azure_batch_account,"--id",job_id,"--pool-id",pool_info])
        return(job_id)
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to create a Job on the Pool :{}".format(pool_info))
        sys.exit("Unable to create a Job on the Pool :{}".format(pool_info))

def azure_batch_create_task(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info,job_info,azure_storage_account,azure_storage_account_key,azure_storage_container,analyzed_azure_blob_df):
    print("\n")
    for i in tqdm.tqdm(range(180),desc="Waiting for the Pool to Warm-up"):
        time.sleep(1)
    successful_task_list=[]
    unsuccessful_task_list=[]
    input_azure_container=azure_storage_container 
    output_azure_container= "pdf-processing-output"+"-"+input_azure_container.split("-input-")[-1]
    try:
        subprocess.check_output(["az","storage","container","create","--name",output_azure_container,"--account-name",azure_storage_account,"--auth-mode","login","--fail-on-exist"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to create an output container")
        sys.exit("Unable to create an output container")
    print("\n")
    pbar = tqdm.tqdm(total=analyzed_azure_blob_df.shape[0],desc="Creating and distributing Tasks")
    for index,row in analyzed_azure_blob_df.iterrows():
        try:
            task_info="mytask-"+str(index)
            subprocess.check_output(["az","batch","task","create","--task-id",task_info,"--job-id",job_info,"--command-line",\
            "python3 /home/avadhut/pdf_processing.py {} {} {}".format(input_azure_container,output_azure_container,row["container_files"])])
            pbar.update(1)
        except subprocess.CalledProcessError:
            sentry_sdk.capture_message("unable to create the Task: mytask-{}".format(i))
            pbar.update(1)
    pbar.close()

def wait_for_tasks_to_complete(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info,task_file_path,analyzed_azure_blob_df):
        try:
            print(analyzed_azure_blob_df)
            nrows_tasks_df=analyzed_azure_blob_df.shape[0]
            print("\n")
            pbar=tqdm.tqdm(total=nrows_tasks_df,desc="Waiting for task to complete")
            for index,row in analyzed_azure_blob_df.iterrows():
                task_list=subprocess.check_output(["az","batch","task","list","--job-id",job_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,\
                "--output","tsv"])
                task_list=task_list.decode("utf-8")
                with open(task_file_path,"w") as f:
                    f.write(task_list)
                task_df=pandas.read_csv(task_file_path,sep="\t",header=None)
                task_df=task_df.iloc[:,21]
                active_task_list=[]
                for x in task_df:
                    if x =="active":
                        active_task_list.append(x)
                if len(active_task_list)>0:
                    time.sleep(row["max_time"])  #This time can be changed in accordance with the time taken to complete each task
                    pbar.update(1)
                    continue
                else:
                    pbar.close()
                    return("success")
            pbar.close()
            return("failure")
        except subprocess.CalledProcessError:
            sentry_sdk.capture_message("Error in retrieving task status")

def azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info):
    try:
        subprocess.check_output(["az","batch","job","delete","--job-id",job_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--yes"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to delete Job-{}".format(job_info))

def azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info):
    try:
        subprocess.check_output(["az","batch","pool","delete","--pool-id",pool_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--yes"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to delete Pool--{}".format(pool_info))

if __name__=="__main__":
    print("\n")
    print("-"*40+"Azure Batch processing POC"+"-"*40)
    print("\n")

    #Credentials and initializations
    sentry_sdk.init(<SENTRY-CREDENTIALS>) #Sign-up for a Sentry trail account
    azure_username=<AZURE-USERNAME>
    azure_password=<AZURE-PASSWORD>
    azure_tenant=<AZURE-TENANT>
    azure_resource_group=<RESOURCE-GROUP-NAME>
    azure_storage_account=<STORAGE-ACCOUNT-NAME>
    azure_storage_account_key=<STORAGE-KEY>
    azure_batch_account_endpoint=<BATCH-ENDPOINT>
    azure_batch_account_key=<BATCH-ACCOUNT-KEY>
    azure_batch_account=<BATCH-ACCOUNT-NAME>
    vm_image_name=<VM-IMAGE>
    vm_compute_size="Standard_A4_v2"
    no_nodes=2
    log_dir_path="/home/user/azure_batch_upload_logs/"
    azure_blob_downloads_file_path="/home/user/blob_downloads/"
    blob_file_path="/home/user/azure_batch_upload.tsv"
    task_file_path="/home/user/azure_task_list.tsv"


    input_azure_container=batch_upload_local_to_azure_blob(azure_username,azure_password,azure_tenant,azure_storage_account,azure_storage_account_key,log_dir_path)

    azure_blob_df=query_azure_storage(input_azure_container,azure_storage_account,azure_storage_account_key,blob_file_path)

    analyzed_azure_blob_df=estimate_task_completion_time(input_azure_container,azure_storage_account,azure_storage_account_key,azure_blob_df,azure_blob_downloads_file_path)

    pool_info=azure_batch_create_pool(input_azure_container,azure_resource_group,azure_batch_account,azure_batch_account_endpoint,azure_batch_account_key,vm_image_name,no_nodes,vm_compute_size,analyzed_azure_blob_df)

    job_info=azure_batch_create_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)

    azure_batch_create_task(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info,job_info,azure_storage_account,azure_storage_account_key,input_azure_container,analyzed_azure_blob_df)

    task_status=wait_for_tasks_to_complete(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info,task_file_path,analyzed_azure_blob_df)

    if task_status=="success":
        azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info)
        azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
        print("\n\n")
        sys.exit("Job Complete")
    else:
        azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info)
        azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
        print("\n\n")
        sys.exit("Job Unsuccessful")

Command used to create the zip file:

zip pdf_process_1.zip pdf_processing.py

The Python app that was packaged in the zip file and uploaded to Batch through the client-side script

(Update 3)

import os
import fitz
import subprocess
import argparse
import time
from tqdm import tqdm
import sentry_sdk
import sys
import datetime

def azure_active_directory_login(azure_username,azure_password,azure_tenant):
    try:
        azure_login_output=subprocess.check_output(["az","login","--service-principal","--username",azure_username,"--password",azure_password,"--tenant",azure_tenant])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Invalid Azure Login Credentials")
        sys.exit("Invalid Azure Login Credentials")

def download_from_azure_blob(azure_storage_account,azure_storage_account_key,input_azure_container,file_to_process,pdf_docs_path):
    file_to_download=os.path.join(input_azure_container,file_to_process)
    try:
        subprocess.check_output(["az","storage","blob","download","--container-name",input_azure_container,"--file",os.path.join(pdf_docs_path,file_to_process),"--name",file_to_process,"--account-key",azure_storage_account_key,\
        "--account-name",azure_storage_account,"--auth-mode","login"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("unable to download the pdf file")
        sys.exit("unable to download the pdf file")

def pdf_to_png(input_folder_path,output_folder_path):
    pdf_files=[x for x in os.listdir(input_folder_path) if x.endswith((".pdf",".PDF"))]
    pdf_files.sort()
    for pdf in tqdm(pdf_files,desc="pdf--->png"):
        doc=fitz.open(os.path.join(input_folder_path,pdf))
        page_count=doc.pageCount
        for f in range(page_count):
            page=doc.loadPage(f)
            pix = page.getPixmap()
            if pdf.endswith(".pdf"):
                png_filename=pdf.split(".pdf")[0]+"___"+"page---"+str(f)+".png"
                pix.writePNG(os.path.join(output_folder_path,png_filename))
            elif pdf.endswith(".PDF"):
                png_filename=pdf.split(".PDF")[0]+"___"+"page---"+str(f)+".png"
                pix.writePNG(os.path.join(output_folder_path,png_filename))


def upload_to_azure_blob(azure_storage_account,azure_storage_account_key,output_azure_container,png_docs_path):
    try:
        subprocess.check_output(["az","storage","blob","upload-batch","--destination",output_azure_container,"--source",png_docs_path,"--account-key",azure_storage_account_key,\
        "--account-name",azure_storage_account,"--auth-mode","login"])
    except subprocess.CalledProcessError:
        sentry_sdk.capture_message("Unable to upload file to the container")


if __name__=="__main__":
    #Credentials 
    sentry_sdk.init(<SENTRY-CREDENTIALS>)
    azure_username=<AZURE-USERNAME>
    azure_password=<AZURE-PASSWORD>
    azure_tenant=<AZURE-TENANT>
    azure_storage_account=<AZURE-STORAGE-NAME>
    azure_storage_account_key=<AZURE-STORAGE-KEY>
    try:
        parser = argparse.ArgumentParser()
        parser.add_argument("input_azure_container",type=str,help="Location to download files from")
        parser.add_argument("output_azure_container",type=str,help="Location to upload files to")
        parser.add_argument("file_to_process",type=str,help="file link in azure blob storage")
        args = parser.parse_args()
        timestamp = time.time()
        timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
        task_working_dir=os.getcwd()
        file_to_process=args.file_to_process
        input_azure_container=args.input_azure_container
        output_azure_container=args.output_azure_container
        pdf_docs_path=os.path.join(task_working_dir,"pdf_files"+"-"+timestamp_humanreadable)
        png_docs_path=os.path.join(task_working_dir,"png_files"+"-"+timestamp_humanreadable)
        os.mkdir(pdf_docs_path)
        os.mkdir(png_docs_path)
    except Exception as e:
        sentry_sdk.capture_exception(e)
    azure_active_directory_login(azure_username,azure_password,azure_tenant)
    download_from_azure_blob(azure_storage_account,azure_storage_account_key,input_azure_container,file_to_process,pdf_docs_path)
    pdf_to_png(pdf_docs_path,png_docs_path)
    upload_to_azure_blob(azure_storage_account,azure_storage_account_key,output_azure_container,png_docs_path)

Update 1: I have solved the error of the server nodes going into an unusable state. The way I solved this issue was:

1) I did not use the commands I mentioned above to set up Python 3.6 on Ubuntu, as Ubuntu 18.04 LTS comes with its own Python 3 environment. Initially I had googled "Install Python 3 on Ubuntu" and had gotten this Python 3.6 installation on Ubuntu link. I avoided this step completely during the server set-up. All I did was install these packages this time:

sudo apt-get install -y python3-pip
sudo -H pip3 install tqdm==4.19.9
sudo -H pip3 install sentry-sdk==0.4.1
sudo -H pip3 install blobxfer==1.5.0
sudo -H pip3 install pandas==0.22.0

The Azure CLI was installed on the machine using the commands in this link: Install Azure CLI with apt.
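
For completeness, at the time of writing that page documents a one-line installer roughly equivalent to the following (treat the exact URL as something to re-check against the linked docs):

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash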

2) Created a snapshot of the OS disk, then created an image from this snapshot, and finally referenced this image in the client-side script.

I am now faced with another issue where the stderr.txt files on the node tell me that:

  python3: can't open file '$AZ_BATCH_APP_PACKAGE_pdfprocessingapp/pdf_processing.py': [Errno 2] No such file or directory

Logging in to the server as the random user, I see that the directory _azbatch is created, but there are no contents inside it.

I know for certain that it is the command line in the azure_batch_create_task() function where things are going haywire, but I am not able to put my finger on it. I have done everything that this doc recommends: Install app packages to Azure Batch Compute Nodes. Please review my client-side Python script and let me know what I am doing wrong!

Edit 3: The problem looks very similar to the one described in this post: Unable to pass app path to Tasks

Update 2:

I was able to overcome the file/directory-not-found error using a dirty hack which I am not particularly fond of: I placed the Python app in the home directory of the user that was used to create the VM, and all the directories required for processing were created in the working directory of the task.

I would still like to know how to run the workflow by deploying it to the nodes using the application package approach.

Update 3

I have updated the client-side code and the Python app to reflect the latest changes. The significant parts remain the same.

I will comment on the points that @fparks has raised.

The original Python app that I intend to use in Azure Batch contains many modules, some config files, and a fairly lengthy requirements.txt file for Python packages; Azure also recommends using a custom image in such cases. Downloading the Python modules per task is also a bit irrational in my case, as one task equals one multipage PDF and my expected workload is 25k multipage PDFs. I used the CLI because the docs for the Python SDK were sparse and hard to follow. The nodes going into an unusable state has been solved. I do agree with you on the blobxfer error.

Solution

Answers and a few observations:

  1. It is unclear to me why you need a custom image. You can use a platform image, i.e., Canonical, UbuntuServer, 18.04-LTS, and then just install what you need as part of the start task. Python 3.6 can simply be installed via apt in 18.04. You may be prematurely optimizing your workflow by opting for a custom image when in fact using a platform image + start task may be faster and more stable (see the first sketch after this list).
  2. Your script is in Python, yet you are calling out to the Azure CLI. You may want to consider directly using the Azure Batch Python SDK instead (samples).
  3. When nodes go unusable, you should first examine the node for errors. You should see if the ComputeNodeError field is populated. Additionally, you can try to fetch stdout.txt and stderr.txt files from the startup directory to diagnose what's going on. You can do both of these actions in the Azure Portal or via Batch Explorer. If that doesn't work, you can fetch the compute node service logs and file a support request. However, typically unusable means that your custom image was provisioned incorrectly, you have a virtual network with an NSG misconfigured, or you have an application package that is incorrect.
  4. Your application package consists of a single python file; instead use a resource file. Simply upload the script to an Azure Storage blob and reference it in your task as a resource file using a SAS URL. See the --resource-files argument in az batch task create if using the CLI. Your command to invoke would then simply be python3 pdf_processing.py (assuming you keep the resource file downloading to the task working directory); see the second sketch after this list.
  5. If you insist on using an application package, consider using a task application package instead. This will decouple your node startup issues potentially originating from bad application packages to debugging task executions instead.
  6. The blobxfer error is pretty clear. Your locale is not set properly. The easy way to fix this is to set the environment variables for the task. See the --environment-settings argument if using the CLI and set two environment variables LC_ALL=C.UTF-8 and LANG=C.UTF-8 as part of your task.
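
To make points 1, 4 and 6 concrete, here is a rough sketch of creating the pool from a platform image with a start task that installs the dependencies. The package list and <POOL-ID> are placeholders taken from the question, and the --start-task-* flags assume a CLI version that supports them; note that installing packages via apt normally requires the start task to run with an elevated/admin identity, which may need a JSON pool specification instead of plain flags:

# Sketch: pool from a platform image, dependencies installed by a start task
az batch pool create \
    --account-endpoint <BATCH-ENDPOINT> --account-key <BATCH-ACCOUNT-KEY> --account-name <BATCH-ACCOUNT-NAME> \
    --id <POOL-ID> --vm-size Standard_A4_v2 --target-low-priority-nodes 2 \
    --image Canonical:UbuntuServer:18.04-LTS --node-agent-sku-id "batch.node.ubuntu 18.04" \
    --start-task-command-line "/bin/bash -c 'apt-get update && apt-get install -y python3-pip && pip3 install pymupdf==1.13.20 tqdm==4.19.9 sentry-sdk==0.4.1 blobxfer==1.5.0'" \
    --start-task-wait-for-success

And a sketch of a single task created with a resource file and the locale variables instead of an application package. The SAS URL is a placeholder you would generate yourself (for example with az storage blob generate-sas); the filename=url form for --resource-files and the key=value form for --environment-settings follow the CLI help, so double-check them against your installed version:

# Sketch: task that downloads pdf_processing.py as a resource file and sets the locale
az batch task create --job-id <JOB-ID> --task-id mytask-0 \
    --account-endpoint <BATCH-ENDPOINT> --account-key <BATCH-ACCOUNT-KEY> --account-name <BATCH-ACCOUNT-NAME> \
    --resource-files "pdf_processing.py=https://<STORAGE-ACCOUNT-NAME>.blob.core.windows.net/<SCRIPT-CONTAINER>/pdf_processing.py?<SAS-TOKEN>" \
    --environment-settings LC_ALL=C.UTF-8 LANG=C.UTF-8 \
    --command-line "python3 pdf_processing.py <INPUT-CONTAINER> <OUTPUT-CONTAINER> <FILE-TO-PROCESS>"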
