Copy data from multiple csv files into one csv file

Problem description
I have multiple csv files in my Azure Blob Storage which I wish to append into one csv file, also stored in Azure Blob Storage, using an Azure Data Factory pipeline. The problem is that not all the columns of the source files are present in the sink file (and vice versa), and the source files are not all identical. I just want to map the columns I need from the source files to the columns in the sink file, and the copy activity in Data Factory does not allow me to do so.
As @LeonYue said, Azure Data Factory does not support this today. However, in my experience, as a workaround you can write a Python script using pandas to do it, and run it as a WebJob of Azure App Service or on an Azure VM to get fast transfers between Azure Storage and other Azure services.

The steps of the workaround are as follows.
Maybe these csv files are all in one container of Azure Blob Storage, so you need to list them in the container via list_blob_names and generate their URLs with a SAS token for the pandas read_csv function, with code as below (this uses the legacy azure-storage-blob SDK, which exposes BaseBlobService).

```python
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import ContainerPermissions
from datetime import datetime, timedelta

account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<your container name>'

service = BaseBlobService(account_name=account_name, account_key=account_key)
# Generate a read-only SAS token for the container, valid for one hour.
token = service.generate_container_shared_access_signature(
    container_name,
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1),
)
blob_names = service.list_blob_names(container_name)
blob_urls_with_token = (
    f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}"
    for blob_name in blob_names
)
# print(list(blob_urls_with_token))
```
Then directly read each csv file with the read_csv function to get a pandas dataframe. Collecting the dataframes in a list keeps every file, rather than overwriting the same variable on each iteration.

```python
import pandas as pd

# One dataframe per blob, read straight from the SAS URLs generated above.
dfs = [pd.read_csv(blob_url_with_token) for blob_url_with_token in blob_urls_with_token]
```
You can then manipulate these dataframes with pandas however you need, and write the result back to Azure Blob Storage as a single csv file using the Azure Storage SDK for Python.
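As a minimal sketch of that last step, the column mapping the question asks for can be done by reindexing each dataframe to the sink file's column list before concatenating; columns a source file lacks simply become empty. The two sample dataframes and the sink_columns list below are hypothetical placeholders standing in for the dataframes read from blob storage.

```python
import pandas as pd

# Hypothetical dataframes standing in for the ones read from blob storage;
# each source file may have a different set of columns.
df1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "extra": [9, 9]})
df2 = pd.DataFrame({"id": [3], "city": ["x"], "name": ["c"]})

# Keep only the columns the sink file needs, in the sink's order;
# a column missing from a source file becomes NaN in the result.
sink_columns = ["id", "name"]
merged = pd.concat(
    [df.reindex(columns=sink_columns) for df in (df1, df2)],
    ignore_index=True,
)

# Serialize to csv text, ready to upload as a single blob.
csv_text = merged.to_csv(index=False)
```

With the same legacy SDK as in the first step, a call like BlockBlobService.create_blob_from_text(container_name, 'merged.csv', csv_text) can then upload the combined file (the blob name 'merged.csv' is an assumption for illustration).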
Hope it helps.