Add a partition to a Glue table via API on AWS?
Question
I have an S3 bucket that is constantly being filled with new data, and I am using Athena and Glue to query it. The problem is that if Glue doesn't know a new partition has been created, it doesn't know it needs to search there. Making an API call to run the Glue crawler every time I need a new partition is too expensive, so the best solution seems to be to tell Glue directly that a new partition has been added, i.e. to create the new partition in the table's properties. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
Recommended answer
You may want to use the batch_create_partition() Glue API to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
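For the case in the question, where one new partition shows up at a time, a single-partition call may be enough. The sketch below is only an illustration, not part of the original script: `register_partition` is a hypothetical helper name, `glue_client` is assumed to be a boto3 Glue client (for example `boto3.client('glue')`), and the partition path is assumed to be the partition values joined with `/` under the table location.

```python
def register_partition(glue_client, database, table, values):
    """Register one partition, reusing the table's storage settings.

    Hypothetical helper: glue_client is assumed to be a boto3 Glue client.
    """
    table_info = glue_client.get_table(DatabaseName=database, Name=table)['Table']
    table_sd = table_info['StorageDescriptor']

    # Assumption: data lives under <table location>/<value>/<value>/.../
    location = table_sd['Location'].rstrip('/') + '/' + '/'.join(values) + '/'

    glue_client.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput={
            'Values': values,
            'StorageDescriptor': {
                'Location': location,
                'InputFormat': table_sd['InputFormat'],
                'OutputFormat': table_sd['OutputFormat'],
                'SerdeInfo': table_sd['SerdeInfo'],
            },
        },
    )
    return location

# Example call (needs AWS credentials, so it is left as a comment):
# register_partition(boto3.client('glue'), 'my_db', 'my_table', ['2020', '01', '02', '03'])
```

The database, table, and partition values in the commented call are placeholders.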
I had a similar use case, for which I wrote a Python script that does the following:
Step 1 - Fetch the table information and parse from it the details needed to register partitions.
import logging
import sys

import boto3

logger = logging.getLogger(__name__)
# Assumed setup (not shown in the original script): a boto3 Glue client.
# l_catalog_id, l_database and l_table are expected to be set by the caller.
l_client = boto3.client('glue')

# Fetching table information from glue catalog
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
    response = l_client.get_table(
        CatalogId=l_catalog_id,
        DatabaseName=l_database,
        Name=l_table
    )
except Exception as error:
    logger.error("Exception while fetching table info for {}.{} - {}"
                 .format(l_database, l_table, error))
    sys.exit(-1)

# Parsing table info required to create partitions from table
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
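Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. All dictionaries have the same structure, but their partition-specific values (year, month, day, hour) differ.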
from datetime import datetime, timedelta


def generate_partition_input_list(start_date, num_of_days, table_location,
                                  input_format, output_format, serde_info):
    input_list = []  # Initializing empty list
    today = datetime.utcnow().date()
    if start_date > today:  # To handle scenarios if any future partitions are created manually
        start_date = today
    end_date = today + timedelta(days=num_of_days)  # End date up to which partitions need to be created
    logger.info("Partitions to be created from {} to {}".format(start_date, end_date))
    # date_range is a helper from the script; its definition is not part of this snippet
    for input_date in date_range(start_date, end_date):
        # Formatting partition values by padding required zeroes and converting into string
        year = str(input_date)[0:4].zfill(4)
        month = str(input_date)[5:7].zfill(2)
        day = str(input_date)[8:10].zfill(2)
        for hour in range(24):  # Generate partition input for each of the 24 hours of the day
            hour = '{:02d}'.format(hour)  # Padding zero to make sure that hour is in two digits
            # table_location is expected to end with a trailing slash
            part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
            input_dict = {
                'Values': [
                    year, month, day, hour
                ],
                'StorageDescriptor': {
                    'Location': part_location,
                    'InputFormat': input_format,
                    'OutputFormat': output_format,
                    'SerdeInfo': serde_info
                }
            }
            input_list.append(input_dict.copy())
    return input_list
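The function above relies on a `date_range` helper whose definition is not included in the answer. A minimal sketch of what such a helper could look like follows. It assumes the end date is included, which is a guess based on the log message; the original helper may behave differently.

```python
from datetime import date, timedelta


def date_range(start_date, end_date):
    """Yield each date from start_date through end_date.

    Assumption: end_date is inclusive; the original script's helper may differ.
    """
    for offset in range((end_date - start_date).days + 1):
        yield start_date + timedelta(days=offset)


# Small demonstration across a month boundary
days = list(date_range(date(2020, 1, 30), date(2020, 2, 2)))
print([str(d) for d in days])
```

If the start date is after the end date, the generator simply yields nothing.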
Step 3 - Call the batch_create_partition() API.
# break_list_into_chunks is a helper from the script; its definition is not part of this snippet
for each_input in break_list_into_chunks(partition_input_list, 100):
    create_partition_response = client.batch_create_partition(
        CatalogId=catalog_id,
        DatabaseName=l_database,
        TableName=l_table,
        PartitionInputList=each_input
    )
A single API call can create at most 100 partitions, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over them.
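The chunking step can be sketched as below. This is one possible implementation of the `break_list_into_chunks` helper, which the answer uses but does not define. `create_partitions` is a hypothetical wrapper added here for illustration, and `glue_client` is assumed to be a boto3 Glue client. The wrapper also collects the `Errors` list that `batch_create_partition` returns for partitions it could not create.

```python
def break_list_into_chunks(items, chunk_size):
    """Yield successive slices of items, each at most chunk_size long."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def create_partitions(glue_client, database, table, partition_inputs, chunk_size=100):
    """Register partitions in batches and return any per-partition errors.

    Hypothetical wrapper: glue_client is assumed to be a boto3 Glue client.
    """
    errors = []
    for chunk in break_list_into_chunks(partition_inputs, chunk_size):
        response = glue_client.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=chunk,
        )
        # Partitions that failed (for example, ones that already exist)
        # are reported in the response's 'Errors' list
        errors.extend(response.get('Errors', []))
    return errors

# Example call (needs AWS credentials, so it is left as a comment):
# failed = create_partitions(boto3.client('glue'), 'my_db', 'my_table', partition_input_list)
```

Returning the errors instead of raising lets the caller decide whether an already-existing partition should count as a failure.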