How to optimize API calls for a large dataset using Python?


Problem Description

Objective: Send a list of addresses to an API and extract certain information (e.g. a flag which indicates whether an address is in a flood zone or not).

Solution: A working Python script for small data.

Problem: I want to optimize my current solution for large input. How can I improve the performance of the API calls? If I have 100,000 addresses, will my current solution fail? Will this slow down the HTTP calls? Will I get a request timeout? Does the API limit the number of API calls that can be made?

  • Input: a list of addresses

Sample input

777 Brockton Avenue, Abington MA 2351

30 Memorial Drive, Avon MA 2322

My current solution works well for a small dataset.

import json

import pandas as pd
import requests as req
from arcgis.geocoding import geocode
from pandas import json_normalize

# Creating a function to get the lat & long of an address and then
# detecting the flood zone in FEMA's NFHL layer
def zonedetect(addrs):
    global geolocate
    geocode_result = geocode(address=addrs, as_featureset=True)
    # In ArcGIS geometries, x is the longitude and y is the latitude
    longitude = geocode_result.features[0].geometry.x
    latitude = geocode_result.features[0].geometry.y
    # The geometry parameter expects "x,y" (longitude first);
    # empty parameters from the original URL are omitted here
    url = ("https://hazards.fema.gov/gis/nfhl/rest/services/public/NFHL/MapServer/28/query"
           "?where=1%3D1&geometry=" + str(longitude) + "%2C" + str(latitude) +
           "&geometryType=esriGeometryPoint&inSR=4326"
           "&spatialRel=esriSpatialRelIntersects&outFields=*"
           "&returnGeometry=true&f=json")
    response = req.get(url)

    # Exception handling: only parse the body on success
    if response.status_code == 200:
        parsed_data = json.loads(response.text)
        formatted_data = json_normalize(parsed_data["features"])
        formatted_data["Address_1"] = addrs
        geolocate = pd.concat([geolocate, formatted_data], ignore_index=True)
    else:
        print("Request to {} failed".format(addrs))

# Reading every address from the existing dataframe
for i in range(len(df.index)):
    zonedetect(df["Address"][i])

Instead of using the for loop above, is there an alternative? Can I process this logic in a batch?

Recommended Answer

Sending 100,000 requests to the hazards.fema.gov server will definitely cause some slowdown on their side, but it will mostly impact your script, since you will need to wait for every single HTTP request to be queued and responded to in turn, which could take an extremely long time.

What would be better is to send one REST query for everything you need and then handle the logic afterwards. Looking at the REST API docs, the geometry URL parameter can accept a multipoint (geometryType=esriGeometryMultipoint). Here is an example of a multipoint:

{
  "points" : [[-97.06138,32.837],[-97.06133,32.836],[-97.06124,32.834],[-97.06127,32.832]],
  "spatialReference" : {"wkid" : 4326}
}

So what you can do is build an object to store all the points you want to query:

multipoint = {"points": [], "spatialReference": {"wkid": 4326}}

And as you loop, append each point to the multipoint list:

for i in range(len(df.index)):
    address = df["Address"][i]
    geocode_result = geocode(address=address, as_featureset=True)
    # geometry.x is the longitude, geometry.y the latitude
    longitude = geocode_result.features[0].geometry.x
    latitude = geocode_result.features[0].geometry.y
    # Esri points are [x, y], i.e. [longitude, latitude]
    multipoint["points"].append([longitude, latitude])

Then you can set the multipoint as the geometry in your query (with geometryType=esriGeometryMultipoint), which results in just one API request instead of one per point.
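Putting this together, a minimal sketch of the batched request might look like the following. The endpoint and parameter names come from the URL in the question's script; the helper function names are my own, and the request itself has not been run against the live service:

```python
import json

import requests

# NFHL flood-hazard-zone layer used in the question's script (layer 28)
NFHL_URL = ("https://hazards.fema.gov/gis/nfhl/rest/services"
            "/public/NFHL/MapServer/28/query")

def build_multipoint(points):
    """Wrap a list of [longitude, latitude] pairs in the Esri multipoint format."""
    return {"points": points, "spatialReference": {"wkid": 4326}}

def build_query_params(points):
    """Build the query-string parameters for a single batched request."""
    return {
        "where": "1=1",
        "geometry": json.dumps(build_multipoint(points)),
        "geometryType": "esriGeometryMultipoint",
        "inSR": 4326,
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "*",
        "returnGeometry": "false",
        "f": "json",
    }

def query_flood_zones(points):
    """Send one request covering every point (untested against the live API)."""
    response = requests.get(NFHL_URL, params=build_query_params(points), timeout=60)
    response.raise_for_status()
    return response.json()["features"]
```

With `multipoint["points"]` filled in by the loop above, a single `query_flood_zones(multipoint["points"])` call replaces the per-address `req.get`. Note that the query string grows with the number of points, so for 100,000 addresses you would still want to split the points into chunks (or send the parameters in a POST body) rather than issue one gigantic GET.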

