create loop to extract urls to json and csv


Question

I set up a loop to scrape 37900 records. Due to the way the url/server is set up, there's a limit of 200 records displayed per url. Each url ends with 'skip=200', or a multiple of 200, to move on to the next url page where the next 200 records are displayed. Eventually I want to loop through all urls and append them as a table. Related post: unable to loop the last url with paging limits

I created the loops shown below - one to create urls with skip= incremented by 200, another to get the response for each of these urls, and then another to read the json and append it to a single dataframe.

I'm not sure what's missing in my second loop - so far it only produces json for the first URL page but not the subsequent pages. I have the feeling that the url jsons are not appended to the list jsnlist = [], which prevents looping through and appending the jsons to csv. Any suggestions on modifying the loops and improving this code are appreciated!

import pandas as pd
import requests
import json

records = range(37900)
skip = records[0::200]

Page = []
for i in skip:
    endpoint = "https://~/Projects?&$skip={}".format(i)
    Page.append(endpoint)

jsnlist = []
for j in Page:
    response = session.get(j) #session here refers to requests.Session() I had to set up to authenticate my access to these urls
    responsejs = response.json()
    responsejsval = responsejs['value'] #I only want to extract header called 'value' in each json
    with open('response2jsval.json', 'w') as outfile:
        json.dump(jsnlist, outfile)

concat = pd.DataFrame()
for k in jsnlist:
        df = pd.DataFrame(k) #list to df
        concat = concat.append(df, ignore_index = True)
        print(concat)


Answer

I don't have anything to test against here.

I think you massively over-complicated this. You've since edited the question but there's a couple of points to make:

  1. You define jsnlist = [] but never use it. Why?
  2. You called your own object json (now gone, but I'm not sure whether you understand why). Calling your own object json will just supersede the actual module, and the whole code will grind to a halt before you even get into a loop
  3. There is no reason at all to save this data to disk before trying to create a dataframe
  4. Opening the .json file in write mode ('w') will wipe all existing data on each iteration of your loop
  5. Appending JSON to a file will not give a valid format to be parsed when read back in. At best, it might be JSONLines
  6. Appending DataFrames in a loop has terrible complexity because it requires copying of the original data each time.
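The last point is worth spelling out: gather all the records into a plain Python list first and build the frame once at the end, rather than copying the whole accumulated DataFrame on every iteration. A minimal sketch with made-up records standing in for each page's data:

```python
import pandas as pd

# Simulated pages of records, standing in for each responsejs['value']
pages = [
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"id": 3, "name": "c"}],
]

# Bad: concat = concat.append(df) inside the loop re-copies all prior
# rows on every iteration (quadratic cost over 37900 records).
# Good: flatten into one list, then build the DataFrame once.
rows = [record for page in pages for record in page]
df = pd.DataFrame(rows)
print(df)
```

The same idea works with `pd.concat` on a list of per-page DataFrames if the pages need individual preprocessing.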

Your approach would be something like this:

import pandas as pd
import requests
import json

records = range(37900)
skip = records[0::200]

Page = []
for i in skip:
    endpoint = "https://~/Projects?&$skip={}".format(i)
    Page.append(endpoint)

jsnlist = []
for j in Page:
    response = session.get(j) #session here refers to requests.Session() I had to set up to authenticate my access to these urls
    responsejs = response.json()
    responsejsval = responsejs['value'] #I only want to extract header called 'value' in each json
    jsnlist.append(responsejsval)

df = pd.DataFrame(jsnlist)

df = pd.DataFrame(jsnlist) might take some work, but you'll need to show what we're up against. I'd need to see responsejs['value'] to answer fully.
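Since the question title also asks for json and csv output, here is a hedged sketch of the final step, assuming each page's responsejs['value'] is a list of record dicts (typical of OData-style endpoints that use $skip). The field names and the output filenames projects.json / projects.csv are placeholders:

```python
import json
import pandas as pd

# Stand-in for jsnlist after the request loop: one list of dicts per page.
jsnlist = [
    [{"id": 1, "title": "x"}, {"id": 2, "title": "y"}],  # page 1 (dummy data)
    [{"id": 3, "title": "z"}],                           # page 2 (dummy data)
]

# Flatten all pages, build the frame once, then write both outputs
# a single time at the end (never inside the loop, never in 'w' mode
# per iteration).
flat = [record for page in jsnlist for record in page]
df = pd.DataFrame(flat)

with open("projects.json", "w") as f:  # one valid JSON document
    json.dump(flat, f)
df.to_csv("projects.csv", index=False)
```

If 'value' holds nested structures rather than flat dicts, `pd.json_normalize(flat)` may be a better fit than the plain constructor.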
