How to automate web scraping with a for loop


Problem Description


I have a df with two columns: id and url. id contains project ids, and url contains website links that I would like to use to scrape the ids of the parent projects. Here is a sample of the df that I have:


df <- structure(list(id = c("P173165", "P175875", "P175841", "P175730"
), url = c("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en", 
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en", 
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en", 
"https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"))

> df
        id                                                                                 url
1: P173165 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en
2: P175875 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en
3: P175841 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en
4: P175730 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en


@Sirius suggested that I can scrape parent project ids by using the following code:

library(jsonlite)

#let's do an example for row 1

json_data <- fromJSON("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en")
json_data$projects[["P173165"]]$parentprojid


As you see, I input the url from the first row, and then the id from the first row. This code outputs the parent project id:

[1] "P147665"


I want to write code that would automate this process and create a variable containing the parent projects' ids. This is what I want to achieve:

        id                                                                                 url par_proj_id
1: P173165 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P173165&apilang=en     P147665
2: P175875 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175875&apilang=en     P173883
3: P175841 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175841&apilang=en     P170267
4: P175730 https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P175730&apilang=en     P173799


I guess I should be using a for loop here, but I'm not sure. Any ideas? I'd appreciate any help.

Recommended Answer


This is pretty simple, but I'd go with async so you don't have to wait for each one.


library(data.table)
library(jsonlite)

ids <- c("P173165", "P175875", "P175841", "P175730")

df <- data.table(
    id = ids,
    url = sprintf(
        "https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=%s&apilang=en",
        ids
    )
)

# install the async package from GitHub (it is not on CRAN)
library(remotes)
remotes::install_github("r-lib/async")
library(async)

# one deferred HTTP request per url: fetch, decode, parse, extract the parent id
async_get <- async(function(url) {
    http_get(url)$
        then(function(x) { rawToChar(x$content) })$
        then(function(x) { fromJSON(x) })$
        then(function(x) { x$projects[[1]]$parentprojid })
})

# .limit = 5 caps concurrency so we don't bombard the site
parent.ids <- synchronise(async_map(df$url, async_get, .limit = 5))

df$par_proj_id <- parent.ids


See the r-lib/async package documentation for more info on async.
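Since the question asked specifically about a for loop, here is a minimal synchronous sketch of the same lookup, assuming the API response shape shown above (one project per response, keyed under `projects`). The helper name `get_parent_id` is just for illustration; this version is slower than the async one because each request blocks until the previous one finishes:

```r
library(data.table)
library(jsonlite)

ids <- c("P173165", "P175875")

df <- data.table(
    id = ids,
    url = sprintf(
        "https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=%s&apilang=en",
        ids
    )
)

# Fetch one url and pull the parent project id out of the JSON response.
get_parent_id <- function(url) {
    json_data <- fromJSON(url)
    json_data$projects[[1]]$parentprojid
}

# Plain loop version: one blocking request per row.
# vapply keeps the result a character vector of the same length as df$url.
df$par_proj_id <- vapply(df$url, get_parent_id, character(1), USE.NAMES = FALSE)
```

A `for (i in seq_len(nrow(df)))` loop filling a pre-allocated vector would work just as well; `vapply` is simply the idiomatic way to express "apply this function to each url and collect the results".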
