Gathering data using R - multiple urls

Views: 130 Published: 2020/5/4 5:29:34 Tags: r xml loops web-scraping


Question

I have a data frame with several columns and rows. Some cells contain information; others are filled with NA and should be replaced with data.

The rows represent specific instruments, and the columns contain various details of the instrument in a given row. The last column of the data frame holds a URL for each instrument, which will then be used to fetch the data for the empty columns:

 Issuer  NIN or ISIN           Type Nominal Value # of Bonds Issue Volume Start Date End Date
1 NBRK KZW1KD079112 discount notes            NA         NA           NA         NA       NA
2 NBRK KZW1KD079146 discount notes            NA         NA           NA         NA       NA
3 NBRK KZW1KD079153 discount notes            NA         NA           NA         NA       NA
4 NBRK KZW1KD089137 discount notes            NA         NA           NA         NA       NA

 URL
1 http://www.kase.kz/en/gsecs/show/NTK007_1911
2 http://www.kase.kz/en/gsecs/show/NTK007_1914
3 http://www.kase.kz/en/gsecs/show/NTK007_1915
4 http://www.kase.kz/en/gsecs/show/NTK008_1913

For example, with the following code I get the details for the first instrument, in the row NBRK KZW1KD079112:

sp = readHTMLTable(newd$URL[[1]])
sp[[4]]

which gives the following:

                                                V1                                                                  V2
1                                     Trading code                                                         NTK007_1911
2                               List of securities                                                            official
3                              System of quotation                                                               price
4                                Unit of quotation                                   nominal value percentage fraction
5                               Quotation currency                                                                 KZT
6                               Quotation accuracy                                                        4 characters
7                       Trade lists admission date                                                            04/21/17
8                               Trade opening date                                                            04/24/17
9                       Trade lists exclusion date                                                            04/28/17
10                                        Security                                                                <NA>
11                                     Bond's name short-term notes of the National Bank of the Republic of Kazakhstan
12                                            NSIN                                                        KZW1KD079112
13                   Currency of issue and service                                                                 KZT
14               Nominal value in issue's currency                                                              100.00
15                      Number of registered bonds                                                       1,929,319,196
16                     Number of bonds outstanding                                                       1,929,319,196
17                               Issue volume, KZT                                                     192,931,919,600
18 Settlement basis (days in month / days in year)                                                        actual / 365
19                       Date of circulation start                                                            04/21/17
20                          Circulation term, days                                                                   7
21              Register fixation date at maturity                                                            04/27/17
22                        Principal repayment date                                                            04/28/17
23                                    Paying agent                          Central securities depository JSC (Almaty)
24                                       Registrar                          Central securities depository JSC (Almaty)

From this, I will keep only:

14               Nominal value in issue's currency                                                              100.00
16                     Number of bonds outstanding                                                       1,929,319,196
17                               Issue volume, KZT                                                     192,931,919,600
19                       Date of circulation start                                                            04/21/17
22                        Principal repayment date                                                            04/28/17

I will then copy the needed data to the initial data frame and carry on with the next row ... The data frame consists of 100+ rows and will keep changing.

I would appreciate any help.

Update:

It looks like the data I need is not always in sp[[4]]. Sometimes it's in sp[[7]], and in the future it may be in a totally different table. Is there a way to search the scraped tables and identify the specific table that can then be used to collect the data? Currently the index is hard-coded:

sp = readHTMLTable(newd$URL[[1]])
sp[[4]]

Recommended answer

library(XML)
library(reshape2)
library(dplyr)

name = c(
"NBRK KZW1KD079112 discount notes",                                           
"NBRK KZW1KD079146 discount notes",                                        
"NBRK KZW1KD079153 discount notes",                                         
"NBRK KZW1KD089137 discount notes")                                           

URL = c(
"http://www.kase.kz/en/gsecs/show/NTK007_1911",
"http://www.kase.kz/en/gsecs/show/NTK007_1914",
"http://www.kase.kz/en/gsecs/show/NTK007_1915",
"http://www.kase.kz/en/gsecs/show/NTK008_1913")

# data
instruments <- data.frame(name, URL, stringsAsFactors = FALSE)

# define the columns wanted and the mapping to desired name
# extend to all wanted columns
wanted <- c("Nominal value in issue's currency" = "Nominal Value",
            "Number of bonds outstanding" = "# of Bonds Issue")

# function returns a data frame of wanted columns for given URL
getValues <- function (name, url) {
  # get the table and rename columns
  sp = readHTMLTable(url, stringsAsFactors = FALSE)
  df <- sp[[4]]
  names(df) <- c("full_name", "value")

  # filter and remap wanted columns
  result <- df[df$full_name %in% names(wanted),]
  result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]})

  # add the identifier to every row
  result$name <- name
  return (result[,c("name", "column_name", "value")])
}

# invoke function for each name/URL pair - returns list of data frames
columns <- apply(instruments[,c("name", "URL")], 1, function(x) {getValues(x[["name"]], x[["URL"]])})

# bind using dplyr::bind_rows to make a tall data frame
tall <- bind_rows(columns)

# make wide using dcast from reshape2 (value.var names the column holding the cell values)
wide <- dcast(tall, name ~ column_name, value.var = "value")

wide

#                               name # of Bonds Issue Nominal Value
# 1 NBRK KZW1KD079112 discount notes    1,929,319,196        100.00
# 2 NBRK KZW1KD079146 discount notes    1,575,000,000        100.00
# 3 NBRK KZW1KD079153 discount notes      701,390,693        100.00
# 4 NBRK KZW1KD089137 discount notes    1,380,368,000        100.00
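
The update in the question notes that the details are not always in sp[[4]]. One possible way to avoid the hard-coded index is to search the list returned by readHTMLTable for a table whose first column contains the wanted labels. The following is only a sketch: findDetailsTable is a hypothetical helper (not part of the XML package), it assumes the details table has at least two columns with the labels in the first one, and the toy sp list stands in for real scraped output.

```r
# Hypothetical helper: return the first table in `tables` whose first column
# contains every label in `labels`, or NULL if no table matches.
findDetailsTable <- function(tables, labels) {
  for (tbl in tables) {
    # skip entries that are not usable two-column tables (e.g. NULL)
    if (!is.data.frame(tbl) || ncol(tbl) < 2) next
    first_col <- trimws(as.character(tbl[[1]]))
    if (all(labels %in% first_col)) return(tbl)
  }
  NULL
}

# Toy stand-in (assumption) for the list returned by readHTMLTable(url)
sp <- list(
  NULL,
  data.frame(V1 = c("Menu", "News"), V2 = c("a", "b"),
             stringsAsFactors = FALSE),
  data.frame(V1 = c("Nominal value in issue's currency",
                    "Number of bonds outstanding"),
             V2 = c("100.00", "1,929,319,196"),
             stringsAsFactors = FALSE)
)

labels <- c("Nominal value in issue's currency", "Number of bonds outstanding")
details <- findDetailsTable(sp, labels)
```

Inside getValues, the line df <- sp[[4]] could then become df <- findDetailsTable(sp, names(wanted)), followed by a check for NULL so that a page without a matching table is skipped rather than causing an error.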

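
The scraped values arrive as character strings such as "1,929,319,196" and "04/21/17", so it may be useful to convert them before copying them back into the original data frame. The sketch below assumes that commas are thousands separators and that dates are month/day/two-digit year (which matches the sample, where a 7-day note starts on 04/21/17 and is repaid on 04/28/17). The helpers toNumber and toDate and the simplified column names are hypothetical, and the toy data frames stand in for the wide and instruments objects built above.

```r
# Hypothetical helpers: strip thousands separators / parse mm/dd/yy dates
toNumber <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))
toDate   <- function(x) as.Date(x, format = "%m/%d/%y")

# Toy stand-in (assumption) for the scraped result, using values from the question
wide <- data.frame(
  name = "NBRK KZW1KD079112 discount notes",
  nominal_value = "100.00",
  bonds_outstanding = "1,929,319,196",
  start_date = "04/21/17",
  end_date = "04/28/17",
  stringsAsFactors = FALSE
)

# Convert the character columns to numbers and dates
wide$nominal_value     <- toNumber(wide$nominal_value)
wide$bonds_outstanding <- toNumber(wide$bonds_outstanding)
wide$start_date        <- toDate(wide$start_date)
wide$end_date          <- toDate(wide$end_date)

# Toy stand-in (assumption) for the original data frame
instruments <- data.frame(
  name = c("NBRK KZW1KD079112 discount notes",
           "NBRK KZW1KD079146 discount notes"),
  stringsAsFactors = FALSE
)

# Left join: instruments without scraped data keep NA in the new columns
filled <- merge(instruments, wide, by = "name", all.x = TRUE)
```

Using merge with all.x = TRUE keeps every instrument row even when a page could not be scraped, which may suit a data frame that keeps changing.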
