Gathering data using R - multiple URLs
Question
I have a data frame with several columns and rows. Some cells contain information; others are filled with NA and should be replaced with data.
The rows represent specific instruments, and the columns contain various details of the instrument in that row. The last column of the data frame holds a URL for each instrument, which will be used to fetch the data for the empty columns:
  Issuer NIN or ISIN  Type           Nominal Value # of Bonds Issue Volume Start Date End Date
1 NBRK   KZW1KD079112 discount notes NA            NA         NA           NA         NA
2 NBRK   KZW1KD079146 discount notes NA            NA         NA           NA         NA
3 NBRK   KZW1KD079153 discount notes NA            NA         NA           NA         NA
4 NBRK   KZW1KD089137 discount notes NA            NA         NA           NA         NA
  URL
1 http://www.kase.kz/en/gsecs/show/NTK007_1911
2 http://www.kase.kz/en/gsecs/show/NTK007_1914
3 http://www.kase.kz/en/gsecs/show/NTK007_1915
4 http://www.kase.kz/en/gsecs/show/NTK008_1913
For example, with the following code I get the details for the instrument in the first row, NBRK KZW1KD079112:
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
which gives the following:
   V1 V2
1 Trading code NTK007_1911
2 List of securities official
3 System of quotation price
4 Unit of quotation nominal value percentage fraction
5 Quotation currency KZT
6 Quotation accuracy 4 characters
7 Trade lists admission date 04/21/17
8 Trade opening date 04/24/17
9 Trade lists exclusion date 04/28/17
10 Security <NA>
11 Bond's name short-term notes of the National Bank of the Republic of Kazakhstan
12 NSIN KZW1KD079112
13 Currency of issue and service KZT
14 Nominal value in issue's currency 100.00
15 Number of registered bonds 1,929,319,196
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
18 Settlement basis (days in month / days in year) actual / 365
19 Date of circulation start 04/21/17
20 Circulation term, days 7
21 Register fixation date at maturity 04/27/17
22 Principal repayment date 04/28/17
23 Paying agent Central securities depository JSC (Almaty)
24 Registrar Central securities depository JSC (Almaty)
From this, I will keep only:
14 Nominal value in issue's currency 100.00
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
19 Date of circulation start 04/21/17
22 Principal repayment date 04/28/17
I will then copy the needed data into the initial data frame and carry on with the next row. The data frame consists of 100+ rows and will keep changing.
I would appreciate any help.
Update:
It looks like the data I need are not always in sp[[4]]. Sometimes they are in sp[[7]], and in the future it may be a totally different table. Is there a way to search the scraped tables and identify the specific table that should be used to collect the data, rather than hard-coding the index as here?
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
Recommended answer
library(XML)
library(reshape2)
library(dplyr)
name <- c(
  "NBRK KZW1KD079112 discount notes",
  "NBRK KZW1KD079146 discount notes",
  "NBRK KZW1KD079153 discount notes",
  "NBRK KZW1KD089137 discount notes")
URL <- c(
  "http://www.kase.kz/en/gsecs/show/NTK007_1911",
  "http://www.kase.kz/en/gsecs/show/NTK007_1914",
  "http://www.kase.kz/en/gsecs/show/NTK007_1915",
  "http://www.kase.kz/en/gsecs/show/NTK008_1913")
# data
instruments <- data.frame(name, URL, stringsAsFactors = FALSE)
# define the columns wanted and the mapping to desired name
# extend to all wanted columns
wanted <- c("Nominal value in issue's currency" = "Nominal Value",
            "Number of bonds outstanding" = "# of Bonds Issue")
# function returns a data frame of wanted columns for given URL
getValues <- function(name, url) {
  # get the table and rename columns
  sp <- readHTMLTable(url, stringsAsFactors = FALSE)
  df <- sp[[4]]
  names(df) <- c("full_name", "value")
  # filter and remap wanted columns
  result <- df[df$full_name %in% names(wanted), ]
  result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]})
  # add the identifier to every row
  result$name <- name
  return(result[, c("name", "column_name", "value")])
}
# invoke function for each name/URL pair - returns list of data frames
columns <- apply(instruments[,c("name", "URL")], 1, function(x) {getValues(x[["name"]], x[["URL"]])})
# bind using dplyr::bind_rows to make a tall data frame
tall <- bind_rows(columns)
# make wide using dcast from reshape2; value.var names the column holding the cell values
wide <- dcast(tall, name ~ column_name, value.var = "value")
wide
# name # of Bonds Issue Nominal Value
# 1 NBRK KZW1KD079112 discount notes 1,929,319,196 100.00
# 2 NBRK KZW1KD079146 discount notes 1,575,000,000 100.00
# 3 NBRK KZW1KD079153 discount notes 701,390,693 100.00
# 4 NBRK KZW1KD089137 discount notes 1,380,368,000 100.00
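The code above hard-codes sp[[4]], while the update to the question notes that the table index can change from page to page. One possible way to handle that, sketched here under assumptions, is to pick the table by its content instead of its position. findTable is a hypothetical helper, not part of the original answer, and the tables list is an offline stand-in for what readHTMLTable might return (it is assumed that some entries can be NULL).

```r
# Hypothetical helper (not from the original answer): return the first
# table whose first column contains every one of the given labels.
findTable <- function(tables, labels) {
  for (tbl in tables) {
    # skip NULL entries and tables too narrow to hold label/value pairs
    if (is.null(tbl) || ncol(tbl) < 2) next
    if (all(labels %in% trimws(as.character(tbl[[1]])))) return(tbl)
  }
  NULL  # no table matched
}

# Offline stand-in for the list returned by readHTMLTable() (assumed shape)
tables <- list(
  NULL,
  data.frame(V1 = c("Trading code", "Quotation currency"),
             V2 = c("NTK007_1911", "KZT"),
             stringsAsFactors = FALSE),
  data.frame(V1 = c("Nominal value in issue's currency",
                    "Number of bonds outstanding"),
             V2 = c("100.00", "1,929,319,196"),
             stringsAsFactors = FALSE)
)

labels <- c("Nominal value in issue's currency", "Number of bonds outstanding")
found <- findTable(tables, labels)
found$V2[2]
# [1] "1,929,319,196"
```

Inside getValues, the line df <- sp[[4]] could then become df <- findTable(sp, names(wanted)), with a check for a NULL result when a page has no matching table.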
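The scraped values arrive as text ("1,929,319,196", "04/21/17"), so they may need converting before they are copied into the original data frame. A minimal sketch follows, assuming the thousands separator is a comma and the dates are month/day/two-digit year as in the sample output; toNumber and toDate are hypothetical helper names.

```r
# Hypothetical helpers: parse the text values scraped from the page.
# Assumes "," is the thousands separator and dates are mm/dd/yy.
toNumber <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))
toDate <- function(x) as.Date(x, format = "%m/%d/%y")

toNumber(c("100.00", "1,929,319,196"))
# [1]        100 1929319196
toDate("04/21/17")
# [1] "2017-04-21"
```

as.numeric is used rather than as.integer because a value such as the issue volume 192,931,919,600 does not fit in R's 32-bit integer type.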