Combine different tables in a list in R


Question

Update: the code below seems to work.

I'm not entirely sure how to phrase this question, so I apologise if it is worded badly. I tried searching for "combine different elements of a list using apply", but that didn't turn up anything that works.

Anyway, as the result of scraping a website, I have two vectors giving identifying information and a list that contains a number of different tables. A simplified version looks something like this:

respondents <- c("A", "B")
questions <- c("question1", "question2")

df1 <- data.frame(
   option = c("yes", "no"),
   percentage = c(70, 30), stringsAsFactors = FALSE)

df2 <- data.frame(
   option= c("today", "yesterday"),
   percentage =c(30, 70), stringsAsFactors = FALSE)

df3 <- data.frame(
   option = c("yes", "no"),
   percentage = c(60, 40), stringsAsFactors = FALSE)

df4 <- data.frame(
    option= c("today", "yesterday"),
    percentage =c(20, 80), stringsAsFactors = FALSE)

lst <- list(df1, df2, df3, df4)

The first two tables are the questions and responses from the first participant, and the last two tables are from the second participant. What I would like to do is create two tables, one per question, that contain the answers from both participants. So I would like something that looks like this:

question1 <- data.frame(
   option = c("yes", "no"),
   A = c(70, 30),
   B = c(60, 40), stringsAsFactors = FALSE)

question2 <- data.frame(
   option = c("today", "yesterday"),
   A = c(30, 70),
   B = c(20, 80), stringsAsFactors = FALSE)

In my case, I have responses to 122 questions from 51 participants, ordered so that tables 1-122 are from the first participant, the next 122 tables are from the second participant, and so on. Ultimately, I would like to have 122 tables (one per question), each containing 51 columns, one per participant. I am more or less at a loss as to how to do this, so I would appreciate any suggestions.

This should work now:

library("RCurl")
library("XML")

# Get the data
## Create URL address

mainURL <- 'http://www4.uwm.edu/FLL/linguistics/dialect/staticmaps/'
stateURL <- 'states.html'
url  <-  paste0(mainURL, stateURL)

## Download URL

tmp <- getURL(url)

## Parse
tmp  <-  htmlTreeParse(tmp, useInternalNodes = TRUE)

## Extract page addresses and save to subURL
subURL  <-  unlist(xpathSApply(tmp, '//a[@href]', xmlAttrs))


## Remove pages that aren't state's names
subURL  <- subURL[-(1:4)]


## Show first four states
head(subURL, 4)



#  Get questions 
## Select first state
suburl  <-  subURL[1]

## Paste it at the end of the main URL
url <- paste0(mainURL, suburl)


## Download URL
tmp  <- getURL(url)

## Read data from html 

tb <- readHTMLTable(tmp, stringsAsFactors = FALSE)


## Extract first column (the questions)
Questions  <- tb[[1]][,1]

## Remove empty strings
Questions  <- Questions[Questions!= '']


# Create objects to populate later

stateNames <- rep('', length(subURL))

## Populate stateNames

### Remove state_ from stateNames
stateNames <- gsub('state_','',subURL)

### Remove .html from stateNames (fixed = TRUE so the dot is matched literally)
stateNames <- gsub('.html','',stateNames, fixed = TRUE)

# Replace the pictures that represent IPA symbols with their file names

## Get url
url <- paste0(mainURL, subURL)
tmp <- getURL(url) 

## Replace .gif with _
tmp <- gsub(".gif>", '_', tmp)

## Replace "<img\\s+src=./images/" with _
tmp <- gsub("<img\\s+src=./images/", '_', tmp)


# Read in data

tb <- readHTMLTable(tmp, stringsAsFactors = FALSE)


## Subset 2nd and 4th columns and apply to every item on list
tb <-  lapply(tb, function(x) x[,c(2,4)])

## Remove parentheses and percent sign, then convert to numeric; apply to every item

tb <-  lapply(tb, function(x) {
  x [,2 ] = gsub('\\(','',x[,2] )
  x [,2 ] = gsub('%\\)','',x[,2])
  x [,2 ] = as.numeric(x[,2])
  x
}
)

## Assign column names to all dataframes
tb <- lapply(tb, setNames , nm = c("option", "percentage"))

## Remove the extra tables (the first of every 123)
tb1 <- tb[-seq(1, length(tb), by=123)]

## Function to merge a list of data frames by 'option'
f1 <- function(list1) {
  Reduce(function(...) merge(..., by = 'option', all = TRUE), list1)
}

## For each of the 122 questions, merge that question's table from every state
res1 <- lapply(1:122, function(i) f1(tb1[seq(i, length(tb1), by=122)]))

## Create names for the states
stateNames2 <- c("option", stateNames)

# Rename columns in the new dataframes
res2 <- lapply(res1, setNames , nm = stateNames2)

# Test to see whether it works
test <- res2[[122]]
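
The index arithmetic in the last few steps is easier to check on a tiny example. The sketch below uses made-up labels in place of the scraped tables (the names `tables`, `kept`, `by_question`, and the sizes are hypothetical): three pages, each with one extra leading table followed by two question tables. It is only meant to illustrate how `-seq(1, n, by = k + 1)` drops the extra tables and how `seq(i, n, by = k)` then collects question `i` from every page.

```r
# Hypothetical toy layout: 3 pages (A, B, C), each with 1 extra table
# followed by 2 question tables, stored page by page.
n_questions <- 2
per_page <- n_questions + 1
tables <- c("extraA", "A_q1", "A_q2",
            "extraB", "B_q1", "B_q2",
            "extraC", "C_q1", "C_q2")

# Drop the first table of every page (positions 1, 4, 7)
kept <- tables[-seq(1, length(tables), by = per_page)]

# For question i, take every n_questions-th element starting at i
by_question <- lapply(seq_len(n_questions), function(i) {
  kept[seq(i, length(kept), by = n_questions)]
})

print(by_question)
```

In the real data the same pattern is used with 123 tables per page and 122 questions.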


Recommended answer

Thanks to akrun (see comments), I got this to work. The full code is here:

library("RCurl")
library("XML")


# Get the data
## Create URL address



mainURL <- 'http://www4.uwm.edu/FLL/linguistics/dialect/staticmaps/'
stateURL <- 'states.html'
url  <-  paste0(mainURL, stateURL)
url

## Download URL

tmp <- getURL(url)

## Parse
tmp  <-  htmlTreeParse(tmp, useInternalNodes = TRUE)

## Extract page addresses and save to subURL
subURL  <-  unlist(xpathSApply(tmp, '//a[@href]', xmlAttrs))


## Remove pages that aren't state's names
subURL  <- subURL[-(1:4)]


## Show first four states
head(subURL, 4)



#  Get questions
## Select first state
suburl  <-  subURL[1]

## Paste it at the end of the main URL
url <- paste0(mainURL, suburl)


## Download URL
tmp  <- getURL(url)

## Read data from html 

tb <- readHTMLTable(tmp, stringsAsFactors = FALSE)

## Extract first column (the questions)
Questions  <- tb[[1]][,1]


##Remove empty strings
Questions  <- Questions[Questions!= '']

# Create objects to populate later



 survey <-  vector(length(subURL), mode = "list")
i <- 1
stateNames <- rep('', length(subURL))



## Populate stateNames

### Remove state_ from stateNames
stateNames <- gsub('state_','',subURL)


### Remove .html from stateNames (fixed = TRUE so the dot is matched literally)
stateNames <- gsub('.html','',stateNames, fixed = TRUE)



# Replace the pictures that represent IPA symbols with their file names

## Get url
url <- paste0(mainURL, subURL)
tmp <- getURL(url) 


## Replace .gif with _

tmp <- gsub(".gif>", '_', tmp)

## Replace "<img\\s+src=./images/" with _

tmp <- gsub("<img\\s+src=./images/", '_', tmp)


# Read in data

tb <- readHTMLTable(tmp, stringsAsFactors = FALSE)

#tb <- tb[-1]


## Subset 2nd and 4th columns and apply to every item on list
tb <-  lapply(tb, function(x) x[,c(2,4)])


## Remove parentheses and percent sign, then convert to numeric; apply to every item

tb <-  lapply(tb, function(x) {
    x [,2 ] = gsub('\\(','',x[,2] )
    x [,2 ] = gsub('%\\)','',x[,2])
    x [,2 ] = as.numeric(x[,2])
    x
}
)


## Assign column names to all dataframes

tb <- lapply(tb, setNames , nm = c("option", "percentage"))

## Remove unneeded dataframes in list (the first of every 123)

tb1 <- tb[-seq(1, length(tb), by=123)]


## Function to merge a list of data frames by 'option'

f1 <- function(list1) {
  Reduce(function(...) merge(..., by = 'option', all = TRUE), list1)
}

## For each of the 122 questions, merge that question's table from every state
res1 <- lapply(1:122, function(i) f1(tb1[seq(i, length(tb1), by=122)]))

## Create names for the states
stateNames2 <- c("Options", stateNames)

# Rename columns in the new dataframes
res2 <- lapply(res1, setNames , nm = stateNames2)

# Test to see whether it works
test <- res2[[1]]
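
For readers who want to try the idea without scraping anything, here is a minimal sketch of the same merge step applied to the simplified data from the question. It is a variation rather than the exact code above: each table's `percentage` column is renamed to the respondent before merging (the helper name `merge_by_option` and the variable `result` are made up for this example), which avoids duplicated column names when many tables are merged in a row.

```r
respondents <- c("A", "B")
questions <- c("question1", "question2")

df1 <- data.frame(option = c("yes", "no"),
                  percentage = c(70, 30), stringsAsFactors = FALSE)
df2 <- data.frame(option = c("today", "yesterday"),
                  percentage = c(30, 70), stringsAsFactors = FALSE)
df3 <- data.frame(option = c("yes", "no"),
                  percentage = c(60, 40), stringsAsFactors = FALSE)
df4 <- data.frame(option = c("today", "yesterday"),
                  percentage = c(20, 80), stringsAsFactors = FALSE)

lst <- list(df1, df2, df3, df4)

# Merge a list of data frames on 'option', keeping options that
# appear in only some of the tables (all = TRUE)
merge_by_option <- function(dfs) {
  Reduce(function(x, y) merge(x, y, by = "option", all = TRUE), dfs)
}

n_q <- length(questions)
result <- lapply(seq_len(n_q), function(i) {
  # Tables for question i: positions i, i + n_q, i + 2 * n_q, ...
  idx <- seq(i, length(lst), by = n_q)
  # Rename 'percentage' to the respondent's label before merging
  named <- Map(function(df, who) setNames(df, c("option", who)),
               lst[idx], respondents)
  merge_by_option(named)
})
names(result) <- questions

print(result$question1)
print(result$question2)
```

Note that `merge()` sorts rows by the key column by default, so the options may come back in a different order than in the input tables.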
