How can I use R (Rcurl/XML packages ?!) to scrape this webpage?

Question

I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:

I would like to go through all the "species pages" present in this link:

http://gtrnadb.ucsc.edu/

So for each of them I will go to:

  1. The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
  2. And from there, to the "secondary structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)

Inside that link I wish to scrape the data in the page so that I will have a long list containing this data (for example):

chr.trna3 (1-77)    Length: 77 bp
Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....

Where each line will have its own list (inside the list for each "trna", inside the list for each animal).
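
To make the target structure concrete, here is a hypothetical sketch of that result in R (the genome and tRNA names and the field labels are my own illustration, taken from the sample record above):

desired <- list(
  Aero_pern = list(                      # one element per animal
    chr.trna3 = list(                    # one element per tRNA
      ID   = "chr.trna3 (1-77)",
      LEN  = "77 bp",
      TYPE = "Ala",
      AC   = "CGC at 35-37 (35-37)",
      SEQ  = "GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA",
      STR  = ">>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<...."
    )
  )
)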

I remember coming across the RCurl and XML packages (in R), which can allow for such a task. But I don't know how to use them. So what I would love to have is: 1. Some suggestions on how to build such code. 2. Recommendations on how to learn the knowledge needed for performing such a task.

Thanks for your help,

Tal

Answer

Tal,

You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.
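
For reference, here is a minimal sketch of that simpler approach for a page that does have well-formed <table> markup (the URL is a placeholder; it will not work on these gtrnadb pages, which is exactly the problem):

library(XML)
tables <- readHTMLTable("http://example.com/page-with-real-tables.html")
str(tables, max.level = 1)   # one data.frame per <table> element on the page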

Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:

  1. Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
  2. Then take that list of URLs and scrape the data you are looking for, and then stick it into a data.frame.

So, here goes...

library(RCurl)

### 1) First task is to get all of the web links we will need ###
base_url <- "http://gtrnadb.ucsc.edu/"
base_html <- getURLContent(base_url)[[1]]
links <- strsplit(base_html, "a href=")[[1]]

# Keep only links that look like genome directories: a capitalized name
# with no "#" anchor in it; everything else maps to NA
get_data_url <- function(s) {
    u_split1 <- strsplit(s, "/")[[1]][1]
    u_split2 <- strsplit(u_split1, '\\"')[[1]][2]
    if (length(grep("[[:upper:]]", u_split2)) == 1 &&
        length(strsplit(u_split2, "#")[[1]]) < 2) {
        return(u_split2)
    } else {
        return(NA)
    }
}

# Extract only those elements that are relevant
genomes <- unlist(lapply(links, get_data_url))
genomes <- genomes[!is.na(genomes)]
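
# (Aside, not part of the original answer: if you would rather use a real
# parser than string splitting, htmlParse() in the XML package tolerates
# malformed HTML, so a sketch like this could replace the regex step; the
# XPath and the filtering pattern here are assumptions about the page)
library(XML)
doc <- htmlParse(base_html, asText = TRUE)
hrefs <- xpathSApply(doc, "//a/@href")
genomes_alt <- unique(sub("/.*$", "", hrefs[grepl("^[A-Z][^#]*/$", hrefs)]))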

### 2) Now, scrape the genome data from all of those URLs ###

# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data into a multi-dimensional list
parse_genomes <- function(g) {
    g_split1 <- strsplit(g, "\n")[[1]]
    g_split1 <- g_split1[2:5]
    # Pull all of the data and stick it in a list
    g_split2 <- strsplit(g_split1[1], "\t")[[1]]
    ID <- g_split2[1]                               # Sequence ID
    LEN <- strsplit(g_split2[2], ": ")[[1]][2]      # Length
    g_split3 <- strsplit(g_split1[2], "\t")[[1]]
    TYPE <- strsplit(g_split3[1], ": ")[[1]][2]     # Type
    AC <- strsplit(g_split3[2], ": ")[[1]][2]       # Anticodon
    SEQ <- strsplit(g_split1[3], ": ")[[1]][2]      # Sequence
    STR <- strsplit(g_split1[4], ": ")[[1]][2]      # Structure string
    return(c(ID, LEN, TYPE, AC, SEQ, STR))
}
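
# (Aside, a hypothetical sanity check of the parser: feed it a toy fragment
# shaped like the sample record in the question. The tab separators and the
# throwaway first line are assumptions about the real page layout.)
toy <- paste("header",
             "chr.trna3 (1-77)\tLength: 77 bp",
             "Type: Ala\tAnticodon: CGC at 35-37 (35-37)",
             "Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCC",
             "Str: >>>>>>>..>>>>....",
             sep = "\n")
parse_genomes(toy)   # returns the six fields: ID, LEN, TYPE, AC, SEQ, STR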

# Build one multi-dimensional list per genome, with the genome name
# appended to every record; you can then manipulate it as you like
get_structs <- function(u) {
    struct_url <- paste(base_url, u, "/", u, "-structs.html", sep = "")
    raw_data <- getURLContent(struct_url)
    s_split1 <- strsplit(raw_data, "<PRE>")[[1]]
    all_data <- s_split1[seq(3, length(s_split1))]
    data_list <- lapply(all_data, parse_genomes)
    # Tag each record with the genome it came from
    for (d in seq_along(data_list)) {
        data_list[[d]] <- append(data_list[[d]], u)
    }
    return(data_list)
}
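
# (Aside, not part of the original answer: on a full scrape of hundreds of
# genomes an occasional malformed page will throw an error; this sketch
# skips such pages instead of aborting the whole run)
safe_get_structs <- function(u) {
    tryCatch(get_structs(u),
             error = function(e) {
                 warning(sprintf("skipping %s: %s", u, conditionMessage(e)))
                 NULL
             })
}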

# Collect data, manipulate, and create the data frame (with slight cleaning)
genomes_list <- lapply(genomes[1:2], get_structs)   # Limit to the first two genomes (Bdist & Spurp); a full scrape will take a LONG time
genomes_rows <- unlist(genomes_list, recursive = FALSE)   # recursive=FALSE saves a lot of work; now it is a straightforward manipulation
genome_data <- t(sapply(genomes_rows, rbind))
colnames(genome_data) <- c("ID", "LEN", "TYPE", "AC", "SEQ", "STR", "NAME")
genome_data <- as.data.frame(genome_data)
genome_data <- subset(genome_data, ID != "</PRE>")   # Some malformed web pages produce bad rows, but we can remove them

head(genome_data)

The resulting data frame contains seven columns for each genome entry: ID, length, type, anticodon, sequence, structure string, and name. The name column contains the base genome, which was my best guess for organizing the data. Here is what it looks like:

head(genome_data)
                                   ID   LEN TYPE                           AC                                                                       SEQ
1     Scaffold17302.trna1 (1426-1498) 73 bp  Ala     AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2   Scaffold20851.trna5 (43038-43110) 73 bp  Ala   AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3   Scaffold20851.trna8 (45975-46047) 73 bp  Ala   AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4     Scaffold17302.trna2 (2514-2586) 73 bp  Ala     AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp  Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6     Scaffold17302.trna4 (6027-6099) 73 bp  Ala     AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
                                                                        STR  NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
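
As a quick example of downstream use (my addition, not part of the scrape itself), once genome_data exists you can, for instance, tabulate how many tRNAs of each isotype were found per genome:

with(genome_data, table(NAME, TYPE))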

I hope this helps, and thanks for the fun little Sunday afternoon R challenge!
