Scraping a wiki page for the "Periodic table" and all the links
Question
I wish to scrape the following wiki article: http://en.wikipedia.org/wiki/Periodic_table
The output of my R code should be a table with the following columns:
- The chemical element's short name (symbol)
- The chemical element's full name
- The URL of the chemical element's wiki page
(with one row for each chemical element, obviously)
I am trying to get to the values inside the page using the XML package, but I seem to be stuck at the very beginning, so I'd appreciate an example of how to do it (and/or links to relevant examples).
library(XML)
library(RCurl)  # getURLContent() is provided by RCurl, not XML
base_url <- "http://en.wikipedia.org/wiki/Periodic_table"
base_html <- getURLContent(base_url)[[1]]
parsed_html <- htmlTreeParse(base_html, useInternalNodes = TRUE)
xmlChildren(parsed_html)
getNodeSet(parsed_html, "//html", c(x = base_url))
[[1]]
attr(,"class")
[1] "XMLNodeSet"
Recommended answer
Tal -- I thought this was going to be easy. I was going to point you to readHTMLTable(), my favorite function in the XML package. Heck, its help page even shows an example of scraping a Wikipedia page!
But alas, this is not what you want:
library(XML)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
tables = readHTMLTable(url)
# ... look through the list to find the one you want...
table = tables[3]
table
$`NULL`
Group # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 Period <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 1 1H 2He <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 2 3Li 4Be 5B 6C 7N 8O 9F 10Ne <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 3 11Na 12Mg 13Al 14Si 15P 16S 17Cl 18Ar <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 4 19K 20Ca 21Sc 22Ti 23V 24Cr 25Mn 26Fe 27Co 28Ni 29Cu 30Zn 31Ga 32Ge 33As 34Se 35Br 36Kr
6 5 37Rb 38Sr 39Y 40Zr 41Nb 42Mo 43Tc 44Ru 45Rh 46Pd 47Ag 48Cd 49In 50Sn 51Sb 52Te 53I 54Xe
7 6 55Cs 56Ba * 72Hf 73Ta 74W 75Re 76Os 77Ir 78Pt 79Au 80Hg 81Tl 82Pb 83Bi 84Po 85At 86Rn
8 7 87Fr 88Ra ** 104Rf 105Db 106Sg 107Bh 108Hs 109Mt 110Ds 111Rg 112Cn 113Uut 114Uuq 115Uup 116Uuh 117Uus 118Uuo
9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 * Lanthanoids 57La 58Ce 59Pr 60Nd 61Pm 62Sm 63Eu 64Gd 65Tb 66Dy 67Ho 68Er 69Tm 70Yb 71Lu <NA> <NA>
11 ** Actinoids 89Ac 90Th 91Pa 92U 93Np 94Pu 95Am 96Cm 97Bk 98Cf 99Es 100Fm 101Md 102No 103Lr <NA> <NA>
The names are gone and the atomic number runs into the symbol.
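To make the second problem concrete, here is a minimal base-R sketch, using a few cell values copied from the output above, of how the fused atomic number and symbol could be pulled apart. It does nothing to recover the missing names, so it is only a partial salvage. The names `cells` and `split_cells` are made up for this illustration.

```r
# Hypothetical sample of cells as they appear in the readHTMLTable() output
cells <- c("1H", "2He", "11Na", "113Uut")

# The leading digits are the atomic number; whatever follows is the symbol
atomic_number <- as.integer(sub("^([0-9]+).*$", "\\1", cells))
symbol <- sub("^[0-9]+", "", cells)

split_cells <- data.frame(atomic_number, symbol, stringsAsFactors = FALSE)
print(split_cells)
```

Marker cells such as "*" or "* Lanthanoids" have no leading digits, so they would need to be filtered out before applying this to the whole table.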
So back to the drawing board...
My DOM walking-fu is not very strong, so this isn't pretty. It gets every link in a table cell, only keeps those with a "title" attribute (that's where the element's full name is; the symbol is the link text), and sticks what you want in a data.frame. It also picks up every other such link on the page, but we're lucky and the elements are the first 118 such links:
library(XML)
library(plyr)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
# don't forget to parse the HTML, doh!
doc = htmlParse(url)
# get every link in a table cell:
links = getNodeSet(doc, '//table/tr/td/a')
# make a data.frame for each node with non-blank text, link, and 'title' attribute:
df = ldply(links, function(x) {
  symbol = xmlValue(x)            # link text, e.g. "H"
  if (symbol == '') symbol = NULL
  name = xmlGetAttr(x, 'title')   # title attribute, e.g. "Hydrogen"
  link = xmlGetAttr(x, 'href')    # relative URL, e.g. "/wiki/Hydrogen"
  # nodes missing any of the three values yield NULL and add no row
  if (!is.null(symbol) && !is.null(name) && !is.null(link))
    data.frame(symbol, name, link)
} )
# only keep the actual elements -- we're lucky they're first!
df = head(df, 118)
head(df)
  symbol      name            link
1      H  Hydrogen  /wiki/Hydrogen
2     He    Helium    /wiki/Helium
3     Li   Lithium   /wiki/Lithium
4     Be Beryllium /wiki/Beryllium
5      B     Boron     /wiki/Boron
6      C    Carbon    /wiki/Carbon
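The links in that output are relative paths, while the question asked for the URL of each element's page. A minimal sketch of one way to turn them into full URLs follows; it assumes the site root is "http://en.wikipedia.org" (taken from the page address used above) and uses a small stand-in data frame rather than the real scrape. The `url` column and the `base` variable are names introduced here for illustration.

```r
# Hypothetical stand-in for the first rows of the scraped data frame
df <- data.frame(
  link = c("/wiki/Hydrogen", "/wiki/Helium", "/wiki/Lithium"),
  stringsAsFactors = FALSE
)

# Assumed site root; the scraped href values are relative to it
base <- "http://en.wikipedia.org"

# Prepend the root to every relative link
df$url <- paste0(base, df$link)
print(df$url)
```

If the data frame holds the links as factors, which older versions of R do by default, paste0() still works because it converts its arguments to character first.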