Scraping location data in rvest

Question

I'm trying to scrape latitude/longitude data from a list of URLs using rvest. Each page has an embedded Google map showing a specific location, but the URLs themselves don't show the path the API is taking.

Looking at the page source, I see that the part I'm after is here:

<script type="text/javascript" src="http://maps.google.com/maps/api/js?sensor=false">
</script>
<script type="text/javascript">
function initialize() {
var myLatlng = new google.maps.LatLng(43.805170,-70.722084);
var myOptions = {
  zoom: 16,
  center: myLatlng,
  mapTypeId: google.maps.MapTypeId.SATELLITE
}
var map = new google.maps.Map(document.getElementById("map_canvas"), myOptions);

var marker = new google.maps.Marker({
    position: myLatlng, 
    map: map,
    title:"F.E. Wood & Sons - Natural Energy"
});   

Now, if I can just get the line containing the LatLng(....) call, I can use string parsing to derive the latitude and longitude values for all of the URLs.
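
As a rough illustration of that string-parsing step, here is a minimal base R sketch. It assumes the LatLng line has already been obtained as a character string; `src_line` and `pattern` are illustrative names, and the sample line is copied from the page source above.

```r
# One line of page source, copied from the snippet above
src_line <- "var myLatlng = new google.maps.LatLng(43.805170,-70.722084);"

# Capture the two comma-separated numbers inside LatLng(...)
pattern <- "LatLng\\([[:space:]]*(-?[0-9.]+)[[:space:]]*,[[:space:]]*(-?[0-9.]+)"
m <- regmatches(src_line, regexec(pattern, src_line))[[1]]

# m[1] is the full match; m[2] and m[3] are the captured groups
lat <- as.numeric(m[2])
lng <- as.numeric(m[3])
```

Converting with as.numeric() at this point means the values can go straight into a data frame or a mapping function.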

I've written the following code to get my data:

require(rvest)
require(magrittr)
fetchLatLong<-function(url){
  url<-as.character(url)
  solNum<-html(url)%>%
    html_nodes("#map_canvas")%>%
    html_attr("script")
}

(The "map_canvas" selector was found using SelectorGadget; you can view the entire source here.)

I'm having the worst time getting this to read what I'm after. I've tried many nodes and combinations of nodes, to no avail. I've played around with phantom.js, but the problem is that I'm not after JS-rendered HTML content: I'm looking for the API query input, which is written directly into the page code (or at least appears to be, to my amateur eye).

Does anyone have any suggestions?

Recommended answer

This seems to work:

library(rvest)
library(magrittr)
library(stringr)

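# read_html() replaces html(), which newer rvest releases no longer provide
pg <- read_html("http://biomassmagazine.com/plants/view/2285")

pg %>% 
  html_nodes("div.pad20 > script") %>%   # script tags that are direct children of div.pad20
  extract2(2) %>%                         # second script: the inline initialize() block
  html_text() %>%                         # JavaScript source as a single string
  str_match_all("LatLng\\(([[:digit:]\\.\\-]+),([[:digit:]\\.\\-]+)") %>% 
  extract2(1) %>%                         # match matrix for that one string
  extract(2:3) -> lat_lng                 # elements 2 and 3 are the two captured groups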

lat_lng

## [1] "43.805170"  "-70.722084"
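
Since the question involves a whole list of URLs, the same idea could be wrapped in a function. The following is only a sketch under a few assumptions: `parse_lat_lng()` and `fetch_lat_lng()` are hypothetical helper names, the search covers every script node rather than the second one under div.pad20 (so it does not depend on script order), and each page is assumed to contain at most one relevant LatLng(...) call.

```r
# fetch_lat_lng() needs the rvest package; parse_lat_lng() is base R only.

# Pull the first LatLng(lat,lng) pair out of a block of script text.
# Returns c(lat = NA, lng = NA) when no pair is found.
parse_lat_lng <- function(script_text) {
  pattern <- "LatLng\\([[:space:]]*(-?[0-9.]+)[[:space:]]*,[[:space:]]*(-?[0-9.]+)"
  m <- regmatches(script_text, regexec(pattern, script_text))[[1]]
  if (length(m) < 3) {
    return(c(lat = NA_real_, lng = NA_real_))
  }
  c(lat = as.numeric(m[2]), lng = as.numeric(m[3]))
}

# Download one page and search the text of every <script> node
# (an assumption: the answer above targets one specific script instead).
fetch_lat_lng <- function(url) {
  scripts <- rvest::html_text(
    rvest::html_nodes(rvest::read_html(as.character(url)), "script")
  )
  parse_lat_lng(paste(scripts, collapse = "\n"))
}

# Hypothetical usage over a character vector `urls` (not run here):
# coords <- t(vapply(urls, fetch_lat_lng, numeric(2)))
```

Keeping the parsing separate from the download makes it possible to check the regular expression against a saved copy of the page source without touching the network, and returning NA for pages with no map lets a long run continue.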
