rvest: html_nodes returns an empty list and html_text an empty string on a website

Problem description

For this website, https://www.coinopsy.com/dead-coins/, I'm using R and the rvest package to scrape names, summaries and similar information to build my own table. I've done this successfully with other websites, but this one is odd.

I used SelectorGadget, which has served me well in previous work, to find the CSS selectors, but html_nodes and html_text return an empty character vector. I don't know whether that is because the website is structured in a completely different way.

Example of the markup:

<td class="all sorting_1"><a class="coin_name" href="007coin">007Coin</a></td>

<a class="coin_name" href="007coin">007Coin</a>

library(rvest)

url <- "https://www.coinopsy.com/dead-coins/"

webpage <- read_html(url)

Item_html <- html_nodes(webpage,'.coin_name')

Item <- html_text(Item_html)

> Item

character(0)

Can someone help me out with this issue?

Recommended answer

If you disable JavaScript in the browser, you will see that the content is not loaded. If you then inspect the HTML, you will see that the data is stored in a script tag, presumably loaded into the table when JavaScript runs in the browser. JavaScript does not run with the method you are using, because read_html only fetches the static HTML. You can extract the JavaScript array of arrays from the response HTML and then parse it into a data frame. I am new to R, so I am still looking into how to do that in this case. I include a full example in Python at the end and will update if my research yields something. Otherwise, you can use regular expressions to pull the contents out of the string returned in data.

library(rvest)
library(stringr)
library(magrittr)

url = 'https://www.coinopsy.com/dead-coins/'
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2]  # string representation of list of lists
#step to convert string to object
#step to convert object to dataframe


In Python, the ast library makes the conversion easy, and the result of the code below is the table you see on the page.

import requests
import re
import ast
import pandas as pd

r = requests.get('https://www.coinopsy.com/dead-coins/')
p = re.compile(r'var table_data = (.*?);')   #p1 = re.compile(r'(\[".*?"\])')
data = p.findall(r.text)[0]
listings = ast.literal_eval(data)
df = pd.DataFrame(listings)
print(df)
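
The same extraction can be tried offline on a small sample before requesting the site. The sketch below assumes the page embeds the data on a single line as `var table_data = [[...], ...];`, which is what the regex above implies. The sample HTML string and its two rows are made up for illustration, and the column names are borrowed from the header list in the R code further down; none of it is real site data.

```python
import ast
import re

# Hypothetical stand-in for the page source. The real page is assumed to embed
# the table as a JavaScript array of arrays assigned to `table_data`.
SAMPLE_HTML = """
<html><body>
<script>
var table_data = [['1', '007Coin', 'Abandoned, no activity', '2014', '2015', 'Unknown', '007coin'], ['2', 'ExampleCoin', 'Illustrative row', '2016', '2017', 'Unknown', 'examplecoin']];
</script>
</body></html>
"""

HEADERS = ["Column To Drop", "Name", "Summary", "Project Start Date",
           "Project End Date", "Founder", "urlId"]


def extract_table(html):
    """Return the rows of `table_data` as a list of dicts keyed by HEADERS."""
    # Non-greedy match up to the first semicolon; a semicolon inside a field
    # would cut the match short, so this is only a sketch.
    match = re.search(r"var table_data = (.*?);", html)
    if match is None:
        raise ValueError("table_data assignment not found in the HTML")
    # literal_eval parses the text as a Python literal without executing code.
    rows = ast.literal_eval(match.group(1))
    return [dict(zip(HEADERS, row)) for row in rows]


rows = extract_table(SAMPLE_HTML)
for row in rows:
    print(row["Name"], "|", row["Summary"])
```

Raising an error when the assignment is missing makes it obvious when the page layout has changed, rather than failing later with an index error.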


Currently I can't find an R library that does the conversion I mentioned. Below is an ugly way of combining the pieces, and it feels inefficient. I would welcome suggestions for improvement (though that may be a matter for Code Review later). I'm still looking at this, so I will update.

library(rvest)
library(stringr)
library(magrittr)

url = 'https://www.coinopsy.com/dead-coins/'
headers <- c("Column To Drop","Name","Summary","Project Start Date","Project End Date","Founder","urlId")
# https://www.coinopsy.com/dead-coins/bigone-token/  where bigone-token is urlId

r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2]

z <- substr(data, start = 2, stop = nchar(data)-1) %>% str_match_all(., "\\[(.*?)\\]")
z <- z[[1]][,2]

# Clean each row once, then bind all rows in a single step
# instead of growing the result with rbind inside a loop.
rows <- lapply(z, function(row) {
  fields <- strsplit(row, ",")[[1]][2:7]   # drop the first column
  trimws(sub("'(.*?)'", "\\1", fields))    # strip the surrounding quotes
})
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- headers[2:7]
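
One caveat with the R code above: it splits each row on every comma, so a summary that itself contains a comma is cut into extra fields and the later columns shift. A minimal Python sketch of the difference, using a made-up row rather than real site data, compares naive splitting with parsing the row as a literal:

```python
import ast

# Hypothetical row in the same shape as one inner array of `table_data`.
row_text = "'1', '007Coin', 'Abandoned, no activity', '2014', '2015', 'Unknown', '007coin'"

# Naive approach: split on every comma, as the R code does.
naive_fields = [field.strip().strip("'") for field in row_text.split(",")]

# Literal parsing: wrap the row in brackets so the quotes are honoured.
parsed_fields = ast.literal_eval("[" + row_text + "]")

print(len(naive_fields), naive_fields[2])    # 8 fields, summary cut at the comma
print(len(parsed_fields), parsed_fields[2])  # 7 fields, summary intact
```

If the real data turns out to contain no commas inside fields, the simpler split is enough; the literal parse is the safer default when that is unknown.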
