unexpected symbol in copying a cURL - R


Problem description

I am copying a cURL command from the Google Chrome console and pasting it into my R script between "". When I try to save it to an object, R says there is an unexpected symbol in it.

Can someone explain why this is and how to solve it?
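Though the question doesn't show the exact symbol, the usual culprit is quote characters inside the pasted command clashing with the surrounding `""` (Chrome's "Copy as cURL (cmd)" variant on Windows, for instance, emits double-quoted headers). A minimal sketch with a hypothetical URL:

```r
# If the pasted command itself contains double quotes, they terminate the R
# string early and the parser stops at an "unexpected symbol":
#   x <- "curl "http://example.com" -H "Accept: text/html""   # parse error

# Fix 1: delimit the R string with single quotes instead:
x1 <- 'curl "http://example.com" -H "Accept: text/html"'

# Fix 2: escape the embedded double quotes:
x2 <- "curl \"http://example.com\" -H \"Accept: text/html\""

identical(x1, x2)  # TRUE -- both hold the same characters
```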

httpbinrhcurl <- "curl 'http://www.domainia.nl/quarantaine/2018/12/15' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36' -H 'Origin: http://www.domainia.nl' -H 'Content-Type: application/x-www-form-urlencoded' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' -H 'Referer: http://www.domainia.nl/quarantaine/2018/12/15' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.9,nl;q=0.8' -H 'Cookie: ASP.NET_SessionId=1rq1dcm2rxrxhejcv2apj1nz; _ga=GA1.2.1720697664.1544465383; cookieconsent_dismissed=yes; _gid=GA1.2.1174161929.1544888026; _gat=1' --data '__EVENTTARGET=ctl00%24ContentPlaceHolder1%24gvDomain&__EVENTARGUMENT=Page%246&__VIEWSTATE=%2BfCpnCxB3CBeL3p0AJRxX709sVZd46FhL5m2WythvucpdaUpCnCyUxSa%2BEidpSc9wEzRF6wNxg8LpEQV8%2BzUZpzAPL8nX4hXXd282D77F%2BphaIBhktpf4j8Wj31S3LIC5QRM2V6lHEWuJEbAJbuk68wwOBwa2zOVseYwNs%2FQywJT2H5wWBPzxALrQo5Dui76GArI3RRCSyKUgrq97euwYy4zq5b1Y3NgFWi8nOyu4VWuih%2BRANxczyqirApMmInEpCVFHqbNYulM8iKpi5Ay1jU3k2fC8I87LhC8oFWOPglZXhMuMSxR66FM%2BMOk9T%2BUbSrESAFA9AIywFhk7gRy%2FGMTgmDA%2BdyCbmnnIzJbGy6sr02V0I08l8UxzxfiaRCTx1paJqN3Co7tgq4D%2FqPmkvzlU8w1sMEqpdDWrXnX2cDswYbcEmg01w3PfgD66f4yezNJb4PQdjURDLwYsb6BkPum9iwXzRMi9OHQoULpN%2BUGFZh2BqgJMcI6R9OLbTZXs%2B%2FiiIVwyO8zeYjw8gMIjLZeu2TTdduhgiVBLUbBRZOQ%2BZwAIL6rZNoKF7Vop0BeSXTuA%2F6bCf1Z5g9Yss4WC%2Bre81YbVKImtrnUoqBsRQLvt6EtDUYxxJ1NDfdWYC8HivFw1FHvzwf0EiH2wtKFQ0w7eBNa5Hhahs05UOaqo%2FLSNxZJCsbcJfSwAQlo2nIRAVLSEEibbqYfgFtsgxtvv7oRl0a6eci2OU2skCfXHsLCTBJhzGD3RfUALn9o9bwTx%2FYbYYrGtbCuty%2FFmFVVFQvNN9UGJntsNaxxB0SIjIwamTKL7H%2Bi1npGQbV%2B%2F%2BnPJbzHX8UPWgeT0qB1xBIOc59H7TCE%2BEcFyTDdCprQd7XYmeZ1yKySaDDCtrUc3sP38YUzgzP4R8ktyJg52ZVS7tN4o3lfLEmPb2gV29xxZlLs%2Ft7XuA7xeJei9AXlhZn8rZEsu0qvZwnmk2%2F7ArQYXv%2B2kk58H6223NwCwtoS53emfJuuOgn6x1AsWcIc4xavzYu4bhE5GBIe378PQMvav0GcXghx0YAmYnsQw5x%2BYkcU66e81XKGgDS0nXOOWabpnJwdG1%2B43wziuDsx9Yx%2Fy0FznCAWjdPbMf1uzFPk8yxErqzShuz2W3mX%2FeXmkcnP%2Fr41GUiYW4gfCfYYNSPDORY0gITK%2FZR9yq9eex2dJLanL0g9m7g2aaGrhCG%2FwDM72YP1HVkoOguF%2BFcNUoye%2BeVhi5cKavn9mt%2FjSrvctYV6exY0VqNhqJZBR9H0VaqZDO8X7EvMAeCBs822VOUgDHsKLWmINEDhx%2ByK0NpbHMv60U6ZDIUFMU7VQP%2BtK1d5XXzgPu6McG2pU3GuV%2FVbCowFjOX06cSSJy5PCfzvvW6t5nIDtz37p3pMKMZ3Bkbitmo9O88RAHtk5m1XvFG3U3XBb1wFDKEoB6g5P%2BfeLrVt26DP3XprYW4uvsPlWg3%2FlgiJAieljwQteNB6YP3dFIe6Jtpn0KCvOZvkD1TiHVxht9GBvKRPsQncAUAdk%2Bh0Es2U63UOl9BSiNvJHRXIeJtdwLCiB%2F09ymmGig%2BVzXkOs%2FJJA4toKUljFDNZA2c0eJO675v5flLrb9F53Hch%2B%2FJ8gEGWLCPJRj5UzKHnPV2Ln80lFQFmiI%2BKgvnGlDNXs2KzcbzL468kSziFOATaxHg2NRlTxcRi7eHW3WbWp3s7l804cLMutZP0S7jualsRRNCdwL0hnvAP3eYBBUiNGg%3D%3D&__VIEWSTATEGENERATOR=39107D5A&__VIEWSTATEENCRYPTED=&__EVENTVALIDATION=FVV2FXaDZFD8X3%2Faf1efOYoZkPsG5aGc%2FETy%2BPpSaJViSFwo08G2Z%2FD3oNgkB4gPg3vavcSK2z1GyI%2BWo9Nu12mjPGuSzGefVyonOnrV1Tx9nsqyOEhGinftQPWBa29BXIAnAhb3XxtTj7SEAeR0KYR%2FsPYpD3tPoUjMrLhmHmj4RM%2BOZGYZDj4B2LdpVg%2B1RIYBhs3rdEVxzuhlhAPlgiKvhO8v46wlweLVy47Y4ZOrzks2z2fm%2BHXNDVD8RXTPKl%2FnTUobsv9iZ9imVAs8DS7I8WkrPtH3TJS7jeFS1NGd3eJIXwCZmpsplMe5tY3wHjYLBPijk0dsH7%2B8DvuQ3byNcAz4H%2B92IH%2Fw%2BInvxjc52xdSrOCqKUFWUOIAGYsrG%2FtTphEc0XYuqujHxZAXFcP%2FhPDMSDUzTzoQc2VsHqqI9UthFCX9z9TpVX8Th4puc3pIiRS9%2BQHDlp%2FDiq81AI39IThV6W824NnY9uudiuizQmer8CA7DTts%2BVE5qYDsa%2BJKgyMmBP3YHOhsfPssFLhZ7sFhC%2FQwmf9s6X0rizu1aI%2FzRacJGyjm2C4PMtuT0ylmUvMu%2Bx%2FEOUrMNzKO2UXX07jTxwRByiJ13ud7JEW2U6s1Y7Dnvf3%2F0klDACBXPrTUQy0bnLgGypiv2grtCEyXHePltZkUddwM0eEbS3Kcl3fbDEgq97RaomGxCDpA2VRlOLHZZkLYOasMXCNtC9yo1gaXJcNE1ONwyTVn1%2Fi8gIDAo%2BPqNYLRS4fdEQC%2BDvrGoCvcfr5PlIwxyHn6zHAJGJJWsxunbWjhYX9yFDd8nFEVcIJEHiYHCvQBG%2Bxo6gUSwfza0L6v4k1wTa2SJhLoGQS4Slbl6hVaiawb8M8iOgLwnIYyBgRGoIOFsBhcauh0UO8dIMTNi6LRXiFpIK8VXEa4kG8enfIoJWlj2Yp06FL8mxpQXpQJkZrajseuQ5gtxk6vh3ZanVn3XDg5BrBxB2zUZNDmcVlSWOrw%2Fza6g9mm52q%2Ff%2FtdxrS4xYrTcJcavDkt%2Bhh4Y7Brhf3x0oDtyzk6WeLj5l13ZNjEcnqmhRlluY2Y1VCr9fQzrJFH8NZBiKa4pNVv2lfTKXqje1AogYX9LqUC4JoVfkXT1ip9%2BATWAApQpW7Z%2FjxXcUzN53xZopDP3UtpWQ8uPhBwaz2FRlNpQV7cv1QyoawOZDqIck88J9yEUiDKYDHczSkgK5AAeTENbJZSsxtMUJxIA97XyRWunk5WqmixcAQW8GV5QFT8c3yS35TafJ0bNW5U5vN6BByZiGAJg96r8sJrwZ6RkkFXHgPDGlivdxc594qqEDQTAaBo%2Fj2AJjk%2Frs793XMP61%2B6ahQIF9iAqoA8Iq6RaJSgs%2FOP%2BVrsONlNv%2BXCtndmE97M602%2F%2BVJmrQle%2F8ySQLlrmBdafdJGBSZBNrDeqhS4%2Bs9dbTarV7AvUwVjZUgTIhJ1JTgGGu09kgVOe5FDd89KJ6D9xFNUzAMJiyK7H%2BbX3Mf5KdZgmVnyehE%3D&ctl00%24ddlState=quarantaine&ctl00%24tbSearch=' --compressed"


Answer

Well, I'll be darned. They don't use some of the more gnarly features of ASP.NET, so this is really straightforward. As I noted in a similar question on this site, there do not appear to be any restrictions on scraping in the (non-existent) robots.txt, nor any terms/conditions that I could find.

library(httr)
library(rvest)
library(docxtractr) # for data frame cleaning helper utilities
library(tidyverse)

Let's get the first page:

httr::GET(
  url = "http://www.domainia.nl/quarantaine/2018/12/15"
) -> res

pg <- httr::content(res)

Now, we'll extract the table:

html_node(pg, xpath = ".//table[contains(., 'Domein')]") %>% 
  html_table(fill=TRUE, trim=TRUE) %>% 
  select(2:6) %>%                    # The table is full of junk so we trim it off
  docxtractr::assign_colnames(3) %>% # The column headers are in row 3
  docxtractr::mcga() %>%             # Make the column names great again
  tbl_df() -> pg_one

Assign it to a list that we'll be adding to:

pgs <- list(pg01 = pg_one)

Now, go over the remaining pages (you can handle more than 10 by extracting the pagination row and taking the max/last td instead of hard-coding the loop bound).
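That pagination trick can be sketched on toy markup (the `tr.pager` selector and structure here are assumptions; inspect the real page for its actual classes):

```r
library(rvest)

# Hypothetical pager row; the real site's markup may differ.
pager <- minimal_html('<table><tr class="pager">
  <td><a>1</a></td><td><a>2</a></td><td><a>12</a></td>
</tr></table>')

last_pg <- pager %>%
  html_nodes("tr.pager a") %>%  # grab each page link in the pager row
  html_text() %>%
  as.integer() %>%
  max(na.rm = TRUE)             # the largest number is the last page

last_pg  # 12 -- use 2:last_pg as the loop range instead of hard-coding 2:10
```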

Inside the loop, we extract the view state fields, set up the other POST body parameters, and increment the page number we're requesting. We issue the POST, extract the new table into the list, and lather/rinse/repeat for the remaining pages:

for (pg_num in 2:10) {

  Sys.sleep(5) # be kind since you don't own the server or pay for the bandwidth

  hinputs <- html_nodes(pg, "input[type='hidden']")
  hinputs <- as.list(setNames(html_attr(hinputs, "value"), html_attr(hinputs, "name")))
  hinputs$`ctl00$tbSearch` <- ""
  hinputs$`ctl00$ddlState` <- "quarantaine"
  hinputs$`__EVENTTARGET` <- "ctl00$ContentPlaceHolder1$gvDomain"
  hinputs$`__EVENTARGUMENT` <- sprintf("Page$%s", pg_num)

  httr::POST(
    url = "http://www.domainia.nl/quarantaine/2018/12/15",
    encode = "form",
    body = hinputs
  ) -> res

  pg <- httr::content(res) # refresh pg so the next iteration reads the new view state

  html_node(pg, xpath = ".//table[contains(., 'Domein')]") %>% 
    html_table(fill=TRUE, trim=TRUE) %>% 
    select(2:6) %>% 
    docxtractr::assign_colnames(3) %>% 
    docxtractr::mcga() %>% 
    tbl_df() -> pgs[[sprintf("pg_%02d", pg_num)]] # %02d (not %02s) zero-pads the new named list entry

}

Finally, combine all those rows:

bind_rows(pgs)
## # A tibble: 954 x 5
##    domein              status      archive geregistreerd_op uit_quarantaine
##    <chr>               <chr>       <chr>   <chr>            <chr>          
##  1 0172design.nl       quarantaine 0       ""               15-12-2018     
##  2 0172designs.nl      quarantaine 0       ""               15-12-2018     
##  3 0172kleding.nl      quarantaine 0       ""               15-12-2018     
##  4 0172online.nl       quarantaine 0       ""               15-12-2018     
##  5 123shows.nl         quarantaine 0       ""               15-12-2018     
##  6 123story.nl         quarantaine 0       ""               15-12-2018     
##  7 21018dagen.nl       quarantaine 0       ""               15-12-2018     
##  8 22academy.nl        quarantaine 0       ""               15-12-2018     
##  9 22aviationcampus.nl quarantaine 0       ""               15-12-2018     
## 10 22campus.nl         quarantaine 0       ""               15-12-2018     
## # ... with 944 more rows
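Every column comes back as character; if you need real dates, the `dd-mm-yyyy` strings shown above parse with base R (a small sketch using one value from the output):

```r
# "uit_quarantaine" values look like "15-12-2018"; convert with an explicit format:
uit <- as.Date("15-12-2018", format = "%d-%m-%Y")
format(uit, "%Y-%m-%d")  # "2018-12-15"
```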
