Extracting data from javascript with R


Question



Thanks for taking an interest in this.

I was given the [tedious] task of looking up the country of origin of some medicines, as registered with the Colombian food and drug administration. The agency uses a website with javascript (.jsp extension), and I would like to know whether it is possible to automate the process. This is the lookup, step by step:

1. Go to the agency's website: Agency's consult site
2. Select "Medicamentos" in the drop-down list on the left
3. Under "expediente" (rightmost box at the top) write the number we're looking for (two of the 900+ I have to check are 2203 and 3519). The radio-button selection makes no difference.
4. Hit the search button ("buscar")
5. Click the link presented in the table below
6. Ideally, get the table line that starts with FABRICANTE (manufacturer), but being able to save the document would be enough (I plan to get/clean/analyze the data using R later on).
7. Hit the clear button ("nueva consulta")
8. Start all over from steps 3 to 7.

I don't have the slightest idea whether this can be accomplished, and if so, how; so I'd appreciate any guidance that allows me to start in any direction (other than the one I have at hand now: looking them up by hand!). I'm familiar with R and some VB, but if it's possible in any other language, I'll give it a try.

What I've tried:

• I tried to find any information related to extracting data from javascript, but most of what I've found is about using javascript to pass data from different sorts of databases into html/xml, or about extracting the data from a single response. That's not the part I want to automate: once I'm at the response, it would be easy to look only at the value [country of origin]. The "consult" part is the hardest! I've felt so off-track that I think I'm clueless as to how to search adequately. Guidance / ideas / starters are much appreciated.
• I've opened the agency's site with the inspector (Firefox), but stopped just after finding that the variable "expediente" is the one that gets the value for "expediente" (not very useful!). I don't know whether it is possible (or how) to iterate on the page to change the value of that variable.

Thanks!

Solution

I have used phantomjs with the RSelenium package. Details on how to set up phantomjs can be found at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-saucelabs.html#id2a
phantomjs can be driven directly, without the need for a Selenium Server. Because it is headless, it should be a lot quicker for the task you outline.

The first part of your question can be achieved as follows:

    library(RSelenium)

    appURL <- "http://web.sivicos.gov.co:8080/consultas/consultas/consreg_encabcum.jsp"

    pJS <- phantom()  # start phantomjs in webdriver mode
    remDr <- remoteDriver(browserName = "phantom")
    remDr$open()
    remDr$navigate(appURL)

    # Select the third option of the "grupo" select box (MEDICAMENTOS)
    webElem <- remDr$findElement("css", "select[name='grupo'] option:nth-child(3)")
    webElem$clickElement()

    # Type the expediente number into the input named "expediente"
    webElem <- remDr$findElement("css", "input[name='expediente']")
    webElem$sendKeysToElement(list("2203"))  # key strokes are sent as text

    # Click the Buscar button
    remDr$findElement("id", "INPUT2")$clickElement()
      

Now the form has been filled in and the search button clicked. The results are in an iframe with name="datos", and the driver has to switch to that iframe before it can reach them:

    # Switch to the "datos" iframe
    remDr$switchToFrame(remDr$findElement("css", "iframe[name='datos']"))
    # Click the link given in the iframe
    remDr$findElement("css", "a")$clickElement()

    # Get the resulting page source
    appData <- remDr$getPageSource()[[1]]

    # Shut down phantomjs
    pJS$stop()
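The steps so far handle a single expediente. As a rough sketch of how they might be repeated for the full list (steps 3 to 8 of the question) while saving each page to disk, something like the following could work. The names `make_fetcher`, `scrape_all`, and `out_dir` are hypothetical and not part of RSelenium. The sketch also assumes that reloading the form URL is an acceptable stand-in for the "nueva consulta" button and that no extra waiting is needed between clicks, neither of which has been checked against the site.

```r
# Hypothetical helper: builds a function(id) that runs one lookup with an
# already opened RSelenium remoteDriver (`remDr`) and returns the page source.
# It reuses only the calls shown above; it is not run until you call it.
make_fetcher <- function(remDr, appURL) {
  function(id) {
    remDr$navigate(appURL)  # assumption: reloading resets the form and frame
    remDr$findElement("css", "select[name='grupo'] option:nth-child(3)")$clickElement()
    remDr$findElement("css", "input[name='expediente']")$sendKeysToElement(list(as.character(id)))
    remDr$findElement("id", "INPUT2")$clickElement()
    remDr$switchToFrame(remDr$findElement("css", "iframe[name='datos']"))
    remDr$findElement("css", "a")$clickElement()
    remDr$getPageSource()[[1]]
  }
}

# Hypothetical helper: calls `fetch_one(id)` for every id, writes each page to
# <out_dir>/<id>.html and reports which lookups worked.
scrape_all <- function(expedientes, fetch_one, out_dir = "pages") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  status <- vapply(expedientes, function(id) {
    # tryCatch keeps one failed lookup from aborting all the others
    html <- tryCatch(fetch_one(id), error = function(e) NA_character_)
    if (is.na(html)) return("failed")
    writeLines(html, file.path(out_dir, paste0(id, ".html")))
    "saved"
  }, character(1), USE.NAMES = FALSE)
  data.frame(expediente = expedientes, status = status,
             stringsAsFactors = FALSE)
}

# Usage (needs a running phantomjs session, so it is left commented out):
# result <- scrape_all(c(2203, 3519), make_fetcher(remDr, appURL))
```

Passing the lookup in as a function keeps the loop independent of the browser session, so it can be tried out with a stand-in function before pointing it at the real site. In that setup, `pJS$stop()` would be called only after the whole list has been processed.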
      

The data from the iframe is now contained in appData. As an example, we look at the third table using the simple extraction function readHTMLTable (from the XML package):

    library(XML)
    readHTMLTable(appData, which = 3)
                          V1     V2      V3              V4       V5                      V6
    1 Presentacion Comercial   <NA>    <NA>            <NA>     <NA>                    <NA>
    2             Expediente Consec Termino Unidad / Medida Cantidad             Descripcion
    3              000002203     01    0176              ml    60,00  FRASCO AMBAR POR 60 ML
    4              000002203     02    0176              ml   120,00 FRASCO AMBAR POR 120 ML
    5              000002203     03    0176              ml    90,00  FRASCO AMBAR POR 90 ML
              V7     V8            V9
    1       <NA>   <NA>          <NA>
    2 Fecha insc Estado Fecha Inactiv
    3 2007/01/30 Activo
    4 2007/01/30 Activo
    5 2012/03/15 Activo
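For step 6 of the question (the line that starts with FABRICANTE), one possible approach is to parse all tables at once and keep the rows whose first cell begins with that label. The helper below is a minimal sketch in base R; `rows_starting_with` is a hypothetical name, and it assumes the label sits in the first column of whichever table holds it, which has not been verified against the actual page.

```r
# Hypothetical helper: from a list of data frames (such as the one returned by
# readHTMLTable() when `which` is omitted), return the rows whose first cell
# starts with `label`. Matching ignores case and leading whitespace.
rows_starting_with <- function(tables, label = "FABRICANTE") {
  hits <- lapply(tables, function(tbl) {
    # readHTMLTable can return NULL for tables it could not convert
    if (is.null(tbl) || nrow(tbl) == 0 || ncol(tbl) == 0) return(NULL)
    first_col <- trimws(as.character(tbl[[1]]))
    keep <- !is.na(first_col) &
      startsWith(toupper(first_col), toupper(label))
    tbl[keep, , drop = FALSE]
  })
  # Drop tables that contributed no matching row
  Filter(function(x) !is.null(x) && nrow(x) > 0, hits)
}

# Usage (assumes appData from above and the XML package):
# tables <- readHTMLTable(appData, stringsAsFactors = FALSE)
# rows_starting_with(tables, "FABRICANTE")
```

The result is a list with one data frame per table that contained a match, so it is easy to see which table the manufacturer line came from.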
      
