Extracting data from JavaScript with R
Problem description
Thanks for taking interest in this.
I was given the [tedious] task of finding out the country of origin of some medicines, as they are registered with the Colombian food and drug administration. The agency uses a website built with JavaScript (.jsp extension), and I would like to know if it is possible to automate the process.
This is the step-by-step lookup:
- Go to the agency's website: Agency's consult site
- Select "Medicamentos" in the drop-down list on the left
- Under "expediente" (rightmost box at the top) write the number we're looking for (two of the 900+ I have to check are: 2203 and 3519). The radio-button selection doesn't matter.
- Hit the search button ("buscar")
- Click the link presented in the table below
- Ideally, get the table line that starts with FABRICANTE (manufacturer), but being able to save the document would be enough (I plan to get/clean/analyze the data using R later on).
- Hit the clean button ("nueva consulta")
- Start all over from steps 3 to 7.
I don't have the slightest idea whether this can be accomplished, and if so, how; so I'd appreciate any guidance that allows me to start in any direction (other than the one I have at hand now: looking them up by hand!). I'm familiar with R and some VB, but if it's possible in any other language, I'll give it a try.
What I've tried:
- I tried to find any information related to extracting data from JavaScript, but most of what I've found is about using JavaScript to pass data from different sorts of databases into HTML/XML, or about extracting the data from only one response. That's not the part I want to automate: once I'm at the response, it would be easy to just look at the value [country of origin]. The "consult" part is the hardest! I've felt so off-track that I think I'm clueless as to how to search adequately. Guidance / ideas / starters are much appreciated.
- I've opened the agency's site with the inspector (Firefox), but stopped just after finding that the variable "expediente" is the one that gets the value for "expediente" (not very useful!). I don't know whether it is possible (or how) to iterate on the page to change the value of that variable.
Thanks!
Solution
I have used phantomjs with the RSelenium package. Details on how to set up phantomjs can be found at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-saucelabs.html#id2a
phantomjs can be driven directly, without the need for a Selenium Server. Because it is headless, it should be a lot quicker for the task you outline.
The first part of your question can be achieved as follows:
    library(RSelenium)

    appURL <- "http://web.sivicos.gov.co:8080/consultas/consultas/consreg_encabcum.jsp"

    pJS <- phantom()
    remDr <- remoteDriver(browserName = "phantom")
    remDr$open()
    remDr$navigate(appURL)

    # Get the third list item of the select box (MEDICAMENTOS)
    webElem <- remDr$findElement("css", "select[name='grupo'] option:nth-child(3)")
    webElem$clickElement()  # select this element

    # Send text to the input named "expediente"
    webElem <- remDr$findElement("css", "input[name='expediente']")
    webElem$sendKeysToElement(list("2203"))

    # Click the Buscar button
    remDr$findElement("id", "INPUT2")$clickElement()
Now the form has been filled in and the search button clicked. The data is in an iframe with name="datos". Iframes need to be switched to:

    # Switch to the datos iframe
    remDr$switchToFrame(remDr$findElement("css", "iframe[name='datos']"))
    remDr$findElement("css", "a")$clickElement()  # click the link given in the iframe

    # Get the resulting data
    appData <- remDr$getPageSource()[[1]]

    # Close phantomjs
    pJS$stop()
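The question mentions 900+ numbers and says that simply saving each document would be enough. A minimal sketch of how the lookup could be repeated is below. It is not part of the original answer: `save_lookups` and its `lookup` argument are hypothetical names, and `lookup` is assumed to be a function that takes one expediente and returns the page source as a single string (for example, a wrapper around the RSelenium calls shown above).

```r
# Hypothetical helper: run `lookup` for every expediente and save the HTML it
# returns to one file per expediente. `lookup` is an assumed function(id) that
# returns the page source as a string; it is not defined here.
save_lookups <- function(expedientes, lookup, out_dir = tempdir()) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- character(length(expedientes))
  names(paths) <- as.character(expedientes)
  for (i in seq_along(expedientes)) {
    id <- as.character(expedientes[[i]])
    paths[i] <- tryCatch({
      html <- lookup(id)
      path <- file.path(out_dir, paste0("expediente_", id, ".html"))
      writeLines(html, path)
      path
    }, error = function(e) {
      # Record the failure and carry on with the next number
      message("Lookup failed for expediente ", id, ": ", conditionMessage(e))
      NA_character_
    })
  }
  paths
}
```

With a wrapper in place, a call such as `save_lookups(c(2203, 3519), lookup)` would return a named vector of file paths, with `NA` for any number that failed. In that setup the browser would be started once before the loop and `pJS$stop()` called once after it, rather than per lookup.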
The data for the iframe is now contained in appData. As an example, we look at the third table using the simple extraction function readHTMLTable (from the XML package):

    library(XML)
    readHTMLTable(appData, which = 3)
                          V1     V2      V3              V4       V5                      V6
    1 Presentacion Comercial   <NA>    <NA>            <NA>     <NA>                    <NA>
    2             Expediente Consec Termino Unidad / Medida Cantidad             Descripcion
    3              000002203     01    0176              ml    60,00  FRASCO AMBAR POR 60 ML
    4              000002203     02    0176              ml   120,00 FRASCO AMBAR POR 120 ML
    5              000002203     03    0176              ml    90,00  FRASCO AMBAR POR 90 ML
              V7     V8            V9
    1       <NA>   <NA>          <NA>
    2 Fecha insc Estado Fecha Inactiv
    3 2007/01/30 Activo
    4 2007/01/30 Activo
    5 2012/03/15 Activo
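The question asks for the table line that starts with FABRICANTE. The answer does not say which table holds it, so one possible approach is to read all tables with `readHTMLTable(appData)` (which returns a list when `which` is omitted) and filter them. The sketch below is an illustration only: `find_rows` is a hypothetical helper, and it assumes the label sits in the first cell of its row.

```r
# Hypothetical helper (not part of the original answer): from a list of data
# frames, keep the rows whose first cell starts with `label`.
find_rows <- function(tables, label = "FABRICANTE") {
  hits <- lapply(tables, function(tbl) {
    # Some pages yield NULL or empty tables; skip them
    if (is.null(tbl) || nrow(tbl) == 0 || ncol(tbl) == 0) return(NULL)
    first_col <- trimws(as.character(tbl[[1]]))
    keep <- !is.na(first_col) & startsWith(toupper(first_col), toupper(label))
    tbl[keep, , drop = FALSE]
  })
  # Drop tables that had no matching row
  Filter(function(x) !is.null(x) && nrow(x) > 0, hits)
}
```

A call such as `find_rows(readHTMLTable(appData))` would then return only the tables (reduced to their matching rows) that contain a FABRICANTE line, or an empty list if none do.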