How do I scrape data through Mechanize and Nokogiri?

Question

I am working on an application which gets the HTML from http://www.screener.in/.

I can enter a company name like "Atul Auto Ltd" and submit it, and then, from the next page, scrape the following details: "CMP/BV" and "CMP".

I am using the following code:

require 'mechanize'
require 'rubygems'
require 'nokogiri'

Company_name='Atul Auto Ltd.'
agent = Mechanize.new
page = agent.get('http://www.screener.in/')
form = agent.page.forms[0]
print agent.page.forms[0].fields
agent.page.forms[0]["q"]=Company_name
button = agent.page.forms[0].button_with(:value => "Search Company")
pages=agent.submit(form, button)
puts pages.at('.//*[@id="top"]/div[3]/div/table/tbody/tr/td[11]')
# not getting any output.

The code takes me to the right page, but I don't know how to query it to get the required data.

I tried different things but was unsuccessful.

If possible, can someone point me towards a good tutorial that explains how to scrape elements of a particular class from an HTML page? The XPath of the first "CMP/BV" is:

//*[@id="top"]/div[3]/div/table/tbody/tr/td[11]

but it is not giving any output.

Recommended answer

Using Nokogiri, I would do it as follows:

Using CSS selectors

require 'nokogiri'
require 'open-uri'

# URI.open is provided by open-uri; a bare open() no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open('http://www.screener.in/company/?q=Atul+Auto+Ltd.'))

doc.class
# => Nokogiri::HTML::Document
doc.css('.table.draggable.table-striped.table-hover tr.strong td').class
# => Nokogiri::XML::NodeSet

row_data = doc.css('.table.draggable.table-striped.table-hover tr.strong td').map do |tdata|
  tdata.text
end

# The values below come from the table on the web page titled
# "Peer Comparison: Top 7 companies in the same business".

row_data
# => ["6.",
#     "Atul Auto Ltd.",
#     "193.45",
#     "8.36",
#     "216.66",
#     "3.04",
#     "7.56",
#     "81.73",
#     "96.91",
#     "17.24",
#     "2.92"]

Looking at the table on the web page, I can see that CMP is the third column and CMP/BV is the last column. Now I can get the data from the array row_data: CMP is at index 2, and CMP/BV is the last value of the array.

row_data[2] # => "193.45" #CMP
row_data.last # => "2.92" #CMP/BV
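
The scraped cells are strings, so they usually need converting before any arithmetic. A minimal sketch, assuming the row layout shown above; the helper name `to_number` and the index variables are hypothetical, introduced only for illustration:

```ruby
# Values copied from the row_data output shown above.
row_data = ["6.", "Atul Auto Ltd.", "193.45", "8.36", "216.66", "3.04",
            "7.56", "81.73", "96.91", "17.24", "2.92"]

# Assumed positions, taken from the description above:
# CMP is the third cell, CMP/BV is the last cell.
cmp_index    = 2
cmp_bv_index = -1

# Hypothetical helper: convert a cell's text to a Float, tolerating
# surrounding whitespace and thousands separators. Float() raises
# ArgumentError on non-numeric text instead of silently returning 0.0.
def to_number(text)
  Float(text.delete(',').strip)
end

company = row_data[1]
cmp     = to_number(row_data[cmp_index])
cmp_bv  = to_number(row_data[cmp_bv_index])

puts "#{company}: CMP=#{cmp}, CMP/BV=#{cmp_bv}"
```

Using `Float()` rather than `String#to_f` is a deliberate choice here: if the page layout changes and a cell no longer holds a number, the script fails loudly instead of returning 0.0.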

Using XPath

require 'nokogiri'
require 'open-uri'

# URI.open is provided by open-uri; a bare open() no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open('http://www.screener.in/company/?q=Atul+Auto+Ltd.'))

p doc.at_xpath("//*[@id='peers']/table/tbody/tr[6]/td[3]").text
p doc.at_xpath("//*[@id='peers']/table/tbody/tr[6]/td[11]").text
# >> "193.45" # CMP
# >> "2.92"   # CMP/BV
