Mechanize - How to follow or "click" meta refreshes in Rails


Question

I am having some trouble with Mechanize.

When I submit a form with Mechanize, I land on a page that contains a single meta refresh and no links.

My question is: how do I follow the meta refresh?

I have tried to allow meta refresh, but then I get a socket error. Sample code:

require 'mechanize'
agent = WWW::Mechanize.new
agent.get("http://euroads.dk")
form = agent.page.forms.first
form.username = "username"
form.password = "password"
form.submit
page = agent.get("http://www.euroads.dk/system/index.php?showpage=login")
agent.page.body

Response:

<html>
 <head>
   <META HTTP-EQUIV=\"Refresh\" CONTENT=\"0;URL=index.php?showpage=m_frontpage\">
 </head>
</html>

Then I tried:

redirect_url = page.parser.at('META[HTTP-EQUIV=\"Refresh\"]')[
  "0;URL=index.php?showpage=m_frontpage\"][/url=(.+)/, 1]

But I get:


NoMethodError: Undefined method '[]' for nil:NilClass

Recommended answer

Internally, Mechanize uses Nokogiri to parse the HTML into a DOM. You can get at the Nokogiri document and use either XPath or CSS accessors to dig around in a returned page.

This is how to get the redirect URL with Nokogiri only:

require 'nokogiri'

html = <<EOT
<html>
  <head>
    <meta http-equiv="refresh" content="2;url=http://www.example.com/">
    </meta>
  </head>
  <body>
    foo
  </body>
</html>
EOT

doc = Nokogiri::HTML(html)
redirect_url = doc.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1]
redirect_url # => "http://www.example.com/"

doc.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1] breaks down to: find the first occurrence (at) of the CSS accessor for the <meta> tag with an http-equiv attribute of refresh, take the content attribute of that tag, and return the string following url=.
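To make the last step of that chain concrete, here is a small sketch in plain Ruby with no Nokogiri involved. The first `content` value is copied from the sample HTML above and the second from the response in the question. The case-insensitive variant is an addition of mine for pages that write `URL=` in capitals, not something the answer's code does.

```ruby
# The value of the <meta> tag's content attribute, as a plain string.
content = "2;url=http://www.example.com/"

# String#[] with a regexp and a capture index returns that capture group,
# or nil when the pattern does not match.
redirect_url = content[/url=(.+)/, 1]

# The pattern is case-sensitive, so an upper-case "URL=" is not matched
# unless the /i flag is added.
upper = "0;URL=index.php?showpage=m_frontpage"
no_match = upper[/url=(.+)/, 1]
relative_url = upper[/url=(.+)/i, 1]

puts redirect_url
puts no_match.inspect
puts relative_url
```

Because the lookup returns nil on a miss, it is worth checking the result before passing it to `agent.get`.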

This is some Mechanize code for a typical use. Because you gave no sample code to base mine on, you'll have to work from this:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.examples.com/')
# Pull the target URL out of the meta refresh tag's content attribute.
redirect_url = page.parser.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1]
page = agent.get(redirect_url)
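The refresh target in the question is a relative URL (`index.php?showpage=m_frontpage`). If you would rather build the absolute URL explicitly before requesting it, a minimal sketch using only Ruby's standard `uri` library could look like this. The two URLs are taken from the question, and the variable names are illustrative.

```ruby
require 'uri'

# URL of the page that contained the meta refresh (from the question).
page_url = 'http://www.euroads.dk/system/index.php?showpage=login'

# Relative target extracted from the tag's content attribute.
redirect_url = 'index.php?showpage=m_frontpage'

# URI.join resolves the relative reference against the page URL.
absolute_url = URI.join(page_url, redirect_url).to_s

puts absolute_url
```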


at('META[HTTP-EQUIV=\"Refresh\"]')

Your code has the at() call shown above. Notice that you are escaping the double quotes inside a single-quoted string. That puts a backslash followed by a double quote into the string, which is NOT what my sample uses, and it is my first guess as to why you are getting that error. Nokogiri can't find the tag because there is no <meta http-equiv=\"Refresh\"...>.
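The quoting difference is easy to check in plain Ruby. This is only an illustration of how single-quoted strings treat `\"`. It does not run the selector through Nokogiri.

```ruby
# Single-quoted Ruby strings only treat \\ and \' as escapes,
# so \" stays as two characters: a backslash and a double quote.
escaped = 'META[HTTP-EQUIV=\"Refresh\"]'
plain   = 'META[HTTP-EQUIV="Refresh"]'

has_backslash = escaped.include?('\\')
extra_chars = escaped.length - plain.length

puts escaped
puts plain
puts extra_chars
```

Separately, the attempt in the question indexes the node with the attribute's value rather than its name, whereas the answer's version uses `['content']`. The page in the question also writes `Refresh` with a capital R. As far as I know, attribute values in the selector are compared case-sensitively, so the value in the selector may need to match that spelling.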

Mechanize has a built-in way to handle meta refreshes, by setting:

 agent.follow_meta_refresh = true

It also has a method to parse the meta tag and return the content. From the docs:

parse(content, uri)

Parses the delay and url from the content attribute of a meta tag. Parse requires the uri of the current page to infer a url when no url is specified. If a block is given, the parsed delay and url will be passed to it for further processing. Returns nil if the delay and url cannot be parsed.

# <meta http-equiv="refresh" content="5;url=http://example.com/" />
uri = URI.parse('http://current.com/')

Meta.parse("5;url=http://example.com/", uri)  # => ['5', 'http://example.com/']
Meta.parse("5;url=", uri)                     # => ['5', 'http://current.com/']
Meta.parse("5", uri)                          # => ['5', 'http://current.com/']
Meta.parse("invalid content", uri)            # => nil
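To show what that kind of parsing involves, here is a rough standalone sketch in plain Ruby. `parse_refresh` is a hypothetical helper of my own, not Mechanize's implementation. It only tries to reproduce the four documented results above, and additionally resolves relative targets against the current page.

```ruby
require 'uri'

# Hypothetical helper, NOT part of Mechanize: returns [delay, url] or nil.
def parse_refresh(content, base_uri)
  pattern = /\A\s*(\d+(?:\.\d+)?)\s*(?:;\s*url=\s*['"]?(.*?)['"]?\s*)?\z/i
  match = content.match(pattern)
  return nil unless match

  delay = match[1]
  target = match[2]

  # Fall back to the current page when no target URL is given.
  if target.nil? || target.empty?
    [delay, base_uri.to_s]
  else
    [delay, URI.join(base_uri.to_s, target).to_s]
  end
end

uri = URI.parse('http://current.com/')
with_url   = parse_refresh("5;url=http://example.com/", uri)
empty_url  = parse_refresh("5;url=", uri)
delay_only = parse_refresh("5", uri)
invalid    = parse_refresh("invalid content", uri)

p with_url, empty_url, delay_only, invalid
```

In real code, prefer the library's own parser or `follow_meta_refresh`. The sketch is only meant to show the fallback-to-current-page behaviour described in the quoted documentation.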
