Why does Nokogiri percent-encode the $ character?

Problem description

Why do I get:

Nokogiri::HTML('<a href="/test_$4b.html">test</a>').to_html

=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"/test_%244b.html\">test</a></body></html>\n"

I thought the $ symbol was valid in a URL?

Followup:

Why do browsers handle this differently? E.g. on the page: http://www.pmlive.com/pharma_news/its_on_shire_and_abbvie_agree_32bn_takeover_586969

The link: http://www.pmlive.com/pharma_news/mylan_buys_abbotts_non-us_generics_in_$5.3bn_deal_585883 works.

But Nokogiri would parse this link as: http://www.pmlive.com/pharma_news/mylan_buys_abbotts_non-us_generics_in_%245.3bn_deal_585883 which does not work (returns 404).

Are they making the decision that $ is actually safe and a better choice?

Solution

RFC 3986 lists the dollar sign as a reserved sub-delimiter (page 12).

reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
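The grammar above can be transcribed into a small Ruby lookup to confirm which group `$` belongs to. This is only an illustrative sketch: the constant names and the `reserved_class` helper are made up for this example and are not part of Nokogiri or any library.

```ruby
# Character sets transcribed from the RFC 3986 grammar quoted above.
GEN_DELIMS = [':', '/', '?', '#', '[', ']', '@'].freeze
SUB_DELIMS = ['!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '='].freeze
RESERVED   = (GEN_DELIMS + SUB_DELIMS).freeze

# Hypothetical helper: report which group a single character falls into.
def reserved_class(char)
  return :gen_delim if GEN_DELIMS.include?(char)
  return :sub_delim if SUB_DELIMS.include?(char)

  :not_reserved
end

puts reserved_class('$') # sub_delim
puts reserved_class('/') # gen_delim
puts reserved_class('a') # not_reserved
```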

It also recommends how reserved characters should be handled:

2.2. Reserved Characters

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
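A quick way to see the kind of conflict this paragraph describes is a query string whose value contains `&`, which is itself a delimiter between pairs. The sketch below uses Ruby's standard `uri` library; the `company` key and `AT&T` value are made-up sample data.

```ruby
require 'uri'

# Left unescaped, the "&" inside the value is read as a pair separator,
# so the data is split in the wrong place (two pairs instead of one).
broken = URI.decode_www_form('company=AT&T')
p broken

# Percent-encoding the conflicting character before forming the query
# keeps the value intact through a round trip.
encoded = URI.encode_www_form([['company', 'AT&T']])
puts encoded # company=AT%26T
p URI.decode_www_form(encoded)
```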

The authors of Nokogiri likely decided that since their library may be used by anyone for any purpose, there is no way to automatically determine whether a reserved character would conflict or not, and therefore the "safest" way to handle it (short of testing a URI directly) would be to escape it as per the recommendation.
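To make the "escape when unsure" idea concrete, here is a minimal sketch of a conservative escaper in plain Ruby. It is not Nokogiri's actual implementation: the helper names, and the choice to leave only unreserved characters plus `/` untouched, are assumptions made for illustration.

```ruby
# Unreserved characters (RFC 3986 section 2.3) never need escaping.
UNRESERVED = /[A-Za-z0-9\-._~]/

# Hypothetical helper: percent-encode every character that is neither
# unreserved nor in the caller-supplied keep list (an assumption here).
def conservative_escape(path, keep: '/')
  path.each_char.map do |ch|
    if ch.match?(UNRESERVED) || keep.include?(ch)
      ch
    else
      ch.bytes.map { |b| format('%%%02X', b) }.join
    end
  end.join
end

# Hypothetical helper: undo percent-encoding byte by byte.
def percent_decode(str)
  str.b.gsub(/%([0-9A-Fa-f]{2})/) { Regexp.last_match(1).hex.chr }
     .force_encoding(Encoding::UTF_8)
end

escaped = conservative_escape('/test_$4b.html')
puts escaped                 # /test_%244b.html
puts percent_decode(escaped) # /test_$4b.html
```

Decoding `%24` gives back `$`, so both spellings carry the same data. A server that returns 404 for the escaped form is presumably matching the raw path string without decoding it first.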
