mod_rewrite rule to enforce canonical percent-encoding


Question

We have a PHP app with a dynamic URL scheme that requires certain characters to be percent-encoded, including "unreserved characters" such as parentheses and apostrophes that do not actually need to be encoded. URLs that the app deems to be encoded the "wrong" way are canonicalized and then redirected to the "right" encoding.

But Google and other user agents canonicalize percent-encoding differently. When Googlebot requests a page it asks for the "wrong" URL, and when it gets back a redirect to the "right" URL, it refuses to follow the redirect and does not index the page.

Yes, this is a bug on our end. The HTTP specs require servers to treat percent-encoded and non-percent-encoded unreserved characters identically. But fixing the problem in the app code is not straightforward right now, so I was hoping to avoid a code change by using an Apache rewrite rule that ensures URLs are encoded "properly" from the app's point of view: apostrophes, parentheses, etc. are all percent-encoded, and spaces are encoded as + and not %20.

Here's one example, where I want to rewrite the first form and end up with the second:


  • www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+(Linux)

  • www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+%28Linux%29

Here's another:


  • www.splunkbase.com/apps/All/4.x/app:Benford's+Law+Fraud+Detection+Add-on

  • www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on

And another:


  • www.splunkbase.com/apps/All/4.x/app:Benford%27s%20Law%20Fraud%20Detection%20Add-on

  • www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on

If the app sees only the second form of these URLs, it won't send any redirects and Google will be able to index the page.
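The three pairs above share one target mapping: decode each path segment, then re-encode it with spaces as + and everything outside a small safe set as %XX. The following is a minimal Python sketch of that mapping, written only to pin down the desired result. It is not the app's actual code; the function names are hypothetical, and keeping ":" literal is an assumption inferred from the example URLs.

```python
from urllib.parse import quote_plus, unquote_plus


def canonical_segment(segment: str) -> str:
    """Decode one path segment, then re-encode it the way the examples expect."""
    decoded = unquote_plus(segment)        # "%20" and "+" both become a space
    # quote_plus leaves letters, digits and "_.-~" alone and turns spaces into "+".
    # Assumption: ":" stays literal, as in "app:Benford..." in the examples.
    return quote_plus(decoded, safe=":")


def canonical_path(path: str) -> str:
    """Canonicalize every segment while keeping the "/" separators."""
    return "/".join(canonical_segment(s) for s in path.split("/"))


def needs_redirect(path: str) -> bool:
    """True when the requested path is not already in the canonical form."""
    return canonical_path(path) != path


if __name__ == "__main__":
    for p in (
        "apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+(Linux)",
        "apps/All/4.x/app:Benford's+Law+Fraud+Detection+Add-on",
        "apps/All/4.x/app:Benford%27s%20Law%20Fraud%20Detection%20Add-on",
    ):
        print(canonical_path(p))
```

A rewrite rule that achieves the same mapping before the request reaches the app would make `needs_redirect` false for all three first-form URLs.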

I'm a newbie with rewrite rules. From my reading of the mod_rewrite documentation, it is clear that mod_rewrite does some automatic encoding and decoding, which may help or hurt what I want to do; I'm not sure which.

Any advice on rewrite rules to handle the above cases? I'm OK with one rule per special character since there aren't many of them, but a single rule (if possible) would be ideal.

Recommended answer

The solution may actually be fairly simple, though it will only work in Apache 2.2 and later because it uses the B flag. I'm not sure whether it handles every case correctly (admittedly, I'm a bit skeptical that it doesn't involve more work than this), but the source code leads me to believe it should.

Keep in mind, too, that the value of REQUEST_URI is not updated by mod_rewrite transformations, so if your application relies on that value to determine the requested URL, the changes you make won't be visible to it anyway.

The good news is that this can be done in .htaccess, so you can leave the main configuration untouched if that works better for you.

RewriteEngine On

# Make sure this is only done once to avoid escaping the escapes...
RewriteCond %{ENV:REDIRECT_STATUS} ^$
# Check if we have anything to bother escaping (likely unnecessary...)
RewriteCond $0 [^\w]+
# Rewrite the entire URL by escaping the backreference
RewriteRule ^.*$ $0 [B]

So why use the B flag instead of letting mod_rewrite escape the rewritten URL automatically? When mod_rewrite escapes the URL automatically, it uses ap_escape_uri (which has apparently been turned into a macro for ap_os_escape_path for some reason...), a function that escapes only a limited subset of characters. The B flag, however, uses an internal module function called escape_uri, which is modeled on PHP's urlencode function.

The implementation of escape_uri in the module suggests that alphanumeric characters and underscores are left as-is, spaces are converted to +, and everything else is converted to its percent-escaped equivalent. This seems to be the behaviour you want, so presumably it should work.
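To make that description concrete, here is a small Python model of the three rules as stated above. It is a sketch of the described behaviour, not a port of the module's C code; the function name is hypothetical, and the uppercase hex digits are an arbitrary choice of this sketch.

```python
def escape_like_b_flag(text: str) -> str:
    """Model of the escaping rules described for mod_rewrite's escape_uri."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and (ch.isalnum() or ch == "_"):
            out.append(ch)                 # alphanumerics and "_" are kept as-is
        elif ch == " ":
            out.append("+")                # a space becomes "+"
        else:
            out.append(f"%{byte:02X}")     # every other byte becomes %XX
    return "".join(out)


if __name__ == "__main__":
    print(escape_like_b_flag("Benford's Law"))
    print(escape_like_b_flag("(Linux)"))
```

Read literally, the description would also escape characters such as "/", "-", "." and ":", so it is worth testing the rule against full paths like the ones in the question before relying on it.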

If not, you do have the option of setting up an external-program RewriteMap that manipulates your incoming URLs into the correct format. This requires changing the Apache configuration, though, and a misbehaving script could cause problems for the whole server, so I don't consider it an ideal solution if it can be avoided.
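For reference, an external-program map reads one lookup key per line on standard input and must write exactly one result line per key to standard output, unbuffered. The Python below is a minimal sketch of such a program under several assumptions: the key is the already-decoded path, a literal "+" in it stands for a space, ":" stays unencoded, and the function names are hypothetical. The map would be declared in the server or virtual-host configuration with something like `RewriteMap canon prg:/path/to/script` and looked up as `${canon:$0}`, where the map name and path are placeholders.

```python
import sys
from urllib.parse import quote_plus


def canonicalize(path: str) -> str:
    """Re-encode each segment of an already-decoded path."""
    segments = []
    for seg in path.split("/"):
        seg = seg.replace("+", " ")        # assumption: "+" in the path means space
        segments.append(quote_plus(seg, safe=":"))
    return "/".join(segments)


def serve(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Answer one lookup per input line, flushing after every reply."""
    for line in stdin:
        stdout.write(canonicalize(line.rstrip("\n")) + "\n")
        stdout.flush()                     # a buffered reply would stall the request
```

When deployed as the map program, the script would end with a call to `serve()`. Because every request blocks on this process, any unhandled exception or missing flush affects the whole server, which is the risk described above.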
