How to deal with non UTF-8 encoded urls in express

Question

We have a Node.js application which we recently moved from running on IIS 7 (via iisnode) to running on Linux (Elastic Beanstalk). Since we switched, we've been getting a lot of non UTF-8 URLs sent to our application (mainly from crawlers), such as:

Bj%F6rk, which IIS was converting to Björk. This is now passed to our application unchanged, and our web framework (Express) eventually calls down to

decodeURIComponent('Bj%F6rk');
URIError: URI malformed
    at decodeURIComponent (native)
    at repl:1:1
    at REPLServer.self.eval (repl.js:110:21)
    at repl.js:249:20
    at REPLServer.self.eval (repl.js:122:7)
    at Interface.<anonymous> (repl.js:239:12)
    at Interface.emit (events.js:95:17)
    at Interface._onLine (readline.js:203:10)
    at Interface._line (readline.js:532:8)
    at Interface._ttyWrite (readline.js:761:14)

Is there a recommended safe way we can perform the same conversion as IIS before sending the url string to express?

Bearing in mind that

  1. We are receiving requests to these badly encoded URLS and
  2. There is a way to decode them using the deprecated unescape javascript function and
  3. The majority of the requests to these URLs are coming from Bing Bot and we want to minimise any adverse effect on our search rankings.

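  • Should we really be doing this for all incoming URLs?
  • Are there any security or performance implications we should be concerned about?
  • Should we be concerned about unescape being removed in the near future?
  • Is there a better / safer way to solve this issue (yes, we have read the MDN article linked above)?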

Recommended answer

Should we really be doing this for all incoming URLs?

No, you shouldn't. The request being made uses non-UTF8 URI components. That shouldn't be your problem.
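To make the encoding mismatch concrete, here is a minimal sketch (the helper name tryDecode is hypothetical, not part of Express or Node). It assumes the crawler percent-encoded "ö" as the single Latin-1 byte F6, whereas decodeURIComponent expects the UTF-8 byte pair C3 B6:

```javascript
// 'ö' is U+00F6. UTF-8 encodes it as the two bytes C3 B6;
// Latin-1 (ISO-8859-1) encodes it as the single byte F6.
const utf8Encoded = encodeURIComponent('Bj\u00F6rk');
console.log(utf8Encoded); // Bj%C3%B6rk

// Hypothetical helper: wrap decodeURIComponent so a malformed
// component yields a result object instead of throwing.
function tryDecode(component) {
    try {
        return { ok: true, value: decodeURIComponent(component) };
    } catch (err) {
        return { ok: false, error: err };
    }
}

const good = tryDecode('Bj%C3%B6rk'); // UTF-8 form: decodes fine
const bad = tryDecode('Bj%F6rk');     // Latin-1 form: URIError

console.log(good.ok, good.value);
console.log(bad.ok, bad.error instanceof URIError);
```

The second call fails because a lone F6 byte is not a valid UTF-8 sequence, which is the situation the crawler requests create.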

Are there any security or performance implications we should be concerned about?

The encoding of a URI component is not a security issue. Injection attempts via querystring or path params are. But that's another subject. In terms of performance, every middleware will make your responses take a bit longer. But I wouldn't even worry about that. If you want to decode the URI yourself, just do it. It'll only take a few milliseconds.

Should we be concerned about unescape being removed in the near future?

Actually you should. unescape is deprecated. If you still want to use it, check that it exists first, i.e. 'unescape' in global. You can also use the built-in alternative, require('querystring').unescape(), which won't produce the same result in every case but won't throw a URIError. (Not recommended though.)
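A rough sketch of those two options follows. The helper names legacyUnescape and lenientDecode are hypothetical; only the global unescape function and querystring.unescape() come from the platform. As far as I know, querystring.unescape() falls back to a byte-level decode when decodeURIComponent throws, so for Latin-1 input it returns some string but not necessarily "Björk":

```javascript
const querystring = require('querystring');

// Hypothetical helper: use the deprecated global unescape only if the
// runtime still provides it; return null otherwise.
function legacyUnescape(component) {
    if ('unescape' in global && typeof global.unescape === 'function') {
        return global.unescape(component);
    }
    return null;
}

// Hypothetical helper: querystring.unescape() does not throw on
// malformed input, but its result can differ from unescape().
function lenientDecode(component) {
    return querystring.unescape(component);
}

console.log(legacyUnescape('Bj%F6rk'));   // treats %F6 as code point U+00F6
console.log(lenientDecode('Bj%C3%B6rk')); // valid UTF-8 decodes normally
console.log(lenientDecode('Bj%F6rk'));    // no exception; result may not be 'Björk'
```

Both helpers only show that decoding without a URIError is possible. Neither changes the recommendation against decoding these URLs.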

To minimise the adverse effect on search rankings:

Determine which status code your Express app returns in these cases. It could be 500 (INTERNAL SERVER ERROR), which will look bad, or 404 (NOT FOUND), which will tell the crawler you don't have a result for the query (which may not be true).

In these cases, I suggest you override this by returning a client error such as 400 (BAD REQUEST) instead, since the origin of the problem is a malformed URI component being requested, which should be in UTF-8 but it's not. The crawler/bot should be concerned about that.

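// Error-handling middleware: respond with 400 BAD REQUEST for malformed URIs.
// Register it after all routes and other middleware.
app.use(function (err, req, res, next) {
    if (err instanceof URIError) {
        return res.status(400).send();
    }
    // Pass every other error on to the next error handler,
    // otherwise the request would hang without a response.
    next(err);
});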
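To sanity-check a handler of this shape without starting a server, it can be called directly with stub objects. This is only a sketch: badRequestOnUriError and makeStubResponse are hypothetical names, and the stub mimics just the status() and send() calls used here, not the real Express response API. This variant forwards any error that is not a URIError to next:

```javascript
// Same four-argument shape as an Express error-handling middleware.
function badRequestOnUriError(err, req, res, next) {
    if (err instanceof URIError) {
        return res.status(400).send();
    }
    next(err); // anything else goes to the next error handler
}

// Hypothetical stub standing in for Express's response object.
function makeStubResponse() {
    const res = { statusCode: null, sent: false };
    res.status = function (code) { res.statusCode = code; return res; };
    res.send = function () { res.sent = true; return res; };
    return res;
}

// Case 1: a URIError is answered with 400 and not forwarded.
const uriRes = makeStubResponse();
let forwarded = null;
badRequestOnUriError(new URIError('URI malformed'), {}, uriRes, function (e) {
    forwarded = e;
});
console.log(uriRes.statusCode, uriRes.sent, forwarded);

// Case 2: any other error is forwarded untouched.
const otherRes = makeStubResponse();
const otherErr = new Error('boom');
let forwardedOther = null;
badRequestOnUriError(otherErr, {}, otherRes, function (e) {
    forwardedOther = e;
});
console.log(otherRes.statusCode, forwardedOther === otherErr);
```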

Above all, trying to return a result for a malformed URI has other side effects. First, you'll be allowing a bad request — can't be good :). Secondly, it'll mean you have a result for a bad URI which will get stored by crawlers/bots when they get a 200 OK response and it will get spread. Then you'll have to deal with more bad requests.

To conclude; don't decode via unescape. Express already tries to decode via what's proper: decodeURIComponent. If that fails, let it be.
