Is Content Negotiation broken?


Question

I recently got interested in web crawlers, but one thing isn't clear to me. Imagine a simple crawler that fetches a page, extracts the links from it, and queues them to be processed later in the same way.

How do crawlers handle the case where a link leads not to another page but to an asset or some other kind of static file? How would the crawler know? It probably doesn't want to download that kind of potentially large binary data, or even XML or JSON files. How does content negotiation fit into this?

The way I see it, content negotiation should happen on the web server side: when I issue a request to example.com/foo.png with Accept: text/html, the server should send back an HTML response, or a Bad Request status if it cannot satisfy my requirements. Nothing else is acceptable. But that's not how it works in real life. The server sends back the binary data anyway, with Content-Type: image/png, even though I'm telling it I only accept text/html. Why do web servers work like this instead of enforcing the response type I'm asking for?

Is the implementation of content negotiation broken, or is it the application's responsibility to implement it correctly?

And how do real crawlers work? Sending a HEAD request ahead of time to check what's on the other side of a link seems like an impractical waste of resources.

Recommended answer

Not 'Bad Request'; the correct response is 406 Not Acceptable.
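To make the 406 case concrete, here is a minimal sketch of the check an application could run before serving a resource. The function names (parse_accept, is_acceptable, status_for) are hypothetical, and the parsing is simplified: it only looks at media ranges and q-values and ignores other Accept parameters.

```python
def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs (simplified)."""
    ranges = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_range = pieces[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when it is not given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not wanted"
        ranges.append((media_range, q))
    return ranges


def is_acceptable(accept_header, content_type):
    """True if the most specific matching media range has q > 0."""
    if not accept_header.strip():
        return True  # no Accept header means any media type is fine
    ctype = content_type.split(";")[0].strip().lower()
    main_type = ctype.partition("/")[0]
    best = None  # (specificity, q) of the most specific matching range
    for media_range, q in parse_accept(accept_header):
        if media_range == ctype:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if best is None or specificity > best[0]:
            best = (specificity, q)
    return best is not None and best[1] > 0


def status_for(accept_header, content_type):
    """Status a strict server would pick: 200 if acceptable, else 406."""
    return 200 if is_acceptable(accept_header, content_type) else 406
```

With this, status_for("text/html", "image/png") gives 406, which is the behaviour the question expected. A server that skips this check simply sends the PNG.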

The HTTP spec states that the server SHOULD send back a 406 in this situation [1], but most implementations don't. If you want to avoid downloading a content type you're not interested in, your only option is indeed to do a HEAD first. Since you probably found these image links while crawling, you may also be able to make an intelligent guess that the target is in fact an image (for instance, because it appeared in an <img> tag).
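Both ideas can be sketched with Python's standard library. This is only an illustration under some assumptions: LinkCollector, probably_a_page and head_content_type are hypothetical names, and the tag-to-attribute table is a deliberately small selection.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect (url, tag) pairs so the crawler can guess what a link points to."""

    # Which attribute carries the URL for each tag (small, illustrative set).
    URL_ATTRS = {"a": "href", "img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append((value, tag))


def probably_a_page(tag):
    """Guess from context: only <a> links are likely to lead to another page."""
    return tag == "a"


def head_content_type(url, timeout=10.0):
    """Send a HEAD request and return the media type of the response.

    Falls back to "text/plain" when the server sends no Content-Type,
    because that is what get_content_type() does.
    """
    request = Request(url, method="HEAD", headers={"Accept": "text/html"})
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type()
```

A crawler could skip everything where probably_a_page(tag) is false without any request at all, and only spend a HEAD on links it cannot classify from context, for example by queueing a URL only when head_content_type(url) returns "text/html".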

You could also just start the request as normal and, as soon as you notice that you're getting binary data back, cut the TCP connection short. But I'm not sure how good an idea this is.
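One way to approximate this is to look at the Content-Type header as soon as the response headers arrive and close the response without reading the body. A rough sketch, where fetch_html_or_abort, is_html and MAX_BYTES are hypothetical names and the size cap is an arbitrary assumption:

```python
from urllib.request import Request, urlopen

MAX_BYTES = 1_000_000  # arbitrary cap on how much of a page to read


def is_html(media_type):
    """Media types this crawler is willing to download."""
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_html_or_abort(url, timeout=10.0):
    """GET the URL, but give up before reading the body if it is not HTML.

    Returns the (possibly truncated) body as bytes, or None when aborted.
    """
    request = Request(url, headers={"Accept": "text/html"})
    # urlopen returns once the status line and headers have been read.
    with urlopen(request, timeout=timeout) as response:
        if not is_html(response.headers.get_content_type()):
            # Leaving the with-block closes the response without reading the body.
            return None
        return response.read(MAX_BYTES)
```

This relies on the header instead of sniffing the bytes, so it is only as reliable as the server's Content-Type. Part of the body may already be in flight when the connection is closed, so it saves bandwidth mainly on large files.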
