JSoup error 403 when trying to read the contents of a directory on my website

Question

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=(site)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:465)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
    at plan.URLReader.main(URLReader.java:21)

Hello!

I have been looking for a way to read a directory on a website of mine for an application I'm developing.

I can read the files themselves and work with them if I hardcode them, but when I try to grab the list of files from the directory, I get the error above.

I've tried a few ways, but this is the code I am currently working with.

    String url = ""; // (site removed for privacy)
    print("Fetching %s...", url);

    Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36").get();
    Elements links = doc.select("a[href]");
    Elements media = doc.select("[src]");
    Elements imports = doc.select("link[href]");

... ... ...

Now if I use the main site, as in www.google.com/, it reads the links. The problem is that I want a directory, as in www.google.com/something/something/...

When I try that for my site, I get this error.

Any idea why I can access my main site, but not directories within it?

I also notice that '/' is needed at the end.

Just curious whether I am missing something, or need to do this another way.

Thank you for your time.

Recommended answer

This is likely a problem with (or deliberate attempt to block access using) the server's configuration, not your application. From the tag wiki excerpt for the http-status-code-403 tag:


The 403 or "Forbidden" error message is a HTTP standard response code indicating that the request was legal and understood but the server refuses to respond to the request.

And from the tag wiki itself:


A 403 Forbidden may be returned by a web server due to an authorization issue or other constraint related to the request. File permissions, lack of encryption, and maximum number of users reached (among others) can all be the cause of a 403 response.

If the target site is attempting to block screen-scraping, another possibility is an unrecognized user-agent string, but you're setting the user-agent string to one (I presume) you've obtained from an actual browser, so that shouldn't be the cause.

It's not clear from your question if you expect to fetch a regular (HTML) web page, or a special "directory listing" page generated by the server when an index.html is not present in a directory. If it's the latter, note that many servers have these listings disabled to avoid leaking the names of files in the directory that aren't linked to from the web site itself. Again, this is a server configuration issue, not something your application can work around.
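To confirm that the server, and not your code, is refusing the directory, it can help to compare the status codes returned for the site root and for the directory URL. The following is only a minimal sketch under some assumptions: it uses the JDK's `java.net.HttpURLConnection` instead of Jsoup so that it has no dependencies, and the class name `StatusProbe`, the helper names `explain` and `fetchStatus`, and the shortened user-agent string are all made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class StatusProbe {

    /** Maps an HTTP status code to a short hint about what the server is saying. */
    static String explain(int status) {
        if (status >= 200 && status < 300) {
            return "OK: the server returned content for this URL";
        }
        if (status == 403) {
            return "Forbidden: the server understood the request but refuses to serve this URL";
        }
        if (status == 404) {
            return "Not Found: the server has nothing at this URL";
        }
        return "Other: status " + status;
    }

    /** Sends a GET request and returns the status code; 4xx/5xx codes are returned, not thrown. */
    static int fetchStatus(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", userAgent);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the site root and the directory URL as arguments.
        String userAgent = "Mozilla/5.0"; // placeholder; substitute a real browser string if needed
        for (String url : args) {
            int status = fetchStatus(url, userAgent);
            System.out.println(url + " -> " + status + " (" + explain(status) + ")");
        }
    }
}
```

If the root reports a 2xx status while the directory reports 403, that points at the server configuration for that directory (for example, directory listings being disabled) rather than at the client. With Jsoup itself, the same status is available by catching `HttpStatusException` and calling its `getStatusCode()` method.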
