JSP Page Crawler that extracts all input parameters


Question

Do you happen to know of an open-source Java component that can scan a set of dynamic pages (JSP) and extract all the input parameters from them? Of course, a crawler can only crawl static content and not dynamic code, but my idea is to extend it to crawl a web server including all the server-side code. Naturally, I am assuming that the tool has full access to the crawled web server and does not rely on any hacks.

The idea is to build a static analyzer that can detect all parameter fields (request.getParameter() and the like) in all dynamic pages.
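A naive version of such an analyzer can be sketched with regular expressions. This is only an illustration under stated assumptions: the class name JspParamScanner and both patterns are hypothetical, it only finds parameter names written as string literals in request.getParameter(...) or request.getParameterValues(...) calls and in EL expressions that start with ${param.name, and it does not actually parse JSP or Java, so names built at runtime or read in included servlets are missed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class JspParamScanner {
    // request.getParameter("name") or request.getParameterValues("name")
    // with a string literal argument (hypothetical pattern, not a real parser).
    private static final Pattern PARAM_CALL = Pattern.compile(
            "request\\s*\\.\\s*getParameter(?:Values)?\\s*\\(\\s*\"([^\"]+)\"\\s*\\)");

    // EL expressions that begin with ${param.name
    private static final Pattern EL_PARAM = Pattern.compile(
            "\\$\\{\\s*param\\.([A-Za-z_]\\w*)");

    // Returns the distinct parameter names found in one JSP source:
    // scriptlet-style calls first, then EL references.
    public static Set<String> extract(String jspSource) {
        Set<String> names = new LinkedHashSet<>();
        for (Pattern pattern : new Pattern[] {PARAM_CALL, EL_PARAM}) {
            Matcher matcher = pattern.matcher(jspSource);
            while (matcher.find()) {
                names.add(matcher.group(1));
            }
        }
        return names;
    }

    // Walks a directory tree and scans every file whose name ends in ".jsp".
    public static Map<Path, Set<String>> scanDirectory(Path root) throws IOException {
        Map<Path, Set<String>> result = new LinkedHashMap<>();
        List<Path> jspFiles;
        try (Stream<Path> paths = Files.walk(root)) {
            jspFiles = paths.filter(p -> p.toString().endsWith(".jsp"))
                            .collect(Collectors.toList());
        }
        for (Path file : jspFiles) {
            String source = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            result.put(file, extract(source));
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<% String u = request.getParameter(\"user\"); %>\n"
                + "<input name=\"q\" value=\"${param.q}\">";
        System.out.println(extract(sample)); // prints [user, q]
    }
}
```

Calling scanDirectory on the web application's source root would give one set of names per JSP file. A more robust analyzer would work on the servlet code generated from each JSP rather than on the raw text.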

Answer


The idea is to build a static analyzer that has the capacity to detect all parameter fields from all dynamic pages.

You cannot use a web crawler (basically, an HTML parser) to extract request parameters. At most, it can scan the HTML structure. You can use, for example, Jsoup for this:

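import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Fetch the page and list every form with its input, select, and textarea fields.
// Jsoup's get() throws IOException, so this belongs in a method that handles or declares it.
for (Element form : Jsoup.connect("http://google.com").get().select("form")) {
    System.out.printf("Form found: action=%s, method=%s%n", form.attr("action"), form.attr("method"));
    for (Element input : form.select("input,select,textarea")) {
        System.out.printf("\tInput found: name=%s, value=%s%n", input.attr("name"), input.attr("value"));
    }
}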

This currently prints:


Form found: action=, method=
    Input found: name=hl, value=en
    Input found: name=source, value=hp
    Input found: name=ie, value=ISO-8859-1
    Input found: name=q, value=
    Input found: name=btnG, value=Google Search
    Input found: name=btnI, value=I'm Feeling Lucky
    Input found: name=, value=
Form found: action=/search, method=
    Input found: name=hl, value=en
    Input found: name=source, value=hp
    Input found: name=ie, value=ISO-8859-1
    Input found: name=q, value=
    Input found: name=btnG, value=Google Search
    Input found: name=btnI, value=I'm Feeling Lucky






If you want to scan the JSP source code for forms and inputs, then you have to look in a different direction; that is definitely not what a "web crawler" does. Unfortunately, no such static analysis tool comes to mind. The closest you can get is to create a Filter which logs all submitted request parameters.

Map<String, String[]> params = request.getParameterMap();
// ...
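To make the logging step more concrete, here is a minimal sketch of the part that turns the parameter map into log text. The class ParamLogFormatter and its format method are hypothetical names introduced for illustration. In an actual servlet Filter, doFilter would presumably cast the request to HttpServletRequest, pass getRequestURI() and getParameterMap() to this helper, write the result to a log, and then call chain.doFilter(request, response). The helper itself uses only the standard library, so it can be tried without a servlet container.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class ParamLogFormatter {
    // Formats a parameter map (the shape returned by ServletRequest.getParameterMap())
    // as one line per parameter, sorted by name so the log output is stable.
    public static String format(String requestUri, Map<String, String[]> params) {
        StringBuilder sb = new StringBuilder("Request ").append(requestUri).append('\n');
        for (Map.Entry<String, String[]> entry : new TreeMap<String, String[]>(params).entrySet()) {
            sb.append("  ").append(entry.getKey())
              .append(" = ").append(Arrays.toString(entry.getValue()))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for request.getParameterMap(); the values are made up for the demo.
        Map<String, String[]> params = new LinkedHashMap<>();
        params.put("q", new String[] {"jsp"});
        params.put("hl", new String[] {"en"});
        System.out.print(format("/search", params));
    }
}
```

Collecting these log lines while the application is exercised gives the set of parameters that were actually submitted, which is not necessarily every parameter the pages can read.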
