Retrieve useful info from a web page using JSOUP


Problem description

How can I retrieve the "Contact Us" link from the footer of any web page on the World Wide Web, in Java?

E.g. find a footer element, or an element with id="footer" or with a footer class?

I tried retrieving all the links from the web page using JSOUP and then running the regex .*contact.* against them. But I cannot be 100% sure that the fetched link is really the website's "Contact Us" page.

Q2

Is there a more robust approach? Or could I combine the footer link with the approach I have already implemented to conclude that a page is certainly a "Contact Us" page?

Solution

But I cannot be 100% sure that the fetched link...

SHORT ANSWER

You will NEVER be sure.


LONG ANSWER

For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human. It represents a big challenge for a computer.

I can see some options in your case:

Option 1: Crowd sourcing

  • Collect the URLs of all the websites whose "Contact Us" information you want
  • Send them to a crowd service platform and ask real people to find the information for you (Rapidworkers.com, Crowdsource.com, Clickworker.com, Amazon Mechanical Turk, microworkers.com)

Check whether the platform offers an API.

+ Work done by humans
+ Adapts dynamically to unknown patterns
- Costs money
- Humans are bad at repetitive tasks

Option 2: AI (pattern searching)

  • Train an AI to extract the information
  • Then throw your websites at it

Have a look at Weka or Java-ML, for instance.

+ Automated task
+ Can perform a repetitive task for a long time
- May take time to build a robust solution
- Risk of false positives or complete misses

Option 3: Use Jsoup

  • Carefully study the pattern of the websites you target
  • Tell Jsoup to find the pattern you have detected

This option is a never-ending task. You will always have to feed Jsoup new patterns. I suggest setting up a monitoring system that tells you when a website escapes every known pattern.

+ Automated task
+ Can perform a repetitive task for a long time
- Takes time to study, discover, and add new patterns
- Risk of false positives or complete misses
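
As a rough illustration of Option 3, here is a minimal sketch. It assumes the sites you target put the contact link inside a <footer> element, or inside an element whose id or class is "footer", and that the word "contact" shows up in the link text or in the href. The class name, the selector, and the pattern are hypothetical choices for this example; real sites will need more patterns than this. When nothing matches, the sketch prints a message, which is where a monitoring hook could go.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FooterContactLinks {

    // Assumed pattern: "contact" anywhere in the string, case-insensitive.
    private static final Pattern CONTACT =
            Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    // Assumed footer markers: <footer>, id="footer", or class "footer".
    private static final String FOOTER_LINKS =
            "footer a[href], #footer a[href], .footer a[href]";

    /** Returns absolute URLs of footer links whose text or href mentions "contact". */
    static List<String> findContactLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select(FOOTER_LINKS)) {
            boolean hrefMatches = CONTACT.matcher(link.attr("href")).find();
            boolean textMatches = CONTACT.matcher(link.text()).find();
            if (hrefMatches || textMatches) {
                // absUrl resolves the href against baseUri; empty if it cannot.
                String abs = link.absUrl("href");
                if (!abs.isEmpty() && !result.contains(abs)) {
                    result.add(abs);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href='/blog/contact-lenses'>Contact lenses</a>"
                + "<footer><a href='/about'>About</a>"
                + "<a href='/contact-us'>Contact Us</a></footer>"
                + "</body></html>";
        List<String> links = findContactLinks(html, "https://example.com/");
        if (links.isEmpty()) {
            // Monitoring hook: this site escaped every known pattern.
            System.out.println("No known pattern matched; flag this site for review");
        } else {
            System.out.println(links);
        }
    }
}
```

Restricting the search to the footer keeps the "Contact lenses" blog link out of the result, but it also means a site with its contact link only in the header is a complete miss, so you still cannot be sure.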

Option 4: A mix of the three options above

You can run all three options against the websites you target.

+ Reduces the chance of false positives or complete misses
+ More confident final result
- Takes time to study, discover, and add new patterns
- Costs money
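
One way to picture the mix is to score each candidate link on several signals and send only the ambiguous ones to humans. The sketch below assumes the href, link text, and "is it in the footer" flag have already been extracted (for example with Jsoup). The weights, the threshold, and the class name are hypothetical values picked for illustration, not tuned on real data.

```java
import java.util.regex.Pattern;

public class ContactLinkScorer {

    // Hypothetical weights and threshold, chosen only for illustration.
    static final int HREF_MATCH = 2;
    static final int TEXT_MATCH = 2;
    static final int IN_FOOTER = 1;
    static final int ACCEPT_THRESHOLD = 4;

    private static final Pattern CONTACT =
            Pattern.compile("contact", Pattern.CASE_INSENSITIVE);

    enum Decision { ACCEPT, HUMAN_REVIEW, REJECT }

    /** Adds up the signals that suggest a link points to a contact page. */
    static int score(String href, String text, boolean inFooter) {
        int score = 0;
        if (CONTACT.matcher(href).find()) score += HREF_MATCH;
        if (CONTACT.matcher(text).find()) score += TEXT_MATCH;
        if (inFooter) score += IN_FOOTER;
        return score;
    }

    /** Strong matches are accepted, weak ones rejected, the rest go to people. */
    static Decision decide(String href, String text, boolean inFooter) {
        int s = score(href, text, inFooter);
        if (s >= ACCEPT_THRESHOLD) return Decision.ACCEPT;
        if (s >= HREF_MATCH) return Decision.HUMAN_REVIEW; // one textual signal only
        return Decision.REJECT;
    }

    public static void main(String[] args) {
        System.out.println(decide("/contact-us", "Contact Us", true));
        System.out.println(decide("/blog/contact-lenses", "Our lenses", false));
        System.out.println(decide("/about", "About", true));
    }
}
```

Only the middle band costs money, since those are the links you would forward to a crowd platform. An accepted link is still a best guess, not a certainty.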
