JSOUP Finding Groups of Words

Question

For a homework assignment I have to write a program that scrapes HTML from a website and then somehow finds phrases within the website. When I say phrases I mean some sort of arbitrary way of organizing text so that words that are in close proximity to each other are put in the same group. I know this sounds really unclear, but the assignment states that how we do this is up to our own interpretation of how to find "phrases".

Currently, my code is as follows:

Document doc = Jsoup.connect("http://oracle.com/").get();
String html = doc.body().toString();

System.out.println(html); 

Which will give me a decent printout of all the different words that appear on some webpage while parsing out all the html.

My main problem is I can't think of a way to parse the HTML so that I can somehow get these arbitrary groups together (and I don't know what kind of criteria I can use to arbitrarily form these "groups" of words).

I know this question sounds terrible but I don't know how else I can state it, and I am really out of ideas as to what I can do. The assignment I was given is extremely unclear, and when asked for clarification my professor just tells me to interpret it myself. I was wondering if anyone had any ideas on how to parse the html so that words close to each other (maybe inside similar html tags or something) could be filtered out similar to the current output I have right now, except maybe after every "phrase" there's like a newline or something I can parse.

Thanks for any ideas or advice.

Recommended Answer

What you are looking for is a concept called stemming. From Wikipedia:

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".

You can provide a simple brute-force implementation for this. Also check out the stemming algorithm implementations from Lucene and OpenNLP.
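As a rough illustration of what a brute-force approach might look like, here is a minimal sketch that strips a few common English suffixes and groups words sharing the same root. The class name `NaiveStemmer`, its methods, the suffix list, and the sample sentence are all made up for this example; it is not the Porter algorithm, nor what Lucene or OpenNLP implement. The input string stands in for text already extracted from the page (for instance with Jsoup's `Element.text()`, which returns the text without markup, whereas `toString()` returns the HTML itself).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only: a crude suffix-stripping stemmer.
final class NaiveStemmer {
    // Assumed suffix list, tried in order. A real stemmer uses far more rules.
    private static final String[] SUFFIXES = {"ing", "ed", "er", "s"};

    // Lower-cases the word and strips the first matching suffix,
    // as long as at least three characters remain.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Do not treat the final "s" of words like "miss" as a plural marker.
            boolean endsInDoubleS = suffix.equals("s") && w.endsWith("ss");
            if (w.endsWith(suffix) && !endsInDoubleS
                    && w.length() - suffix.length() >= 3) {
                return undouble(w.substring(0, w.length() - suffix.length()));
            }
        }
        return w;
    }

    // Collapses a doubled final consonant ("stemm" becomes "stem").
    // This over-stems some words, e.g. "missing" ends up as "mis".
    private static String undouble(String root) {
        int n = root.length();
        if (n >= 2 && root.charAt(n - 1) == root.charAt(n - 2)
                && "aeiou".indexOf(root.charAt(n - 1)) < 0) {
            return root.substring(0, n - 1);
        }
        return root;
    }

    // Splits text on non-letters and groups the tokens by their stem,
    // keeping the order in which each stem first appears.
    static Map<String, List<String>> groupByStem(String text) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            groups.computeIfAbsent(stem(token), k -> new ArrayList<>()).add(token);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample sentence invented for this sketch.
        String text = "Fishing boats fished here; a fisher likes fish and cats.";
        groupByStem(text).forEach(
                (root, words) -> System.out.println(root + " -> " + words));
    }
}
```

Grouping by stem is only one possible reading of "phrases". Rules this simple will mis-stem many words, so for anything beyond a homework-sized experiment the library stemmers mentioned above are likely the better choice.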
