Extracting only meaningful text from webpages


Question

I am getting a list of URLs and scraping them using nltk. My end result is a list of all the words on the webpage. The trouble is that I am only looking for keywords and phrases that are not the usual English "sugar" words such as "as, and, like, to, am, for", etc. I know I could build a file of all the common English words and simply remove them from my scraped token list, but is there a built-in feature in some library that does this automatically?
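For reference, the manual approach described above might look something like this minimal sketch; the word-list filename is a hypothetical placeholder, not something given in the question:

# A minimal sketch of the manual approach: load a (hypothetical) file of
# common English words, one per line, and filter them out of the tokens.
with open('common_english_words.txt') as f:  # hypothetical filename
    common = {line.strip().lower() for line in f}

tokens = ['The', 'page', 'is', 'about', 'web', 'scraping']  # example scraped tokens
keywords = [t for t in tokens if t.lower() not in common]
print(keywords)  # ['page', 'web', 'scraping'] if the file lists 'the', 'is', 'about'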

I am essentially looking for useful words on a page that are not fluff and that give some context to what the page is about, almost like the tags on Stack Overflow or the tags Google uses for SEO.

Answer

I think what you are looking for is stopwords.words from nltk.corpus:

>>> from nltk.corpus import stopwords
>>> sw = set(stopwords.words('english'))
>>> sentence = "a long sentence that contains a for instance"
>>> [w for w in sentence.split() if w not in sw]
['long', 'sentence', 'contains', 'instance']
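Putting the answer together with the scraping step from the question, a minimal end-to-end sketch might look like the following. The use of urllib and BeautifulSoup to fetch and strip the HTML is my assumption (the question does not say how the pages are scraped), and the stopwords and punkt resources need a one-time download:

import urllib.request

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword lists
nltk.download('punkt')      # one-time download of the tokenizer models

def meaningful_words(url):
    # Fetch the page, strip the markup, tokenize, and drop English stopwords.
    html = urllib.request.urlopen(url).read()
    text = BeautifulSoup(html, 'html.parser').get_text()
    sw = set(stopwords.words('english'))
    return [t for t in nltk.word_tokenize(text.lower())
            if t.isalpha() and t not in sw]

print(meaningful_words('https://example.com'))  # hypothetical URL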

Searching for "stopword" gives possible duplicates: Stopword removal with NLTK and How to remove stop words using nltk or python; see the answers to those questions. And consider Effects of Stemming on the term frequency? too.
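On that last point: stemming matters once you start counting term frequencies, because it collapses inflected forms onto a single stem. A quick illustration with NLTK's PorterStemmer (the example words are mine, not from the original answer):

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> [stemmer.stem(w) for w in ['connect', 'connected', 'connecting', 'connection']]
['connect', 'connect', 'connect', 'connect']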
