Is there a way to use readability (text extraction algorithm) and a custom algorithm in Python to extract links from text?
Problem description
Is there a way to use readability (text extraction algorithm) and a custom algorithm in python to extract links from text?
I'd like to figure out a way of extracting links that are in the body of text.
1.) I use readability in python https://github.com/gfxmonk/python-readability
2.) I'd like to somehow compare the extracted text to the original html text in order to extract links in the actual body of an article.
Recommended answer
Well, it looks like it returns a BeautifulSoup tree. So you should be able to do something like:
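# page is assumed to be the readability document already built from the original HTML
article = page.summary()     # Extract the article body using readability
links = article.findAll("a") # List of all <a> tags inside the article body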
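If summary() turns out to return an HTML string rather than a BeautifulSoup tree (this seems to depend on which version or fork of the library is installed, so treat it as an assumption to check), the same idea can be sketched with the standard library alone. LinkCollector, extract_links and the sample markup below are hypothetical names and data used only for illustration; they are not part of readability.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; anchors without href are skipped
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(article_html):
    """Return the href values found in an HTML fragment, in document order."""
    collector = LinkCollector()
    collector.feed(article_html)
    collector.close()
    return collector.links


# Hypothetical article markup standing in for the output of page.summary().
sample = (
    '<div><p>See <a href="https://example.com/a">this</a> and '
    '<a href="/b">that</a>.</p><a name="top">anchor</a></div>'
)
print(extract_links(sample))
```

Because only the extracted article is parsed, links in navigation bars, sidebars and footers of the original page never reach the collector, which is the effect the question asks for.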