Web scraping - how to identify main content on a webpage

Views: 150 Published: 2020/11/24 3:13:10 Tags: python web-scraping html-parsing webpage

Question

Given a news article webpage (from any major news source, such as the Times or Bloomberg), I want to identify the main article content on that page and discard the other miscellaneous elements such as ads, menus, sidebars, and user comments.

What's a generic way of doing this that will work on most major news sites?

What are some good tools or libraries for data mining? (preferably Python-based)
