How to scrape websites such as Hype Machine?

Question

I'm curious about website scraping (how it's done, and so on). Specifically, I'd like to write a script that scrapes the site Hype Machine. I'm a fourth-year Software Engineering undergraduate, but we don't really cover any web programming, so my understanding of JavaScript, RESTful APIs, and all things web is pretty limited; we mainly focus on theory and client-side applications. Any help or directions would be greatly appreciated.

Recommended answer

The first thing to look for is whether the site already offers some sort of structured data, or whether you need to parse the HTML yourself. It looks like there is an RSS feed of the latest songs. If that's what you're looking for, it would be good to start there.

You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. The Python documentation covers both how to download a URL and how to parse XML.
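As a rough illustration of the download-and-parse step, here is a minimal sketch that uses only the Python standard library. It assumes the feed is plain RSS 2.0 with `<item>` elements containing `<title>` and `<link>`; the function names, the User-Agent string, and the feed URL in the usage note are placeholders, not anything taken from Hype Machine itself.

```python
import urllib.request
import xml.etree.ElementTree as ET


def download(url, timeout=10):
    """Fetch a URL and return the raw response body as bytes."""
    # The User-Agent value is a made-up placeholder for this example.
    request = urllib.request.Request(
        url, headers={"User-Agent": "feed-reader-example/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def parse_rss(xml_bytes):
    """Return a list of (title, link) pairs, one per <item> in the feed.

    Assumes an RSS 2.0 layout; a missing <title> or <link> becomes "".
    """
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title.strip(), link.strip()))
    return items
```

You would then call something like `parse_rss(download(feed_url))`, where `feed_url` is whatever address the site publishes for its RSS feed. If the site offers no feed, the same `download` function can fetch an HTML page, but parsing HTML needs a different parser than `xml.etree.ElementTree`.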

Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you run it constantly so that you get new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
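One way to keep the request rate down is to pause for a fixed interval between runs. The sketch below is only an illustration: the `poll` helper, its parameters, and the 15-minute interval in the usage note are assumptions chosen for the example, not values recommended by the site.

```python
import time


def poll(fetch, interval_seconds, max_runs, sleep=time.sleep):
    """Call fetch() max_runs times, pausing interval_seconds between calls.

    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for run in range(max_runs):
        results.append(fetch())
        # No need to wait after the final run.
        if run < max_runs - 1:
            sleep(interval_seconds)
    return results
```

For example, `poll(check_feed, interval_seconds=15 * 60, max_runs=4)` would call a hypothetical `check_feed` function four times over 45 minutes. Scheduling the script with a tool such as cron achieves the same effect without keeping a process running.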
