Going from Ruby to Python: Crawlers
Problem Description
I've started learning Python over the past couple of days. I want to know the equivalent way of writing crawlers in Python.
So in Ruby I use:
nokogiri
for crawling HTML and getting content through CSS tags
Net::HTTP and Net::HTTP::Get.new(uri.request_uri).body
for getting JSON data from a URL
What are the equivalents of these in Python?
Recommended Answer
Well, mainly you have to separate the 'scraper'/crawler, the Python lib/program/function that will download the files/data from the web server, from the parser that will read and interpret that data. In my case I had to scrape and get some government info that is 'open' but not download/data friendly. For this project I used Scrapy [1].
Mainly I set the 'start_urls', which are the URLs my robot will crawl/get, and then I use a 'parse' function to retrieve/parse that data.
For parsing/retrieving, you are going to need an HTML extractor such as lxml, since 90% of your data will be HTML.
Now, to your question:
For data scraping:
- Scrapy
- Requests [2]
- Urllib [3]
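As a rough equivalent of Ruby's `Net::HTTP::Get.new(uri.request_uri).body` followed by `JSON.parse`, the standard library alone is enough; the URL below is a placeholder, and with Requests the same thing collapses to `requests.get(url).json()`.

```python
import json
import urllib.request


def fetch_json(url):
    # GET the URL and decode the response body as JSON,
    # roughly what Net::HTTP + JSON.parse does in Ruby.
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


# The decoding step works the same on any JSON text:
data = json.loads('{"name": "scrapy", "stars": 5}')
print(data["name"])  # → scrapy
```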
For parsing data:
- Scrapy/lxml, or Scrapy plus another parser
- lxml [4]
- BeautifulSoup [5]
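For the Nokogiri-style "get content through CSS tags" part, BeautifulSoup exposes CSS selectors through `select()`; a small sketch on inline HTML (lxml's `cssselect` offers the same idea):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="post"><h2>First</h2></div>
  <div class="post"><h2>Second</h2></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selector, like Nokogiri's doc.css('.post h2') in Ruby:
titles = [h2.get_text() for h2 in soup.select(".post h2")]
print(titles)  # → ['First', 'Second']
```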
And please remember, 'crawling' and scraping are not only for the web; they apply to emails too. You can check another question about that here [6].
[1] - http://scrapy.org/
[2] - http://docs.python-requests.org/en/latest/
[3] - http://docs.python.org/library/urllib.html
[4] - http://lxml.de/
[5] - http://www.crummy.com/software/BeautifulSoup/
[6] - Python: reading my Outlook email mailbox and parsing messages