Crawling websites using C#
Question
Please guide me on how to crawl websites using C# and then store the data in a SQL Server database.
I need to crawl Arabic-language websites so that I can start applying some data mining techniques.
Thanks
Recommended answer
Have you looked at the several open source crawlers written in C#, which can easily be found with Google? No? Well, you should...
You could start here:
https://github.com/sjdirect/abot/
A Simple Crawler Using C# Sockets
http://ericsowell.com/blog/2007/8/14/how-to-write-a-web-crawler-in-csharp
and so on...
There are some open source crawlers written in C# on the net, for example:
A Simple Crawler Using C# Sockets
https://abot.codeplex.com/
https://code.google.com/p/abot/
But if you want to learn by writing the code yourself, you should:
1- study how web requests and responses work
2- get the HTML source of the first URL
3- search the HTML for tags that carry links, for example a with href
4- parse them, select the ones you need, and save them in the DB
Finally, I suggest studying sample code after you have written your own.
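As a rough illustration of steps 2 to 4, here is a minimal sketch that uses only the .NET base class library. The names MiniCrawler, FetchHtmlAsync and ExtractLinks are made up for this example and do not come from any of the projects linked above. Finding links with a regular expression is a simplification that can miss unusual markup, and saving to SQL Server is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper class for illustration only.
public static class MiniCrawler
{
    private static readonly HttpClient Http = new HttpClient();

    // Captures the value of href="..." or href='...' inside an <a ...> tag.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Step 2: download the HTML source of one page.
    public static Task<string> FetchHtmlAsync(string url)
    {
        return Http.GetStringAsync(url);
    }

    // Steps 3-4: find <a href> values, resolve them against the page URL,
    // keep only http/https links, and drop duplicates and #fragments.
    public static List<Uri> ExtractLinks(string html, Uri pageUrl)
    {
        var links = new List<Uri>();
        var seen = new HashSet<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            string raw = WebUtility.HtmlDecode(m.Groups[1].Value.Trim());
            if (!Uri.TryCreate(pageUrl, raw, out Uri absolute))
                continue;
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            // Scheme, host, path and query, without the fragment.
            string key = absolute.GetLeftPart(UriPartial.Query);
            if (seen.Add(key))
                links.Add(new Uri(key));
        }
        return links;
    }
}
```

A crawl loop would call FetchHtmlAsync on a start URL, pass the result to ExtractLinks, and repeat for each returned link that has not been visited yet, ideally with a delay between requests and a limit on the number of pages.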
But I need to extract only some parts of the page, like:
news title
news image
news details
news date & time
not everything on the page.
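One possible way to pull out only those fields, sketched under a loud assumption: the target site publishes Open Graph meta tags (og:title, og:image, og:description, article:published_time), which many news sites do but not all. NewsItem, NewsExtractor and ReadMeta are hypothetical names for this example. og:description is usually only a summary, so the full article body would still need site-specific parsing, and a real HTML parser would be more robust than the regular expression used here.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical container for the fields asked about above.
public sealed class NewsItem
{
    public string Title { get; set; }
    public string ImageUrl { get; set; }
    public string Details { get; set; }      // summary only, from og:description
    public string PublishedAt { get; set; }  // raw text as published by the site
}

public static class NewsExtractor
{
    // Returns the content of <meta property="NAME" content="...">, or null.
    // Assumes the property attribute appears before content in the tag.
    public static string ReadMeta(string html, string property)
    {
        string pattern =
            @"<meta\b[^>]*?\bproperty\s*=\s*[""']" + Regex.Escape(property) +
            @"[""'][^>]*?\bcontent\s*=\s*([""'])(.*?)\1";
        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[2].Value) : null;
    }

    // Fills a NewsItem from the page's Open Graph tags; missing tags stay null.
    public static NewsItem Extract(string html)
    {
        return new NewsItem
        {
            Title = ReadMeta(html, "og:title"),
            ImageUrl = ReadMeta(html, "og:image"),
            Details = ReadMeta(html, "og:description"),
            PublishedAt = ReadMeta(html, "article:published_time")
        };
    }
}
```

Any field that comes back null means the page does not expose that tag, in which case the value has to be located in the page body using that site's own markup.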