Crawling websites using C#


Question

Please guide me on how to crawl websites using C# and then store the data in a SQL Server database.

I need to crawl websites in Arabic so that I can start applying some data mining techniques.

Thanks

Recommended answer

Have you looked at the several open source crawlers written in C#, which can easily be found with Google? No? Well, you should...

You could start here:

https://github.com/sjdirect/abot/
A Simple Crawler Using C# Sockets
http://ericsowell.com/blog/2007/8/14/how-to-write-a-web-crawler-in-csharp

and so on...

There are some open source crawlers written in C# on the net, for example:

A Simple Crawler Using C# Sockets
https://abot.codeplex.com/
https://code.google.com/p/abot/

But if you want to learn by coding one yourself, you should:

1- Study web requests and responses
2- Get the HTML source of the first URL
3- Search the HTML for tags that carry links, for example a with href
4- Parse the links, select the ones you want, and save them in the DB

Finally, I suggest studying sample code after you have written your own.
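
Steps 2 to 4 above can be sketched in C# roughly as follows. This is only a minimal illustration, not code from any of the linked projects: the names MiniCrawler, FetchHtmlAsync, ExtractLinks and CrawlAsync, the same-host restriction and the maxPages limit are all assumptions made for the example. Links are found with a regular expression for brevity, and the database step is left as a comment.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper class for illustration only.
public static class MiniCrawler
{
    private static readonly HttpClient Http = new HttpClient();

    // Captures the href value of <a> tags. A regex is enough for a sketch;
    // a real HTML parser copes better with malformed markup.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Step 2: download the HTML source of a URL.
    public static Task<string> FetchHtmlAsync(Uri url)
    {
        return Http.GetStringAsync(url);
    }

    // Step 3: find links and resolve them against the page they came from.
    public static List<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<Uri>();
        var seen = new HashSet<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            string raw = WebUtility.HtmlDecode(m.Groups[1].Value.Trim());
            if (!Uri.TryCreate(baseUri, raw, out Uri absolute))
                continue;
            // Skip mailto:, javascript: and other non-web schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            // Drop the #fragment so the same page is not listed twice.
            var clean = new Uri(absolute.GetLeftPart(UriPartial.Query));
            if (seen.Add(clean.AbsoluteUri))
                links.Add(clean);
        }
        return links;
    }

    // Steps 2-4: breadth-first crawl, limited to the start host and maxPages pages.
    public static async Task<List<Uri>> CrawlAsync(Uri start, int maxPages)
    {
        var visited = new List<Uri>();
        var queued = new HashSet<string> { start.AbsoluteUri };
        var frontier = new Queue<Uri>();
        frontier.Enqueue(start);

        while (frontier.Count > 0 && visited.Count < maxPages)
        {
            Uri current = frontier.Dequeue();
            string html;
            try
            {
                html = await FetchHtmlAsync(current);
            }
            catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
            {
                continue; // unreachable or timed-out page: skip it
            }
            visited.Add(current);
            // Step 4 would go here: parse the page and save what you need in the DB.
            foreach (Uri link in ExtractLinks(html, current))
            {
                if (link.Host == start.Host && queued.Add(link.AbsoluteUri))
                    frontier.Enqueue(link);
            }
        }
        return visited;
    }
}
```

A call such as `await MiniCrawler.CrawlAsync(new Uri("http://example.com/"), 50)` (the URL is a placeholder) would return the pages that were fetched. A real crawler should also respect robots.txt and pause between requests, which this sketch leaves out.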


But I need to extract only some parts of each page, such as:

news title
news image
news details
news date & time

not the whole page.
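
One possible way to pull out only those fields is sketched below. The thread itself gives no answer to this follow-up, so everything here is an assumption: the sketch supposes the news pages expose Open Graph meta tags (og:title, og:image, og:description, article:published_time), which many news sites do but not all. NewsItem, NewsExtractor and MetaContent are hypothetical names. For pages without such tags, the same idea applies with an HTML parser and selectors specific to the site.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical container for the four fields named in the question.
public sealed class NewsItem
{
    public string Title { get; set; }
    public string ImageUrl { get; set; }
    public string Details { get; set; }
    public DateTimeOffset? Published { get; set; }
}

public static class NewsExtractor
{
    // Returns the content of <meta property="..." content="...">, or null if absent.
    // Assumes the property attribute comes before content, as it usually does.
    public static string MetaContent(string html, string property)
    {
        string pattern =
            @"<meta\b[^>]*?\sproperty\s*=\s*[""']" + Regex.Escape(property) +
            @"[""'][^>]*?\scontent\s*=\s*([""'])(.*?)\1";
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[2].Value) : null;
    }

    // Extracts only the four fields and ignores the rest of the page.
    public static NewsItem Extract(string html)
    {
        var item = new NewsItem
        {
            Title = MetaContent(html, "og:title"),
            ImageUrl = MetaContent(html, "og:image"),
            Details = MetaContent(html, "og:description")
        };
        string rawDate = MetaContent(html, "article:published_time");
        if (rawDate != null &&
            DateTimeOffset.TryParse(rawDate, CultureInfo.InvariantCulture,
                                    DateTimeStyles.None, out DateTimeOffset parsed))
        {
            item.Published = parsed;
        }
        return item;
    }
}
```

Each NewsItem could then be written to a SQL Server table with one column per field. Using an NVARCHAR column type for the text fields keeps Arabic text intact.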

