Pulling data from a webpage, parsing it for specific pieces, and displaying it

Problem description

I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer on this one.

I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, list a game they own and want to trade, and accept trades from others or request a trade.

We have the site functioning long ahead of schedule, so we're trying to add more to it. One thing I want to do myself is to link the listed games to Metacritic.

Here's what I need to do. Using ASP.NET and C# in Visual Studio 2012, I need to get the correct game page on Metacritic, pull its data, parse it for specific parts, and then display the data on our page.

Essentially when you choose a game you want to trade for we want a small div to display with the game's information and rating. I'm wanting to do it this way to learn more and get something out of this project I didn't have to start with.

I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I'm still trying to figure out if I need to try and write something to automatically search for the game's title and find the page that way or if I can find some way to go straight to the game's page. And once I've gotten the data, I don't know how to pull the specific information I need from it.

One of the things that doesn't make this easy is that I'm learning C++ alongside C# and ASP.NET, so I keep getting my wires crossed. If someone could point me in the right direction it would be a big help. Thanks

Solution

This small example uses HtmlAgilityPack with XPath selectors to get to the desired elements.

using HtmlAgilityPack;

protected void Page_Load(object sender, EventArgs e)
{
    string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
    var web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);

    // XPaths copied from Chrome; the inner double quotes are escaped for C#.
    // SelectNodes returns null when nothing matches, so [0] throws if the page layout changes.
    string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
    string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
    string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}

An easy way to obtain the XPath for a given element is to use your web browser's Developer Tools (I use Chrome):

  • Open the Developer Tools (F12 or Ctrl + Shift + C on Windows, or Command + Shift + C on Mac).
  • Select the element in the page that you want the XPath for.
  • Right click the element in the "Elements" tab.
  • Click on "Copy as XPath".

You can paste it into C# exactly as copied (as shown in my code), but make sure to escape the double quotes.
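As a rough illustration of the quoting point, here are three ways to write the same copied XPath in a C# string literal. The class name `XPathQuoting` and the shortened path are made up for this sketch and are not part of the original answer.

```csharp
using System;

public static class XPathQuoting
{
    // The XPath as the browser copies it: //*[@id="main"]/div[3]

    // Option 1: escape each inner double quote with a backslash.
    public const string Escaped = "//*[@id=\"main\"]/div[3]";

    // Option 2: a verbatim string, where an inner double quote is written twice.
    public const string Verbatim = @"//*[@id=""main""]/div[3]";

    // Option 3: XPath also accepts single-quoted literals, so nothing needs escaping.
    public const string SingleQuoted = "//*[@id='main']/div[3]";

    public static void Main()
    {
        // Options 1 and 2 spell out exactly the same characters.
        Console.WriteLine(Escaped == Verbatim);
        Console.WriteLine(SingleQuoted);
    }
}
```

Which form to use is mostly a matter of taste; the single-quoted form tends to be the easiest to read when the XPath contains several attribute tests.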

Make sure you add some error handling: scraping breaks whenever the site changes the HTML structure of the page, because the XPath expressions then no longer match anything.
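One possible way to guard against a changed layout is to wrap the lookup in a small helper that returns a fallback value when the XPath matches nothing. This is only a sketch: `ScrapeHelpers` and `TextOrDefault` are hypothetical names, and the HTML passed to `LoadHtml` is stand-in markup for illustration, not Metacritic's real page.

```csharp
using System;
using HtmlAgilityPack;

public static class ScrapeHelpers
{
    // Hypothetical helper: returns the trimmed text of the first node that
    // matches the XPath, or the fallback when nothing matches.
    public static string TextOrDefault(HtmlDocument doc, string xpath, string fallback)
    {
        // SelectSingleNode returns null when no node matches the expression.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? fallback : node.InnerText.Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Stand-in markup so the sketch runs without a network call.
        doc.LoadHtml("<div id='main'><span class='score'> 86 </span></div>");

        // The first XPath matches; the second simulates a layout change.
        Console.WriteLine(TextOrDefault(doc, "//*[@id='main']/span[@class='score']", "n/a"));
        Console.WriteLine(TextOrDefault(doc, "//*[@id='main']/span[@class='missing']", "n/a"));
    }
}
```

In a real page handler the document would come from `HtmlWeb.Load`, which may also throw on network failures, so wrapping that call in a try/catch is worth considering as well.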

Edit

Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:

https://www.nuget.org/packages/HtmlAgilityPack/