Scraping a URL and detecting its text parts

Question

I'd like to know whether it is possible to scrape a specific URL and detect all the text parts in it.

To be more specific, I'd like to scrape an article and find the title, the subtitle (if it exists), and the content. I know that most articles also contain noise, such as suggested titles of other articles, but I'll figure out how to handle that later. For now, I'm just asking how to scrape a URL. From reading other Stack Overflow questions, I found out that I can use Ajax and jQuery to make it happen (like the little piece of code below, which I know is meaningless), but I'm not sure all the steps are clear in my mind.

$.ajax({
    url: "/thePageToScrape.html",
    dataType: 'text',
});

Recommended answer

That's a large subject, and better results will be achieved by doing it on the server side, but here is a quick example:

Let's say that we want this page: var url = "http://someurl.com/scrapme1.html"; and that its content looks like this:

<html>
    <head>
    ....
    </head>
<body>
      <h4 class='page-title'>
          I'm an article title
      </h4>
      <div class='summary'>
          ...
      </div>
      <div id="article_body">
          ...
      </div>
</body>
</html>

Now we need the title (h4.page-title), the summary (div.summary), and the article content (div#article_body).

We can load the page into a jQuery element:

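function getContent(url) {
    // Request the page passed in as `url`; the callback runs once it has loaded.
    $.get(url, function (data) {
        // Parse the markup and wrap it in a <div> so that .find() also
        // matches elements that sit directly under <body>.
        var $dom = $("<div>").append($.parseHTML(data));
        var title = $dom.find("h4.page-title");
        var summary = $dom.find("div.summary");
        var article_body = $dom.find("div#article_body");
        // Do whatever you need....
    }, "html");
}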
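
To turn a matched element such as `title` into plain text, jQuery's `.text()` can be called on it, but the result usually still carries the indentation and line breaks of the source HTML. The following is a minimal sketch of a cleanup step in plain JavaScript. The helper name `cleanText` and the sample string are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: collapse every run of whitespace into one space
// and trim the ends, so indented markup yields a single clean line.
function cleanText(raw) {
    return String(raw).replace(/\s+/g, " ").trim();
}

// Sample input mimicking what .text() might return for the title element.
var rawTitle = "\n          I'm an article title\n      ";
console.log(cleanText(rawTitle)); // I'm an article title
```

Inside the callback this could be used as `cleanText(title.text())`, and likewise for the summary and the article body.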

A few more important notes:

  1. Obviously, you need to make sure the expected data is actually returned.
  2. You can use $.post(), .load(), or $.ajax() too, and even add some POST variables.
  3. Pages may have restrictive origin policies (the browser blocks cross-origin requests unless the remote server allows them), so your requests may fail.
  4. You can use some regex patterns to dynamically detect and extract key values such as the publication date, the author name, etc.
  5. You always need good knowledge of your source so that you can use its element selectors for better and quicker extraction of the relevant content.
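
The regex idea from the notes above can be sketched as follows. This is only an illustration under stated assumptions: the function name `extractMeta`, the two patterns (an ISO-style `YYYY-MM-DD` date and a "By Name" byline), and the sample string are all hypothetical, and real pages will need patterns tuned to their own wording.

```javascript
// Hypothetical patterns; adjust them to the wording of the actual source.
var DATE_PATTERN = /\b\d{4}-\d{2}-\d{2}\b/;                        // e.g. 2015-03-27
var AUTHOR_PATTERN = /\bBy\s+([A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*)/;  // e.g. "By Jane Doe"

// Returns the first date and author found in the text, or null for each
// value that does not appear.
function extractMeta(text) {
    var dateMatch = DATE_PATTERN.exec(text);
    var authorMatch = AUTHOR_PATTERN.exec(text);
    return {
        date: dateMatch ? dateMatch[0] : null,
        author: authorMatch ? authorMatch[1] : null
    };
}

// Made-up byline text, as it might appear after extracting plain text.
var sample = "By Jane Doe | Published 2015-03-27 | 4 min read";
var meta = extractMeta(sample);
console.log(meta.date);   // 2015-03-27
console.log(meta.author); // Jane Doe
```

Returning `null` for a missing value keeps the caller able to tell "not found" apart from an empty match, which also helps with the first note about checking that the expected data came back.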

Enjoy.
