Scraping links from website using Node.js, request, and cheerio?

Views: 86 | Published: 2020/10/1 6:01:50 | Tags: javascript html node.js web-scraping cheerio

Question

I'm trying to scrape links from my school's course schedule website using Node.js, request, and cheerio. However, my code is not reaching all of the subject links.

A link to the course schedule website is here.

Here is my code:

var express = require('express');
var request = require('request');
var cheerio = require('cheerio');

var app = express();

app.get('/subjects', function(req, res) {
  var URL = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

  request(URL, function(error, response, body) {
    if(!error) {
      var $ = cheerio.load(body);

      $('.courseList_section a').each(function() {
        var text = $(this).text();
        var link = $(this).attr('href');

        console.log(text + ' --> ' + link);
      });
    }
    else {
      console.log('There was an error!');
    }
  });
});

app.listen('8080');
console.log('Magic happens on port 8080!');

My output can be found here.

As you can see from my output, some links are missing. More specifically, the links from sections 'A', 'I (Continued)', and 'R (Continued)' are missing. These are also the first sections of each column.

Each section is contained in its own div with the class name 'courseList_section', so I don't understand why '.courseList_section a' doesn't loop through all of the links. Am I missing something obvious? Any and all insight is very much appreciated.

Thanks in advance!

Recommended answer

The problem isn't your code; the problem is the site you're trying to parse. Its HTML is invalid. You're trying to parse everything inside .courseList_section, but the markup looks like this:

<span> <!-- Opening tag -->
    <div class='courseList_section'>
      <a href='index.aspx?semester=2016s&subjectID=ACC '>ACC  - Accounting/Essex CC</a>
      </span> <!-- Invalid closing tag for the first span, meaning that .courseList_section will be closed instead -->

<!-- Suddenly this link is outside the .courseList_section tag, meaning that it will be ignored by cheerio -->
<a href='index.aspx?semester=2016s&subjectID=ACCT'>ACCT - Accounting</a>
  <!-- and so on -->

Solution:

var request = require('request');
var cheerio = require('cheerio');

var URL = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

request(URL, function(error, response, body) {
  // Stop early if the request itself failed.
  if (error) { return console.error('There was an error!', error); }

  var $ = cheerio.load(body);

  // Select every anchor instead of relying on the broken .courseList_section nesting.
  $('a').each(function() {
    var text = $(this).text();
    var link = $(this).attr('href');

    // Keep only the subject links; skip anchors whose href lacks subjectID.
    if (link && /subjectID/.test(link)) {
      console.log(text + ' --> ' + link);
    }
  });
});
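
The hrefs printed by the solution are relative (for example 'index.aspx?semester=2016s&subjectID=ACCT'), so following them requires resolving each one against the page URL. The snippet below is a minimal sketch of that step using Node's built-in URL class. The helper name toAbsoluteLinks and the sample array are assumptions made for illustration; they are not part of request or cheerio.

```javascript
// Hypothetical helper (not part of request or cheerio): keep only subject links
// and resolve them against the page they were scraped from.
// The base is named BASE_URL rather than URL so the global URL class stays usable.
var BASE_URL = 'http://courseschedules.njit.edu/index.aspx?semester=2016s';

function toAbsoluteLinks(hrefs, baseUrl) {
  return hrefs
    // Drop missing hrefs and anything that is not a subject link.
    .filter(function(href) { return href && href.indexOf('subjectID') !== -1; })
    // Resolve the relative href against the base URL.
    .map(function(href) { return new URL(href.trim(), baseUrl).href; });
}

// Illustrative input: one subject link, one anchor without an href, one unrelated link.
var sample = [
  'index.aspx?semester=2016s&subjectID=ACCT',
  undefined,
  'about.aspx'
];

console.log(toAbsoluteLinks(sample, BASE_URL));
```

Inside the each callback, the same idea would apply to the value returned by $(this).attr('href').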

Next time, try looking directly at the HTML and see if it looks okay. If it looks like ****, pass it through an HTML beautifier and inspect it again. Not even the beautifier could handle this markup, which indicated that something was wrong with the tags.
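
A quick way to spot this kind of problem before writing selectors is to check whether closing tags match the most recently opened tag. The following is a rough sketch of such a check, not an HTML validator: the function name findMismatchedTags and the sample markup are assumptions for illustration, and the regular expression only handles simple tags.

```javascript
// Hypothetical helper: report closing tags that do not match the innermost open tag.
// This is a heuristic for eyeballing scraped markup, not a spec-compliant parser.
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                 'link', 'meta', 'source', 'track', 'wbr'];

function findMismatchedTags(html) {
  var stack = [];
  var problems = [];
  // Remove comments first so tags mentioned inside them are ignored.
  var stripped = html.replace(/<!--[\s\S]*?-->/g, '');
  var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  var match;

  while ((match = tagPattern.exec(stripped)) !== null) {
    var isClosing = match[1] === '/';
    var name = match[2].toLowerCase();
    var selfClosing = match[3] === '/' || VOID_TAGS.indexOf(name) !== -1;

    if (!isClosing) {
      // Void and self-closing tags never need a closing tag.
      if (!selfClosing) { stack.push(name); }
    } else if (stack.length === 0 || stack[stack.length - 1] !== name) {
      var open = stack.length ? '<' + stack[stack.length - 1] + '>' : 'nothing';
      problems.push('</' + name + '> closes while ' + open + ' is open');
    } else {
      stack.pop();
    }
  }
  return problems;
}

// Illustrative markup shaped like the broken section described in the answer.
var sampleHtml =
  "<span>" +
  "<div class='courseList_section'>" +
  "<a href='index.aspx?semester=2016s&subjectID=ACCT'>ACCT - Accounting</a>" +
  "</span>";

console.log(findMismatchedTags(sampleHtml));
```

An empty result does not prove the markup is valid, but a non-empty one is a good hint that a selector based on nesting may miss elements.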
