Parsing complex HTML tables


Question

I'm trying to parse the class schedule provided by my university in order to import the information into some kind of calendar. An example of the schedule can be seen here:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIA13-7.%20Block.html

The auto-generated HTML is, in my opinion, a mess and very hard to grasp. For example, the tables are built mainly with rowspans and colspans, and the positions of cells in the code seem partly arbitrary compared with their actual visual position in the browser.

What I have already tried:

  1. Asking the university's administration office to provide a simpler, easier-to-read file separately. Of course this wasn't possible; after all, it would mean one minute of additional effort.
  2. Researching the original tool used to generate the HTML. It is called "sked Stundenplan Software". I couldn't find any hints or tools to "reverse" the generation process.
  3. Looking for an existing solution. I found some tools (e.g. http://code.google.com/p/skd-schedule-parser/), but they do not work for my schedule. After studying their code, I concluded that they must have been designed for another or outdated version of sked.
  4. Parsing the HTML with PHP (mostly using DOMDocument). That worked sometimes, but was far too unreliable. The number of exceptions to take into account seems endless.

Right now I don't think that conventional HTML parsing will get me far, at least not within an acceptable development time. What I am looking for are other methods to fetch information from complex HTML tables, something like YQL, or maybe utilities that can normalize such tables with col-/rowspans. Because I don't have anything concrete in mind, I am mainly asking for tips or hints for another approach.

Are there other, more suitable methods to parse such tables or am I stuck with conventional HTML parsing?
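One way to normalize a table with col-/rowspans, as asked about above, is to expand every spanning cell into all the grid positions it visually covers. The following is only a minimal sketch under stated assumptions: the input shape (`rows` as arrays of `{ text, rowspan, colspan }` objects) and the function name `expandSpans` are hypothetical, and extracting those objects from the real sked HTML is left out.

```javascript
// Expand rowspans/colspans into a rectangular grid.
// Input (hypothetical shape): rows in source order, each an array of
// { text, rowspan, colspan } objects; missing spans default to 1.
// Output: grid[r][c] references the cell object that covers that position.
function expandSpans(rows) {
  const grid = [];
  rows.forEach((row, r) => {
    grid[r] = grid[r] || [];
    let c = 0;
    for (const cell of row) {
      // Skip positions already occupied by a rowspan from an earlier row.
      while (grid[r][c] !== undefined) c += 1;
      const rowspan = cell.rowspan || 1;
      const colspan = cell.colspan || 1;
      for (let dr = 0; dr < rowspan; dr += 1) {
        grid[r + dr] = grid[r + dr] || [];
        for (let dc = 0; dc < colspan; dc += 1) {
          grid[r + dr][c + dc] = cell;
        }
      }
      c += colspan;
    }
  });
  return grid;
}

// Made-up sample data: 'Lab' is the only cell in the third source row,
// but visually it sits in the middle column.
const rows = [
  [{ text: 'Time' }, { text: 'Mo', colspan: 2 }],
  [{ text: '08:00', rowspan: 2 }, { text: 'Math' }, { text: 'Physics', rowspan: 2 }],
  [{ text: 'Lab' }],
];
const grid = expandSpans(rows);
console.log(grid.map((row) => row.map((cell) => cell.text).join(' | ')).join('\n'));
```

Because every grid position holds a reference to the same cell object, a spanning cell can later be recognized by identity instead of being counted once per row or column it covers.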

As requested, here is an example of the raw code. The schedule for this week is generated from this code:
http://pastebin.com/BJduUVtU


Edit 2:
Because of some parsing discussions, I'll also add my PHP code. It's my first time with PHP, so it's not very sophisticated. It should rather give an insight into how far I've come with parsing the tables in theory. The actual work happens in the function parseSkedTable(); please concentrate on that one. Also, I would like to point out the term "double courses" appearing in the comments, which describes two different courses happening at the same time (the class is split in such cases). An example of these courses can be found in week two here:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIB14-4.%20Block.html

The corresponding HTML code of that week can be accessed here:
http://pastebin.com/gLTWz5KU

And now the PHP code (I had a hard time translating the comments, since I already struggled to express them in my first language; I hope they are still helpful):
http://pastebin.com/Nzi8m2v8

Update

So far, there have been some solutions to my parsing problem, each of them using JavaScript. Since JavaScript (being especially powerful here because it can use browser-rendered data) seems to be the only efficient way to retrieve reliable information from the HTML, I am now looking for a way to run some kind of headless browser or rendering engine on my free server at x10hosting.com. Sadly, I can neither install software other than what Softaculous provides nor use PHP's exec() function.
Any ideas would be appreciated!

For the sake of completeness, I'll post both solutions that exist so far:

jQuery parser:

(function ($) { $(document).ready(function() {

    var _pe = window.pe || {
        fn : {}
    };

    var tblNumber = 0; // An incremental number to identify the schedule item with the table

    // For each table
    $('table').each(function () {

        $('#output').append('Parsing the table number: ' + tblNumber + '<br>');
        // console.log('Parsing the table number: ' + tblNumber);
        tblNumber += 1;

        var currentTable = this;


        // Parse the complex table
        _pe.fn.parsertable.parse($(currentTable));

        // Retrieve the parsed data
        var parsedData = $(currentTable).data().tblparser;

        //
        // Information about the column structure, nice that is consistent
        //

        // Day: Cell index position (0 based)
        // Mo: 3
        // Di: 7
        // Mi: 11
        // Do: 15
        // Fr: 19
        // Sa: 23

        // Title Location at Row index position "0"

        // "i" represent the middle column position
        for (var i = 3; i < 24; i += 4) {

            var currentDay;

            // Get the day
            currentDay = $(parsedData.row[0].cell[i].elem).text();

            $('#output').append('  Day: ' + currentDay + '<br>');
            // console.log('Day: ' + currentDay);

            // Get all the events for that day, excluding the first row and the last row
            for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) {

                // First column 
                if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) {

                    // Get the content of that cell and remove ending space
                    var event = $(parsedData.col[i - 1].cell[j].elem).text().trim();

                    if (event.length > 0) {
                        $('#output').append('  + Event: ' + event + '<br>');
                        // console.log('Event: ' + event);
                    }
                }

                // Second Column
                if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid &&
                    parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) {

                    // Get the content of that cell and remove ending space
                    var event = $(parsedData.col[i].cell[j].elem).text().trim();

                    if (event.length > 0) {
                        $('#output').append('  + Event: ' + event + '<br>');
                        // console.log('Event: ' + event);
                    }
                }

                // Third Column
                if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid &&
                    parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) {

                    // Get the content of that cell and remove ending space
                    var event = $(parsedData.col[i + 1].cell[j].elem).text().trim();

                    if (event.length > 0) {
                        $('#output').append('  + Event: ' + event + '<br>');
                        // console.log('Event: ' + event);
                    }
                }
            } 

        }

    });


});

}(jQuery));
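The parser above walks down each column and reports a cell only when its uid differs from that of the cell in the row above, so a cell stretched by a rowspan is reported once. A minimal standalone sketch of that idea follows. It assumes a hypothetical, already expanded grid in which every position holds a reference to the covering cell; the `{ uid, text }` shape and the name `eventsInColumn` are illustrative and are not the API of the table parser used above.

```javascript
// grid[r][c] holds the cell object covering that position (hypothetical
// shape: { uid, text }); a cell with a rowspan appears in several rows.
function eventsInColumn(grid, col, firstRow = 1) {
  const events = [];
  for (let r = firstRow; r < grid.length; r += 1) {
    const cell = grid[r][col];
    if (!cell) continue;
    const above = r > 0 ? grid[r - 1][col] : undefined;
    // A new event starts where the uid changes compared with the row above.
    const startsHere = !above || above.uid !== cell.uid;
    const text = cell.text.trim();
    if (startsHere && text.length > 0) events.push(text);
  }
  return events;
}

// Made-up sample column: a header, a two-row event, an empty slot, an event.
const header = { uid: 0, text: 'Mo' };
const math = { uid: 1, text: 'Math ' };
const empty = { uid: 2, text: '' };
const lab = { uid: 3, text: 'Lab' };
const sampleGrid = [[header], [math], [math], [empty], [lab]];
const found = eventsInColumn(sampleGrid, 0);
console.log(found);
```

Starting at `firstRow = 1` skips the header row, mirroring how the loop above starts at `j = 1`.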

JS parser using positional information (written by me, implementing rambo coder's idea)
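The code of that parser is not reproduced here. As a rough illustration of what using positional information can mean, and not the parser referred to above, the sketch below assigns each event cell to the day whose header spans the event's horizontal center. The rectangles stand in for what `getBoundingClientRect()` returns in a browser; the function name `assignToDays` and all sample data are assumptions.

```javascript
// Rectangles mimic the left/right values of getBoundingClientRect().
// All names and numbers below are made up for illustration.
function assignToDays(dayHeaders, eventCells) {
  const byDay = new Map(dayHeaders.map((d) => [d.name, []]));
  for (const ev of eventCells) {
    const center = (ev.rect.left + ev.rect.right) / 2;
    // The event belongs to the day whose header contains its center.
    const day = dayHeaders.find(
      (d) => center >= d.rect.left && center < d.rect.right
    );
    if (day) byDay.get(day.name).push(ev.text);
  }
  return byDay;
}

const dayHeaders = [
  { name: 'Mo', rect: { left: 50, right: 150 } },
  { name: 'Di', rect: { left: 150, right: 250 } },
];
// 'Group A' and 'Group B' model a "double course": two cells side by side
// under the same day.
const eventCells = [
  { text: 'Math', rect: { left: 50, right: 150 } },
  { text: 'Group A', rect: { left: 150, right: 200 } },
  { text: 'Group B', rect: { left: 200, right: 250 } },
];
const schedule = assignToDays(dayHeaders, eventCells);
console.log(Object.fromEntries(schedule));
```

Working from rendered positions sidesteps the span bookkeeping entirely, which is why it needs a browser or rendering engine to produce the rectangles in the first place.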

Recommended answer


I study at the same university, and a few weeks ago I faced the same problem of parsing this timetable and converting it to an ICS file. In the end I found my own solution and generalized the code, so that students from other universities who use the Sked software and have much more complex timetables can import their timetables too.
I also created a website where students can sign up and configure the URLs of the timetables they want to subscribe to. A cron job runs in the background to ensure that the subscribed calendars are always up to date. You can find the result of the project on my website:
http://calendar.pineappledeveloper.com/
(it is only available in German).
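The answer does not show its code. For readers who want to see what the final step, writing an ICS file, can look like, here is a minimal sketch that serializes already parsed events into iCalendar text. The event shape (`uid`, `title`, `start`, `end`), the function names, and the `PRODID` value are assumptions, and folding of long lines is not handled.

```javascript
// Format a Date as an iCalendar UTC timestamp, e.g. 20140310T080000Z.
function toIcsDate(date) {
  return date.toISOString().replace(/[-:]/g, '').replace(/\.\d{3}Z$/, 'Z');
}

// Escape characters that have special meaning in iCalendar TEXT values.
function escapeText(text) {
  return text
    .replace(/\\/g, '\\\\')
    .replace(/;/g, '\\;')
    .replace(/,/g, '\\,')
    .replace(/\r?\n/g, '\\n');
}

// events: hypothetical array of { uid, title, start, end } with Date objects.
function toIcs(events, now = new Date()) {
  const lines = [
    'BEGIN:VCALENDAR',
    'VERSION:2.0',
    'PRODID:-//example//schedule-sketch//EN', // placeholder identifier
  ];
  for (const ev of events) {
    lines.push(
      'BEGIN:VEVENT',
      `UID:${ev.uid}`,
      `DTSTAMP:${toIcsDate(now)}`,
      `DTSTART:${toIcsDate(ev.start)}`,
      `DTEND:${toIcsDate(ev.end)}`,
      `SUMMARY:${escapeText(ev.title)}`,
      'END:VEVENT'
    );
  }
  lines.push('END:VCALENDAR');
  // iCalendar content lines are terminated by CRLF.
  return lines.join('\r\n') + '\r\n';
}

// Made-up sample event.
const ics = toIcs(
  [{
    uid: 'course-1@example.invalid',
    title: 'Math, Group A',
    start: new Date(Date.UTC(2014, 2, 10, 8, 0, 0)),
    end: new Date(Date.UTC(2014, 2, 10, 9, 30, 0)),
  }],
  new Date(Date.UTC(2014, 0, 1, 0, 0, 0))
);
console.log(ics);
```

Timestamps are written in UTC for simplicity; a real timetable would need the local times converted accordingly before being passed in.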
