Parsing HTML source code using AppleScript


Problem description

I'm trying to parse an HTML file which I have converted to a TXT file inside of Automator.

I previously downloaded the HTML file from a website using Automator, and I am now struggling to parse the source code.

Ideally, I want to extract only the information in the table, and I need to repeat this for 1800 different HTML files.

Here is an example of the source code:

</head>
<body>
<div id="header">
    <div class="wrapper">
        <span class="access">
        <div id="fb-root"></div>


    <span class="access">
     Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a>       Logged in as Edward&nbsp;&nbsp; | &nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>

    </span>
                                    </span>
    </div><!-- /wrapper -->
</div><!-- /header -->

<div id="masthead">
    <div class="wrapper">   
        <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
        <div id="navigation">
            <ul>
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li>    <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>               
        </div><!-- /navigation -->

    </div><!-- /wrapper -->     
</div><!-- /masthead -->


<div id="content">
    <div class="wrapper">
        <div id="main-content">

 <!-- per Project stuff -->
    <span class="section">
                <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
                <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
                                    <ul class="gbutton-group right">
                    <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
                    <li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752"  id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
                </ul>

                <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
                <span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
                <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
                </div>
                                    <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>

            </span>

            <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
                                                        <tr>
                    <th>Role</th>
                    <td>
                    <p>Other</p>                            </td>
                </tr>
                <tr>  
                    <th>Organisation Type</th>
                    <td>
                    <p>Asset Manager</p>                        </td>
                </tr>
                <tr>
                    <th>Email</th>
                    <td><a href="mailto:cs@lumixcapital.com" title="cs@lumixcapital.com" >cs@lumixcapital.com</a></td>
                </tr>
                <tr>
                    <th>Website</th>
                    <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
                </tr>
                <tr>
                    <th>Phone</th>
                    <td>41 78 616 7334</td>
                </tr>
                <tr>
                    <th>Fax</th>
                    <td></td> 
                </tr>
                <tr>
                    <th>Mailing Address</th>
                    <td>Birrenstrasse 30</td>
                </tr>
                <tr>
                    <th>City</th>
                    <td>Schindellegi</td>
                </tr>
                <tr>
                    <th>State</th>
                    <td>CH</td>
                </tr>
                <tr>
                    <th>Country</th>
                    <td>Switzerland</td>
                </tr>
                <tr>
                    <th class="lastrow" >Zip/ Postal Code</th>
                    <td class="lastrow" >8834</td>
                </tr>
        </table>
                </div><!-- /main-content -->
                    <div id="sidebar"  >
                    </div>

            <div id="similar_sidebar" class="similar_refine" >



            </div>
                            </div><!-- /wrapper -->
</div><!-- /content -->

<div id="footer">

</div>

My AppleScript attempt uses text item delimiters to extract the table:

set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL
to extractBetween(SearchText, startText, endText)
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to startText
set endItems to text of text item -1 of SearchText
set AppleScript's text item delimiters to endText
set beginningToEnd to text of text item 1 of endItems
set AppleScript's text item delimiters to tid
return beginningToEnd
end extractBetween

How can I parse the table from the HTML file?

Solution

You're really close. The problem is your startText variable: the literal text <table> never appears in the HTML, because the opening tag carries attributes, so the delimiter is never found. The line that starts the table is actually...

<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
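A minimal sketch of why the original call fails, using a shortened, made-up sampleHTML string (an assumption, not the real page). When a delimiter never occurs, AppleScript returns the whole text as a single text item, so text item -1 is the entire document:

```applescript
-- Hypothetical one-line sample standing in for the real page source.
set sampleHTML to "<div><table width=\"100%\" id=\"profile-table\"><tr><td>8834</td></tr></table></div>"

set tid to AppleScript's text item delimiters

-- "<table>" never occurs in the sample, so the text is not split at all.
set AppleScript's text item delimiters to "<table>"
set piecesExact to text items of sampleHTML

-- "<table" occurs once, so the text splits into two pieces.
set AppleScript's text item delimiters to "<table"
set piecesPrefix to text items of sampleHTML

set AppleScript's text item delimiters to tid
return {count of piecesExact, count of piecesPrefix} -- {1, 2}
```

With "<table>" as the start text, the original handler therefore keeps everything from the top of the file up to the closing table tag.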

So I modified your code to look for that tag in 2 steps. First...

<table

And then this separately...

>

This way we can ignore the attributes inside the table tag (width, border, etc.), which I assume vary between the files. What remains is only the contents of the table. Try this...

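set p to input
-- "<table" finds the opening tag, ">" skips its attributes, "</table>" marks the end
set ex to extractBetween(p, "<table", ">", "</table>")

to extractBetween(SearchText, startText1, startText2, endText)
    -- remember the current delimiters so they can be restored afterwards
    set tid to AppleScript's text item delimiters
    -- keep everything after the last occurrence of startText1
    set AppleScript's text item delimiters to startText1
    set endItems to text item -1 of SearchText
    -- cut that text off at the first occurrence of endText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text item 1 of endItems
    -- drop the rest of the opening tag, up to and including the first startText2
    set AppleScript's text item delimiters to startText2
    set finalText to (text items 2 thru -1 of beginningToEnd) as text
    set AppleScript's text item delimiters to tid
    return finalText
end extractBetween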
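
For the 1800 files, one possible sketch of a complete Run AppleScript action. It assumes the previous Automator action passes each page's source as one text item in input, which depends on how the workflow is built; tableList is a hypothetical name:

```applescript
on run {input, parameters}
	set tableList to {}
	repeat with pageSource in input
		-- pageSource is a reference into the list, so coerce it to plain text
		set end of tableList to extractBetween(pageSource as text, "<table", ">", "</table>")
	end repeat
	return tableList -- one extracted table per input item
end run

-- Same two-step handler as in the answer above.
to extractBetween(SearchText, startText1, startText2, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText1
	set endItems to text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text item 1 of endItems
	set AppleScript's text item delimiters to startText2
	set finalText to (text items 2 thru -1 of beginningToEnd) as text
	set AppleScript's text item delimiters to tid
	return finalText
end extractBetween
```

A page that contains no table still returns some text rather than raising an error, so it may be worth checking that the source contains "<table" before calling the handler.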
