Parsing HTML source code using AppleScript

Published: 2021/11/16 21:31:59. Tags: html, parsing, applescript, delimiter, automator


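Problem description

I am trying to parse an HTML file that has been converted to a TXT file in Automator.

I previously downloaded the HTML files from a website using Automator, and I am now struggling to parse the source code.

Preferably, I want to extract only the information in the table, and I need to repeat this for 1800 different HTML files.

Here is an example of the source code: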

</head>
<body>
<div id="header">
  <div class="wrapper">
    <span class="access">
      <div id="fb-root"></div>
      <span class="access">
        Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a>
        Logged in as Edward&nbsp;&nbsp;|&nbsp;&nbsp;<a href="javascript:void(0);" onclick='logout()' class="logout">Sign Out</a>
      </span>
    </span>
  </div>
</div>
<div id="masthead">
  <div class="wrapper">
    <a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
    <div id="navigation">
      <ul>
        <li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li>
        <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li>
        <li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li>
        <li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li>
        <li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li>
      </ul>
    </div>
  </div>
</div>
<div id="content">
  <div class="wrapper">
    <div id="main-content">
      <span class="section">
        <img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
        <h1><span id="profile-name-104947" >Christian Sieling</span></h1>
        <ul class="gbutton-group right">
          <li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">&laquo; Back </a></li>
          <li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752" id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
        </ul>
        <div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
          <span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
          <a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
        </div>
        <h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>
      </span>
      <table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
        <tr> <th>Role</th> <td> <p>Other</p> </td> </tr>
        <tr> <th>Organisation Type</th> <td> <p>Asset Manager</p> </td> </tr>
        <tr> <th>Email</th> <td><a href="mailto:cs@lumixcapital.com" title="cs@lumixcapital.com" >cs@lumixcapital.com</a></td> </tr>
        <tr> <th>Website</th> <td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td> </tr>
        <tr> <th>Phone</th> <td>41 78 616 7334</td> </tr>
        <tr> <th>Fax</th> <td></td> </tr>
        <tr> <th>Mailing Address</th> <td>Birrenstrasse 30</td> </tr>
        <tr> <th>City</th> <td>Schindellegi</td> </tr>
        <tr> <th>State</th> <td>CH</td> </tr>
        <tr> <th>Country</th> <td>Switzerland</td> </tr>
        <tr> <th class="lastrow" >Zip/ Postal Code</th> <td class="lastrow" >8834</td> </tr>
      </table>
    </div>
    <div id="sidebar" > </div>
    <div id="similar_sidebar" class="similar_refine" > </div>
  </div>
</div>
<div id="footer"> </div>

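set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL

to extractBetween(SearchText, startText, endText)
    -- remember the current delimiters so they can be restored afterwards
    set tid to AppleScript's text item delimiters
    -- keep everything after the last occurrence of startText
    set AppleScript's text item delimiters to startText
    set endItems to text of text item -1 of SearchText
    -- keep everything before the first occurrence of endText
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text of text item 1 of endItems
    set AppleScript's text item delimiters to tid
    return beginningToEnd
end extractBetween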

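I modified your code to find the opening table tag in two steps: first "<table", and then, separately, the ">" that closes that tag. This way we can ignore everything that comes with the table tag (width, border, and so on), since I assume it varies from file to file. After doing this, we get only the table's code. Try this:

set p to input
set ex to extractBetween(p, "<table", ">", "</table>")

to extractBetween(SearchText, startText1, startText2, endText)
    -- remember the current delimiters so they can be restored afterwards
    set tid to AppleScript's text item delimiters
    -- step 1: keep everything after the last "<table"
    set AppleScript's text item delimiters to startText1
    set endItems to text item -1 of SearchText
    -- cut off everything from "</table>" onwards
    set AppleScript's text item delimiters to endText
    set beginningToEnd to text item 1 of endItems
    -- step 2: drop the tag's attributes up to the first ">" and rejoin the rest
    set AppleScript's text item delimiters to startText2
    set finalText to (text items 2 thru -1 of beginningToEnd) as text
    set AppleScript's text item delimiters to tid
    return finalText
end extractBetween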
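Once the table's markup has been isolated, the same text item delimiters technique can split it into label/value pairs. The following is a minimal sketch and not part of the original answer: the handler names splitText, cellText, stripTags, and trimText are hypothetical, the inline tableHTML string is a shortened stand-in for the extracted text, and it assumes every row holds exactly one th cell followed by one td cell, as in the sample page.

```applescript
-- Sketch only: every handler name here is hypothetical.
-- tableHTML is a shortened stand-in for the text returned by extractBetween.
set tableHTML to "<tr><th>Role</th><td><p>Other</p></td></tr>"
set tableHTML to tableHTML & "<tr><th>Fax</th><td></td></tr>"
set tableHTML to tableHTML & "<tr><th class=\"lastrow\" >Zip/ Postal Code</th><td class=\"lastrow\" >8834</td></tr>"

set rowChunks to splitText(tableHTML, "<tr")
set cellPairs to {}
-- Item 1 is whatever precedes the first row, so start at item 2.
repeat with i from 2 to (count of rowChunks)
	set rowText to item i of rowChunks
	if rowText contains "<th" and rowText contains "<td" then
		set labelText to cellText(rowText, "<th", "</th>")
		set valueText to cellText(rowText, "<td", "</td>")
		set end of cellPairs to {labelText, valueText}
	end if
end repeat
return cellPairs

-- Split theText on theDelimiter, restoring the previous delimiters afterwards.
on splitText(theText, theDelimiter)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set theItems to text items of theText
	set AppleScript's text item delimiters to tid
	return theItems
end splitText

-- Return the tag-free, trimmed text of the first cell opened by openTag.
on cellText(rowText, openTag, closeTag)
	set afterOpen to item 2 of splitText(rowText, openTag)
	set rawCell to item 1 of splitText(afterOpen, closeTag)
	-- rawCell still starts with the rest of the opening tag (attributes and ">"),
	-- so a "<" is put back in front to let stripTags remove it as a whole tag.
	return trimText(stripTags("<" & rawCell))
end cellText

-- Keep only the text that lies outside <...> tags.
on stripTags(theText)
	if theText is "" then return ""
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "<"
	set chunks to text items of theText
	set AppleScript's text item delimiters to ">"
	set plainText to item 1 of chunks -- text before the first tag
	repeat with i from 2 to (count of chunks)
		set parts to text items of (item i of chunks)
		-- parts 2 thru -1 are the text after the tag's closing ">"
		if (count of parts) > 1 then set plainText to plainText & ((items 2 thru -1 of parts) as text)
	end repeat
	set AppleScript's text item delimiters to tid
	return plainText
end stripTags

-- Remove leading and trailing spaces, tabs, and line breaks.
on trimText(theText)
	set blanks to {space, tab, return, linefeed}
	repeat while theText is not "" and (character 1 of theText) is in blanks
		if (count of theText) is 1 then return ""
		set theText to text 2 thru -1 of theText
	end repeat
	repeat while theText is not "" and (character -1 of theText) is in blanks
		set theText to text 1 thru -2 of theText
	end repeat
	return theText
end trimText
```

For the three stand-in rows this should return {{"Role", "Other"}, {"Fax", ""}, {"Zip/ Postal Code", "8834"}}. Matching on "<th" and "<tr" would also match tags such as thead or track, so the sketch is only suitable for simple tables like the one in the sample.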

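Because the question asks for the same extraction across 1800 files, the handler can be called in a loop inside an Automator "Run AppleScript" action. This is a rough sketch under stated assumptions, not something given in the original thread: it assumes the action receives a list of aliases to the converted TXT files, that the files are UTF-8 encoded, and that each file contains one table. The variable names extractedTables, fileRef, and pageText are illustrative only.

```applescript
-- Assumption: input is a list of aliases to UTF-8 text files passed in by Automator.
on run {input, parameters}
	set extractedTables to {}
	repeat with fileRef in input
		set pageText to read (contents of fileRef) as «class utf8»
		if pageText contains "<table" and pageText contains "</table>" then
			set end of extractedTables to extractBetween(pageText, "<table", ">", "</table>")
		else
			set end of extractedTables to "" -- no table found in this file
		end if
	end repeat
	return extractedTables -- one entry per input file, in the same order
end run

-- Same two-step handler as in the answer: find "<table", then skip to the first ">".
to extractBetween(SearchText, startText1, startText2, endText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to startText1
	set endItems to text item -1 of SearchText
	set AppleScript's text item delimiters to endText
	set beginningToEnd to text item 1 of endItems
	set AppleScript's text item delimiters to startText2
	set finalText to (text items 2 thru -1 of beginningToEnd) as text
	set AppleScript's text item delimiters to tid
	return finalText
end extractBetween
```

Returning a list keeps one result per file, so a later workflow step can write each table out separately. Files without a table produce an empty string rather than stopping the whole run.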
