Need to fetch specific data from an external page


Question

I am making a cfhttp call and getting the data back.

The response is a complete page like the one below:

<html><title>MyPage</title><head><link rel="stylesheet" href="style.css"></head>
<body>
<table></table>
<table></table>
<table></table>
<table></table>
<table></table>
<table></table>
</body>
</html>

I want only the code that is inside the body tag, with the last table tag removed completely.

I am not sure where to start. (P.S. JSoup is not an option.)

I tried the code below, but it did not yield any results:

<cfset objPattern = CreateObject("java","java.util.regex.Pattern").Compile(JavaCast("string","(?i)<table[^>]*>([\w\W](?!<table))+?</table>"))>
<cfset objMatcher = objPattern.Matcher(JavaCast("string", cfhttp.FileContent))>
<cfoutput>#objMatcher#</cfoutput>

Solution

To convince the client, explain that while regular expressions are great for some jobs, they are not the best tool for parsing HTML. JSoup is not an external service; it is a pre-built library designed specifically for this task (unlike regular expressions).

JSoup is very simple to use, and similar to working with JavaScript's DOM. Just add the JSoup jar to your class path (restart if needed) and it is ready to use.

"I want the code which is inside the body tag, and also remove the last table tag completely."

Load the html content into a Document object and grab the <body> element:

jsoup = createObject("java", "org.jsoup.Jsoup");
doc = jsoup.parse( yourHTMLContentString );
body = doc.body();

Use a selector to grab and remove the last <table> element:

elem = doc.select("table:last-of-type");
elem.remove();

That is it. Now you can print, or do whatever you want, with the <body> element's html:

writeOutput( HTMLEditFormat(body.html()) );
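
Put together, the steps above can be sketched as a single cfscript block. This is only an illustration: it assumes the JSoup jar is already on the class path, and the variable pageContent with its sample markup is a hypothetical stand-in for cfhttp.FileContent.

```cfml
<cfscript>
// Hypothetical sample input; in practice this would be cfhttp.FileContent.
pageContent = '<html><head><title>MyPage</title></head><body>'
            & '<table id="a"></table><table id="b"></table><table id="c"></table>'
            & '</body></html>';

// Assumes the JSoup jar is on the class path.
jsoup = createObject("java", "org.jsoup.Jsoup");
doc   = jsoup.parse(pageContent);

// Remove the <table> that is the last one among its siblings.
doc.select("table:last-of-type").remove();

// Print the inner HTML of <body>, escaped so it displays as text.
writeOutput(HTMLEditFormat(doc.body().html()));
</cfscript>
```

Note that table:last-of-type matches the last table within each parent element, so a page with nested tables could have more than one table removed. In the sample page all tables are direct children of body, so only the final one is affected.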

See the JSoup documentation for more information. In particular, the JSoup Cookbook has some very good examples.
