Need to fetch specific data from an external page
Problem description
I am making a cfhttp call and getting the data back.
Now I am getting a complete page like below:
<html><title>MyPage</title><head><link rel="stylesheet" href="style.css"></head>
<body>
<table></table>
<table></table>
<table></table>
<table></table>
<table></table>
<table></table>
</body>
</html>
Now the issue: I want the code which is inside the body tag, and also to remove the last table tag completely.
I am not sure where to start. [P.S. JSoup is not an option.]
I tried the following, but it did not yield any results:
<cfset objPattern = CreateObject("java","java.util.regex.Pattern").Compile(JavaCast("string","(?i)<table[^>]*>([\w\W](?!<table))+?</table>"))>
<cfset objMatcher = objPattern.Matcher(JavaCast( "string", cfhttp.FileContent ))>
<cfoutput>#objMatcher#</cfoutput>
As far as convincing the client, explain that while regular expressions are great for some jobs, they are really not the best tool for parsing HTML. JSoup is not an external service. It is a pre-built library designed specifically for this task (unlike regular expressions).
JSoup is very simple to use, and similar to working with the DOM in JavaScript. Just add the JSoup jar to your class path (restart if needed) and it is ready to use.
"I want the code which is inside the body tag, and also to remove the last table tag completely."
Load the HTML content into a Document object and grab the <body> element:
jsoup = createObject("java", "org.jsoup.Jsoup");
doc = jsoup.parse( yourHTMLContentString );
body = doc.body();
Use a selector to grab and remove the last <table> element:
elem = doc.select("table:last-of-type");
elem.remove();
That is it. Now you can print, or do whatever you want, with the <body> element's HTML:
writeOutput( HTMLEditFormat(body.html()) );
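Putting the steps together, a minimal end-to-end sketch might look like the following. It assumes the JSoup jar is already on the class path and a CFML engine that supports the script form of cfhttp; the URL and the variable names httpResult and topLevelTables are hypothetical. Unlike table:last-of-type, which also matches a nested table that happens to be the last table among its own siblings, the body > table selector used here only considers tables that are direct children of <body>, which may be closer to the intent if the real page nests tables.

```cfml
<cfscript>
// Hypothetical URL; replace it with the page actually being fetched.
cfhttp(url="https://example.com/page.html", method="get", result="httpResult");

// Assumption: the JSoup jar is on the class path.
jsoup = createObject("java", "org.jsoup.Jsoup");
doc   = jsoup.parse(httpResult.fileContent);
body  = doc.body();

// Only tables that are direct children of <body>, in document order.
topLevelTables = doc.select("body > table");

// Guard against pages with no top-level table before removing anything.
if (topLevelTables.size() > 0) {
    // last() returns the final matched element; remove() detaches it from the document.
    topLevelTables.last().remove();
}

// body.html() is the inner HTML of <body>, now without the removed table.
writeOutput(HTMLEditFormat(body.html()));
</cfscript>
```

For the sample page above, this should leave five of the six tables in the output. Drop the HTMLEditFormat call if the markup should be rendered rather than displayed as text.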
See their documentation for more information. In particular, the JSoup Cookbook has some very good examples.