How to convert HTML to JSON using PHP?
I can convert JSON to HTML using the JsontoHtml library. Now, I need to convert existing HTML to JSON as shown on this site. When I looked into the code, I found the following script:
<script>
$(function(){
    //HTML to JSON
    $('#btn-render-json').click(function() {
        //Set html output
        $('#html-output').html( $('#html-input').val() );
        //Process to JSON and format it for consumption
        $('#html-json').html( FormatJSON(toTransform($('#html-output').children())) );
    });
});
//Convert obj or array to transform
function toTransform(obj) {
    var json;
    if( obj.length > 1 )
    {
        json = [];
        for(var i = 0; i < obj.length; i++)
            json[json.length++] = ObjToTransform(obj[i]);
    } else
        json = ObjToTransform(obj);
    return(json);
}
//Convert obj to transform
function ObjToTransform(obj)
{
    //Get the DOM element
    var el = $(obj).get(0);
    //Add the tag element
    var json = {'tag':el.nodeName.toLowerCase()};
    for (var attr, i=0, attrs=el.attributes, l=attrs.length; i<l; i++){
        attr = attrs[i];
        json[attr.nodeName] = attr.value;
    }
    var children = $(obj).children();
    if( children.length > 0 ) json['children'] = [];
    else json['html'] = $(obj).text();
    //Add the children
    for(var c = 0; c < children.length; c++)
        json['children'][json['children'].length++] = toTransform(children[c]);
    return(json);
}
//Format JSON (with indents)
function FormatJSON(oData, sIndent) {
    if (arguments.length < 2) {
        var sIndent = "";
    }
    var sIndentStyle = " ";
    var sDataType = RealTypeOf(oData);
    // open object
    if (sDataType == "array") {
        if (oData.length == 0) {
            return "[]";
        }
        var sHTML = "[";
    } else {
        var iCount = 0;
        $.each(oData, function() {
            iCount++;
            return;
        });
        if (iCount == 0) { // object is empty
            return "{}";
        }
        var sHTML = "{";
    }
    // loop through items
    var iCount = 0;
    $.each(oData, function(sKey, vValue) {
        if (iCount > 0) {
            sHTML += ",";
        }
        if (sDataType == "array") {
            sHTML += ("\n" + sIndent + sIndentStyle);
        } else {
            sHTML += ("\"" + sKey + "\"" + ":");
        }
        // display relevant data type
        switch (RealTypeOf(vValue)) {
            case "array":
            case "object":
                sHTML += FormatJSON(vValue, (sIndent + sIndentStyle));
                break;
            case "boolean":
            case "number":
                sHTML += vValue.toString();
                break;
            case "null":
                sHTML += "null";
                break;
            case "string":
                sHTML += ("\"" + vValue + "\"");
                break;
            default:
                sHTML += ("TYPEOF: " + typeof(vValue));
        }
        // loop
        iCount++;
    });
    // close object
    if (sDataType == "array") {
        sHTML += ("\n" + sIndent + "]");
    } else {
        sHTML += ("}");
    }
    // return
    return sHTML;
}
//Get the type of the obj (can replace by jquery type)
function RealTypeOf(v) {
    if (typeof(v) == "object") {
        if (v === null) return "null";
        if (v.constructor == (new Array).constructor) return "array";
        if (v.constructor == (new Date).constructor) return "date";
        if (v.constructor == (new RegExp).constructor) return "regex";
        return "object";
    }
    return typeof(v);
}
</script>
Now I need to do the same thing in PHP. I can already get the HTML data; all I need is to convert the JavaScript functions to PHP functions. Is this possible? My main doubts are as follows:
The primary input to the JavaScript function toTransform() is an object. Is it possible to convert HTML to an object in PHP? Are all the functions used in this JavaScript available in PHP?
Please suggest an approach.
When I tried to convert a script tag to JSON as per the answer given, I got errors. When I tried it on the json2html site, it showed like this: .. How can I achieve the same result?
If you are able to obtain a DOMDocument object representing your HTML, then you just need to traverse it recursively and construct the data structure that you want.
Converting your HTML document into a DOMDocument should be as simple as this:
function html_to_obj($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return element_to_obj($dom->documentElement);
}
Then, a simple traversal of $dom->documentElement which gives the kind of structure you described could look like this:
function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}
Test case
$html = <<<EOF
<!DOCTYPE html>
<html lang="en">
<head>
<title> This is a test </title>
</head>
<body>
<h1> Is this working? </h1>
<ul>
<li> Yes </li>
<li> No </li>
</ul>
</body>
</html>
EOF;
header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);
Output
{
    "tag": "html",
    "lang": "en",
    "children": [
        {
            "tag": "head",
            "children": [
                {
                    "tag": "title",
                    "html": " This is a test "
                }
            ]
        },
        {
            "tag": "body",
            "html": " \n ",
            "children": [
                {
                    "tag": "h1",
                    "html": " Is this working? "
                },
                {
                    "tag": "ul",
                    "children": [
                        {
                            "tag": "li",
                            "html": " Yes "
                        },
                        {
                            "tag": "li",
                            "html": " No "
                        }
                    ],
                    "html": "\n "
                }
            ]
        }
    ]
}
Answer to updated question
The solution proposed above does not work with the <script> element, because it is parsed not as a DOMText, but as a DOMCharacterData object. This is because the DOM extension in PHP is based on libxml2, which parses your HTML as HTML 4.0, and in HTML 4.0 the content of <script> is of type CDATA and not #PCDATA.
You have two solutions for this problem.
The simple but not very robust solution would be to add the LIBXML_NOCDATA flag to DOMDocument::loadHTML. (I am not actually 100% sure whether this works for the HTML parser.)
The more difficult but, in my opinion, better solution is to add an additional test when you are testing $subElement->nodeType before the recursion. The recursive function would become:
function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        elseif ($subElement->nodeType == XML_CDATA_SECTION_NODE) {
            // <script> content arrives as a CDATA section node
            $obj["html"] = $subElement->data;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}
If you hit another bug of this type, the first thing you should do is check what type of node $subElement is, because there exist many other possibilities that my short example function does not deal with.
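As one illustration of those other possibilities: an HTML comment inside an element would fall into the else branch and be treated as an element, even though a comment node has no tagName or attributes. The following is a minimal sketch of a stricter variant, not part of the original answer. The name node_to_obj and the decision to silently skip every node type other than elements, text, and CDATA are assumptions made for the example.

```php
<?php
// Hypothetical variant of element_to_obj(); the name node_to_obj is an assumption.
function node_to_obj(DOMElement $element): array {
    $obj = ["tag" => $element->tagName];
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $child) {
        switch ($child->nodeType) {
            case XML_TEXT_NODE:
            case XML_CDATA_SECTION_NODE:
                // Text and CDATA nodes both expose their content through ->data.
                $obj["html"] = $child->data;
                break;
            case XML_ELEMENT_NODE:
                // Only real elements are recursed into.
                $obj["children"][] = node_to_obj($child);
                break;
            default:
                // Assumption: comments, processing instructions, etc. are dropped.
                break;
        }
    }
    return $obj;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><p id="a">Hello<!-- note --></p></body></html>');
echo json_encode(node_to_obj($dom->documentElement), JSON_PRETTY_PRINT);
```

Whether skipping unknown node types is the right choice depends on your data; you may prefer to record comments under a separate key instead.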
Additionally, you will notice that libxml2 has to fix mistakes in your HTML in order to be able to build a DOM for it. This is why <html> and <head> elements will appear even if you don't specify them. You can avoid this by using the LIBXML_HTML_NOIMPLIED flag.
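A minimal sketch of passing that flag, assuming a PHP and libxml2 build recent enough to support it (the flag is passed as the second argument of DOMDocument::loadHTML). The sample markup is made up for the illustration, and it deliberately has a single root element, since fragments with several top-level elements may not come out the way you expect with this flag.

```php
<?php
// Sample fragment (hypothetical) with no <html>, <head> or <body> wrapper.
$html = '<div class="box"><p>Hi</p></div>';

$dom = new DOMDocument();
// LIBXML_HTML_NOIMPLIED asks libxml2 not to add the implied wrapper elements.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);

// The root element should now be the fragment's own root rather than <html>.
echo $dom->documentElement->tagName, "\n";
```

With the flag set, $dom->documentElement can be handed to element_to_obj() directly, and the resulting JSON starts at your own root element.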
Test case with script
$html = <<<EOF
<script type="text/javascript">
alert('hi');
</script>
EOF;
header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);
Output
{
    "tag": "html",
    "children": [
        {
            "tag": "head",
            "children": [
                {
                    "tag": "script",
                    "type": "text\/javascript",
                    "html": "\n alert('hi');\n "
                }
            ]
        }
    ]
}