Modifying HTML in Memory with JSoup

Views: 157 | Published: 2018/6/24 18:56:23 | Tags: java, html, io

Question

Recently I was recommended to use JSoup to parse and modify HTML documents.

However, what if I have an HTML document that I want to modify (to send, store somewhere else, etc.)? How might I go about doing that without changing the original document?

Say I have an HTML file like so:

    <html>
    <head></head>
    <body>
    <p></p>
    <h2>Title: title</h2>
    <p></p>
    <p>Name: </p>
    <p>Address: </p>
    <p>Phone Number: </p>
    </body>
    </html>

I want to fill in the appropriate data for Name, Address, Phone Number, and any other information I'd like, without modifying the original HTML file. How might I go about that using JSoup?

Solution

@MarcoS has an excellent solution at https://stackoverflow.com/a/6594828/1861357 that uses a NodeTraversor to build a list of nodes to change. I only slightly modified his method, which replaces a node (a set of tags) with the data in the node plus whatever information you would like to add.

To keep the HTML in memory, I used a static StringBuilder.
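
A side note on the static builders: because they are static, they have to be reset with setLength(0) before they can be reused. One alternative, shown here only as a sketch, is to keep the builder local to a method and hand back a String. HtmlBuffer and join are made-up names for illustration; they are not part of the original code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class HtmlBuffer {
    // Builds the HTML text in a local StringBuilder, so nothing is left over between calls.
    static String join(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n'); // keep the line breaks
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("<p>Name: </p>", "<p>Address: </p>");
        System.out.print(join(lines));
    }
}
```

With this shape there is no shared state to clear, at the cost of passing the String around explicitly.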

First we read in the HTML file (the path is hard-coded here, but that can be changed), then we run a series of checks to fill in whichever nodes we want with data.
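
As a rough sketch of the read step, assuming Java 8 or later and a UTF-8 file: the snippet below loads the file into a String, so every later change happens on the in-memory copy and the file on disk is only ever read. HtmlLoader, load, and writeSample are hypothetical names, not part of the original answer. Unlike a loop that appends readLine() results, this keeps the line breaks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original answer.
class HtmlLoader {
    // Reads the whole file into a String (assumed UTF-8). The file is only read, never written.
    static String load(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes sample HTML to a temporary file and returns its path.
    static Path writeSample(String html) {
        try {
            Path file = Files.createTempFile("sample", ".html");
            Files.write(file, html.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample("<p>Name: </p>\n<p>Address: </p>\n");
        String html = load(file); // in-memory copy
        String changed = html.replace("Name: ", "Name: John Doe");
        System.out.println(changed);
        // The file on disk still holds the original text.
        System.out.println(load(file).equals(html)); // true
    }
}
```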

The one problem in MarcoS's solution that I didn't fix is that it splits the text into individual words instead of looking at the whole line. As a workaround, I join multi-word labels with '-' (for example, "Phone-Number"), because otherwise the inserted string lands directly after the matching word.
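
One possible way around the word-splitting limitation, sketched under the assumption that each label ends with a colon and is the only text in its node (as in "Phone Number: "), is to match on the whole label rather than on single words. LabelFiller, fill, and the map of values are hypothetical names; they are not part of the original answer or of jsoup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: matches whole labels, so "Phone Number:" needs no hyphen.
class LabelFiller {
    // Appends the value if the trimmed text is exactly "<label>:".
    // Text that matches no label is returned unchanged.
    static String fill(String text, Map<String, String> values) {
        String trimmed = text.trim();
        for (Map.Entry<String, String> entry : values.entrySet()) {
            if (trimmed.equals(entry.getKey() + ":")) {
                return trimmed + " " + entry.getValue();
            }
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Name", "John Doe");
        values.put("Phone Number", "(123) 456-7890");

        System.out.println(fill("Phone Number: ", values)); // Phone Number: (123) 456-7890
        System.out.println(fill("Title: title", values));   // Title: title
    }
}
```

In the full implementation, a method like this could stand in for the word loop in buildElementForText, with the result of getWholeText() as its input.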

So a full implementation:

    import java.io.*;
    import java.util.*;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.*;
    import org.jsoup.select.*;

    public class memoryHTML
    {
        static String htmlLocation = "C:\\Users\\User\\";
        static String fileName = "blah"; // Just for demonstration, easily modified.
        static StringBuilder buildTmpHTML = new StringBuilder();
        static StringBuilder buildHTML = new StringBuilder();
        static String name = "John Doe";
        static String address = "42 University Dr., Somewhere, Someplace";
        static String phoneNumber = "(123) 456-7890";

        public static void main(String[] args)
        {
            // You can send it the full path with the filename. I split them up because I used this for multiple files.
            readHTML(htmlLocation, fileName);
            modifyHTML();

            System.out.println(buildHTML.toString());

            // Clear the static StringBuilders, or their contents stay in memory and pile up if these methods are called again.
            buildTmpHTML.setLength(0);
            buildHTML.setLength(0);

            System.exit(0);
        }

        // Reads the HTML file into buildTmpHTML; that in-memory copy is what modifyHTML() works on.
        public static void readHTML(String directory, String fileName)
        {
            try
            {
                BufferedReader br = new BufferedReader(new FileReader(directory + fileName + ".html"));
                String line;
                while ((line = br.readLine()) != null)
                {
                    buildTmpHTML.append(line);
                }
                br.close();
            }
            catch (Exception e)
            {
                e.printStackTrace();
                System.exit(1);
            }
        }

        // Excellent method of parsing and modifying nodes in HTML files by @MarcoS at https://stackoverflow.com/a/6594828/1861357
        // It has its small problems, but it does the trick.
        public static void modifyHTML()
        {
            String htmld = buildTmpHTML.toString();
            Document doc = Jsoup.parse(htmld);

            final List<TextNode> nodesToChange = new ArrayList<TextNode>();

            // Collect every text node under <body> first, then replace them after the traversal.
            // Note: this uses the older jsoup API; newer releases replace this constructor
            // with the static NodeTraversor.traverse(visitor, root).
            NodeTraversor nd = new NodeTraversor(new NodeVisitor()
            {
                @Override
                public void tail(Node node, int depth)
                {
                    if (node instanceof TextNode)
                    {
                        TextNode textNode = (TextNode) node;
                        nodesToChange.add(textNode);
                    }
                }

                @Override
                public void head(Node node, int depth)
                {
                }
            });

            nd.traverse(doc.body());

            for (TextNode textNode : nodesToChange)
            {
                Node newNode = buildElementForText(textNode);
                textNode.replaceWith(newNode);
            }

            buildHTML.append(doc.html());
        }

        private static Node buildElementForText(TextNode textNode)
        {
            String text = textNode.getWholeText();
            String[] words = text.trim().split(" ");
            Set<String> units = new HashSet<String>();
            for (String word : words)
                units.add(word);

            String newText = text;
            for (String rpl : units)
            {
                // replace() matches rpl literally; replaceAll() would treat it as a regular expression.
                if (rpl.contains("Name"))
                    newText = newText.replace(rpl, rpl + " " + name);
                if (rpl.contains("Address") || rpl.contains("Residence"))
                    newText = newText.replace(rpl, rpl + " " + address);
                if (rpl.contains("Phone-Number") || rpl.contains("PhoneNumber"))
                    newText = newText.replace(rpl, rpl + " " + phoneNumber);
            }
            return new DataNode(newText, textNode.baseUri());
        }
    }
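
For readers on a more recent jsoup release, here is a hedged sketch of the same collect-then-modify idea. It assumes jsoup 1.11 or later on the classpath, where NodeTraversor.traverse is a static method, and it updates each TextNode in place with text(...) instead of swapping in a DataNode. InMemoryFill and its fill method are illustrative names, not the original author's code.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// Illustrative sketch; assumes jsoup 1.11+ is on the classpath.
class InMemoryFill {
    // Parses the HTML string, appends a value after a matching label, and returns the new body HTML.
    // The input string (and any file it came from) is left untouched.
    static String fill(String html, String label, String value) {
        Document doc = Jsoup.parse(html);
        final List<TextNode> textNodes = new ArrayList<>();

        // Collect first, modify afterwards, as in the original approach.
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    textNodes.add((TextNode) node);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        }, doc.body());

        for (TextNode textNode : textNodes) {
            String text = textNode.getWholeText().trim();
            if (text.equals(label)) {
                textNode.text(text + " " + value);
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p>Name: </p><p>Phone Number: </p>";
        System.out.println(fill(html, "Phone Number:", "(123) 456-7890"));
    }
}
```

Because the values go through a TextNode, characters such as < or & in the data are escaped in the output, whereas a DataNode writes its content verbatim.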

And you'll get this HTML back (remember I changed "Phone Number" to "Phone-Number"):

    <html>
    <head></head>
    <body>
    <p></p>
    <h2>Title: title</h2>
    <p></p>
    <p>Name: John Doe</p>
    <p>Address: 42 University Dr., Somewhere, Someplace</p>
    <p>Phone-Number: (123) 456-7890</p>
    </body>
    </html>
