Trim() in Java not working the way I expect?
Possible Duplicate:
Query about the trim() method in Java
I am parsing a site's usernames and other information, and each one has a bunch of spaces after it (but spaces in between the words). For example: "Bob the Builder " or "Sam the welder ". The number of spaces varies from name to name. I figured I'd just use .trim(), since I've used it before. However, it's giving me trouble. My code looks like this:
for (int i = 0; i < splitSource3.size(); i++) {
    splitSource3.set(i, splitSource3.get(i).trim());
}
The result is just the same; no spaces are removed at the end. Thank you in advance for your excellent answers!
UPDATE:
The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:
for (String s : splitSource2) {
    if (s.length() > "<td class=\"dddefault\">".length() && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
        splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
    }
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
    splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
    splitSource3.set(i, splitSource3.get(i).trim());
    System.out.println(i + ": " + splitSource3.get(i));
}
}
UPDATE:
Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code for you to collaborate on and help solve my issue. Note the phrase "my issue" and not "Java's issue". I have actually had the code printing out
System.out.println(i + ": " + splitSource3.get(i) + "*");
in a for-each loop afterward.
This is how I knew I had a problem. By the way, the problem has still not been fixed.
UPDATE:
Sample output (minus single quotes):
'0: Olin D. Kirkland '
'1: Sophomore '
'2: Someplace, Virginia 12345<br />VA SomeCity<br />'
'3: Undergraduate '
EDIT: The OP rephrased his question at "Query about the trim() method in Java", where the issue was found to be Unicode whitespace characters, which are not matched by String.trim().
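As a rough illustration of that finding, here is a minimal sketch of a trim that also removes Unicode space characters such as U+00A0 (no-break space), which String.trim() leaves alone because it only strips characters up to U+0020. The class name UnicodeTrim and the helpers isTrimmable and trimUnicode are made up for this example, and the no-break spaces in the sample string are an assumption about what the scraped page contained.

```java
final class UnicodeTrim {

    // True for anything we want removed from the ends of a string:
    // Java whitespace, Unicode space separators such as U+00A0,
    // and ISO control characters (a superset of the controls trim() removes).
    static boolean isTrimmable(int codePoint) {
        return Character.isWhitespace(codePoint)
                || Character.isSpaceChar(codePoint)
                || Character.isISOControl(codePoint);
    }

    // Strips trimmable code points from both ends of s.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isTrimmable(cp)) break;
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isTrimmable(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical scraped value: two no-break spaces plus a normal space.
        String scraped = "Olin D. Kirkland\u00A0\u00A0 ";
        System.out.println("[" + scraped.trim() + "]");        // no-break spaces survive
        System.out.println("[" + trimUnicode(scraped) + "]");  // prints "[Olin D. Kirkland]"
    }
}
```

Note that String.strip() (Java 11 and later) is defined in terms of Character.isWhitespace, which excludes U+00A0, so it would not help with no-break spaces either.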
It just occurred to me that I had this sort of issue when I worked on a screen-scraping project. The key is that downloaded HTML sources sometimes contain non-printable characters that are not whitespace characters either. These are very difficult to copy and paste into a browser. I assume this could have happened to you.
If my assumption is correct then you've got two choices:
Use a binary reader to figure out what those characters are, then delete them with String.replace(). E.g.:
private static String cutCharacters(String fromHtml) {
    String result = fromHtml;
    // This could be a private static final constant too.
    char[] problematicCharacters = {'\000', '\001', '\003'};
    for (char ch : problematicCharacters) {
        // There is no replace(char, String) overload, so convert the char to a String first.
        result = result.replace(String.valueOf(ch), "");
    }
    return result;
}
If you find some sort of recurring pattern in the HTML to be parsed, then you can use regexes and substrings to cut the unwanted parts. E.g.:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private String getImportantParts(String fromHtml) {
    // Keeps runs of word characters and the whitespace that follows them.
    // This could be a private static final constant as well.
    Pattern p = Pattern.compile("(\\w*\\s*)");
    Matcher m = p.matcher(fromHtml);
    StringBuilder buff = new StringBuilder();
    while (m.find()) {
        buff.append(m.group(1));
    }
    return buff.toString().trim();
}
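For the first choice, one way to find out which characters are actually sitting at the end of a scraped string, without a separate binary reader, is to print every code point in hexadecimal. This is only a sketch; the class name CodePointDump and the method describe are invented for the example.

```java
final class CodePointDump {

    // Returns one "U+XXXX" entry per code point, separated by spaces,
    // so that invisible characters become visible.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical scraped value ending in a no-break space and a space.
        System.out.println(describe("A\u00A0 ")); // prints "U+0041 U+00A0 U+0020"
    }
}
```

Any code point above U+0020 that shows up at the end of the dump is one that trim() will not remove and that has to be handled explicitly.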