将格式化的电子邮件(HTML)转换为纯文本? [英] Convert formatted email (HTML) to plain Text?

查看:178
本文介绍了将格式化的电子邮件(HTML)转换为纯文本?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这个代码实现 ParserCallback 并将 HTML 电子邮件转换为 code>文字。当我解析电子邮件正文时,这段代码工作正常=

I have this code that implements ParserCallback and converts HTML emails to Plain text. This code works fine when I parse email body like this =

  "DO NOT REPLY TO THIS EMAIL MESSAGE.   <br>---------------------------------------<br>\n" +
                "nix<br>---------------------------------------<br> Esfghjdfkj\n" +
                "</blockquote></div><br><br clear=\"all\"><div><br></div>-- <br><div dir=\"ltr\"><b>Regards <br>Nisj<br>Software Engineer<br></b><div><b>Bingo</b></div></div>\n" +
                "</div>"

但是当我解析这个电子邮件正文时,它返回null,

but when I parse this kinda email body, it returns null,

 email = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html charset=us-ascii\"></head><body style=\"word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;\">Got it...so pls send to customer now.<div><br><div style=\"\"><div>On Nov 8, 2013, at 12:31 PM, <a href=\"mailto:xxxxxxx.com\">xxxxxxx.com</a> wrote:</div><br class=\"Apple-interchange-newline\"><blockquote type=\"cite\">Forwarding test.<br>---------------------------------------<br> ABCD.</blockquote></div><br></div></body></html>";

代码:

import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class EmailBody {
    public static void main(String[] args) throws IOException
    {
        String email = "";

        class EmailCallback extends ParserCallback
        {
            private String body_;
            private boolean divStarted_;

            public String getBody()
            {
                return body_;
            }

            @Override
            public void handleStartTag(Tag t, MutableAttributeSet a, int pos)
            {
                if (t.equals(Tag.DIV) && "ltr".equals(a.getAttribute(Attribute.DIR)))
                {
                    divStarted_ = true;
                }
            }

            @Override
            public void handleEndTag(Tag t, int pos)
            {
                if (t.equals(Tag.DIV))
                {
                    divStarted_ = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos)
            {
                if (divStarted_)
                {
                    body_ = new String(data);
                }
            }
        }
        EmailCallback callback = new EmailCallback();
        Parser parser = new ParserDelegator();
        StringReader reader = new StringReader(email);
        parser.parse(reader, callback, true);
        reader.close();
        System.out.println(callback.getBody());
    }
}

你能说出原因吗正在发生?

推荐答案

你的代码只会从 DIV 元素,它们具有 dir 属性,其中 ltr 值。如果 divStarted _ 标志为true,那么 handleText 方法只会处理元素文本,只有当 handleStartTag 将此标志设置为true。

在第一个电子邮件示例中,您具有这样的元素,第二个不包含它们。

You code will only take the element text from DIV elements which have a dir attribute with an ltr value. The handleText method will only handle the element text if the divStarted_ flag is true, which happens only if the handleStartTag set this flag to true.
In the first email example you have such elements, in the second one you do not have them.

这篇关于将格式化的电子邮件(HTML)转换为纯文本?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆