Find linebreaks in a docx file using PHP

Problem description

My PHP script successfully reads all the text from a .docx file, but I cannot work out where the line breaks should go, so the text comes out bunched up and hard to read (one huge paragraph). I have gone through all of the XML files in the archive by hand and still cannot find what marks them.

Here are the functions I use to retrieve the file data and return the plain text.

public function read($FilePath)
{
    // Save name of the file
    parent::SetDocName($FilePath);

    $Data = $this->docx2text($FilePath);

    $Data = str_replace("<", "&lt;", $Data);
    $Data = str_replace(">", "&gt;", $Data);

    $Breaks = array("\r\n", "\n", "\r");
    $Data = str_replace($Breaks, '<br />', $Data);

    $this->Content = $Data;
}

function docx2text($filename) {
    return $this->readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile)
{
    // Create new ZIP archive
    $zip = new ZipArchive;

    // Open received archive file
    if (true === $zip->open($archiveFile))
    {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false)
        {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);

            // Close archive file
            $zip->close();

            // Load XML from a string
            // Skip errors and warnings
            // loadXML() is an instance method; calling it statically fails on PHP 8
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);

            $xmldata = $xml->saveXML();
            //$xmldata = str_replace("</w:t>", "\r\n", $xmldata);
            // Return data without XML formatting tags
            return strip_tags($xmldata);
        }

        $zip->close();
    }

    // In case of failure return empty string
    return "";
} 

Recommended answer

The answer is actually quite simple. Add this line to readZippedXML(), after the saveXML() call and before strip_tags() removes the tags:

$xmldata = str_replace("</w:p>", "\r\n", $xmldata);

This works because </w:p> is the closing tag Word uses to mark the end of a paragraph. A simplified example (real markup wraps the text in <w:r> and <w:t> elements):

<w:p>This is a paragraph.</w:p>
<w:p>And a second one.</w:p>
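
To see the replacement work on markup closer to what a real document.xml contains, here is a minimal sketch. The function name documentXmlToText() and the sample string are made up for illustration and are not part of the original post. It also turns <w:br/> elements (manual line breaks inside a paragraph) into newlines, on the assumption that they appear as self-closing tags in the serialized XML.

```php
<?php
// Hypothetical helper (not from the original post): takes the XML text of
// word/document.xml and returns plain text with one line per paragraph.
function documentXmlToText(string $xmlData): string
{
    // Assumption: manual line breaks are serialized as self-closing
    // <w:br/> tags, with or without attributes.
    $xmlData = preg_replace('/<w:br\b[^>]*\/>/', "\n", $xmlData);

    // Every paragraph ends with </w:p>, so replace it with a newline.
    $xmlData = str_replace('</w:p>', "\n", $xmlData);

    // Remove whatever tags are left; only text and newlines remain.
    return strip_tags($xmlData);
}

// Hand-written sample in the shape of a small document body.
$sample = '<w:body>'
    . '<w:p><w:r><w:t>This is a paragraph.</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>And a second</w:t><w:br/><w:t>one.</w:t></w:r></w:p>'
    . '</w:body>';

echo documentXmlToText($sample);
```

The sketch uses "\n" rather than "\r\n"; either works with the read() method from the question, which converts all three newline variants to <br />.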
