Find line breaks in a docx file using PHP
Question
My PHP script successfully reads all of the text from a .docx file, but I cannot work out where the line breaks should go, so the text comes out bunched together as one huge paragraph that is hard to read. I have gone through all of the XML files in the archive by hand and still cannot find what marks a line break.
Here are the functions I use to retrieve the file data and return the plain text.
public function read($FilePath)
{
    // Save name of the file
    parent::SetDocName($FilePath);
    $Data = $this->docx2text($FilePath);
    // Escape angle brackets so the text can be shown as HTML
    $Data = str_replace("<", "&lt;", $Data);
    $Data = str_replace(">", "&gt;", $Data);
    // Turn every kind of newline into an HTML line break
    $Breaks = array("\r\n", "\n", "\r");
    $Data = str_replace($Breaks, '<br />', $Data);
    $this->Content = $Data;
}
function docx2text($filename) {
    return $this->readZippedXML($filename, "word/document.xml");
}
function readZippedXML($archiveFile, $dataFile)
{
    // Create new ZIP archive
    $zip = new ZipArchive;
    // Open received archive file
    if (true === $zip->open($archiveFile))
    {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false)
        {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            // loadXML() is an instance method, so create the document first
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $xmldata = $xml->saveXML();
            //$xmldata = str_replace("</w:t>", "\r\n", $xmldata);
            // Return data without XML formatting tags
            return strip_tags($xmldata);
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}
Recommended answer
It is actually quite a simple answer. All you need to do is add this line in readZippedXML(), before the strip_tags() call removes the tags:
$xmldata = str_replace("</w:p>", "\r\n", $xmldata);
This works because </w:p> is the tag Word uses to mark the end of a paragraph. A simplified example (in a real document the text sits inside <w:r> and <w:t> elements within each paragraph):
<w:p>This is a paragraph.</w:p>
<w:p>And a second one.</w:p>
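The replacement above can be tried on its own, without a .docx file, by running it against a small XML string. The sketch below is only an illustration: docxXmlToText() and $sample are hypothetical names that are not part of the original class, and the extra handling of <w:br/> (the element Word uses for a manual line break inside a paragraph) is an addition to the answer, not something it proposed. It uses "\n" rather than "\r\n", which read() converts to <br /> just the same.

```php
<?php
// Hypothetical helper (not in the original class): turns the contents of
// word/document.xml into plain text with one line per paragraph.
function docxXmlToText(string $xmldata): string
{
    // Manual line breaks inside a paragraph: <w:br/> or <w:br w:type="..."/>
    $xmldata = preg_replace('/<w:br\b[^>]*\/>/', "\n", $xmldata);
    // End of a paragraph, as described in the answer
    $xmldata = str_replace('</w:p>', "\n", $xmldata);
    // Drop the remaining tags and the trailing newline after the last paragraph
    return trim(strip_tags($xmldata));
}

// Made-up sample resembling the body of a document.xml file
$sample = '<w:body>'
        . '<w:p><w:r><w:t>This is a paragraph.</w:t></w:r></w:p>'
        . '<w:p><w:r><w:t>And a second</w:t><w:br/><w:t>one.</w:t></w:r></w:p>'
        . '</w:body>';

echo docxXmlToText($sample), "\n";
```

Doing the replacements before strip_tags() matters: once the tags are stripped, nothing is left to show where one paragraph ended and the next began.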