如何在Linux上的Word文档中获取页数? [英] How to get the number of pages in a Word Document on linux?

查看:705
本文介绍了如何在Linux上的Word文档中获取页数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我看到了这个问题 PHP-获取中的页数Word文档.我还需要根据给定的Word文件(doc/docx)确定页数.我试图调查phplivedocx/ZF(@hobodave与原始帖子答案中的链接),但是我在那里失去了手脚.我也不能使用任何外部Web服务(例如DOC2PDF站点,然后计算PDF版本中的页面,等等.).

I saw this question PHP - Get number of pages in a Word document . I also need to determine the pages count from given word file (doc/docx). I tried to investigate phplivedocx/ZF (@hobodave linked to those in the original post answers), but I lost my hands and legs there. I can't use any outer web service either (like DOC2PDF sites, and then count the pages in the PDF version, or so...).

简而言之:是否有任何php代码(使用ZF或PHP中的其他任何内容,不包括COM对象或其他执行文件,例如'AbiWord');我正在使用共享Linux服务器,而没有exec或类似功能),以查找word文件的页数?

Simply: Is there any php code (using ZF or anything else in PHP, excluding COM object or other execution-files, such 'AbiWord'; I'm using shared Linux server, without exec or similar function), to find the pages count of word file?

编辑:即将受支持的单词版本为Microsoft-Word 2003& 2007年.

The word versions that about to be supported are Microsoft-Word 2003 & 2007.

推荐答案

获取docx文件的页数非常简单:

Getting the number of pages for docx files is very easy:

function get_num_pages_docx($filename)
{
    $zip = new ZipArchive();

    if($zip->open($filename) === true)
    {  
        if(($index = $zip->locateName('docProps/app.xml')) !== false)
        {
            $data = $zip->getFromIndex($index);
            $zip->close();

            $xml = new SimpleXMLElement($data);
            return $xml->Pages;
        }

        $zip->close();
    }

    return false;
}

对于97-2003格式,这无疑是具有挑战性的,但绝不是不可能的.页数存储在文档的SummaryInformation部分中,但是由于文件的OLE格式使得查找起来很麻烦. 此处和更简单的此处.我今天看了一个小时,但距离还很远! (不是我习惯的抽象级别),但输出十六进制以更好地理解结构:

For 97-2003 format it's certainly challenging, but by no means impossible. The number of pages is stored in the SummaryInformation section of the document, but due to the OLE format of the files that makes it a pain to find. The structure is defined extremely thoroughly (though badly imo) here and simpler here. I looked at this for an hour today, but didn't get very far! (not a level of abstraction I'm used to), but output the hex to better understand the structure:

function get_num_pages_doc($filename) 
{
    $handle = fopen($filename, 'r');
    $line = @fread($handle, filesize($filename));

    echo '<div style="font-family: courier new;">';

        $hex = bin2hex($line);
        $hex_array = str_split($hex, 4);
        $i = 0;
        $line = 0;
        $collection = '';
        foreach($hex_array as $key => $string)
        {
            $collection .= hex_ascii($string);
            $i++;

            if($i == 1)
            {
                echo '<b>'.sprintf('%05X', $line).'0:</b> ';
            }

            echo strtoupper($string).' ';

            if($i == 8)
            {
                echo ' '.$collection.' <br />'."\n";
                $collection = '';
                $i = 0;

                $line += 1;
            }
        }

    echo '</div>';

    exit();
}

function hex_ascii($string, $html_safe = true)
{
    $return = '';

    $conv = array($string);
    if(strlen($string) > 2)
    {
        $conv = str_split($string, 2);
    }

    foreach($conv as $string)
    {
        $num = hexdec($string);

        $ascii = '.';
        if($num > 32)
        {   
            $ascii = unichr($num);
        }

        if($html_safe AND ($num == 62 OR $num == 60))
        {
            $return .= htmlentities($ascii);
        }
        else
        {
            $return .= $ascii;
        }
    }

    return $return;
}

function unichr($intval)
{
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}

将在其中放置代码的地方放出代码,例如:

which will out put code where you can find the sections such as:

007000: 0500 5300 7500 6D00 6D00 6100 7200 7900 ..S.u.m.m.a.r.y.
007010: 4900 6E00 6600 6F00 7200 6D00 6100 7400 I.n.f.o.r.m.a.t.
007020: 6900 6F00 6E00 0000 0000 0000 0000 0000 i.o.n...........
007030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 

这将允许您查看引用信息,例如:

Which will allow you to see the referencing info such as:

007040: 2800 0201 FFFF FFFF FFFF FFFF FFFF FFFF (...ÿÿÿÿÿÿÿÿÿÿÿÿ
007050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007070: 0000 0000 2500 0000 0010 0000 0000 0000 ....%...........

这将允许您确定所描述的属性:

Which will allow you to determine properties described:

_ab = ("SummaryInformation") 
_cb = 0028
_mse = 02 (STGTY_STREAM) 
_bflags = 01 (DE_BLACK) 
_sidLeftSib = FFFF FFFF 
_sidRightSib = FFFF FFFF (none) 
_sidChild = FFFF FFFF (n/a for STGTY_STREAM) 
_clsid = 0000 0000 0000 0000 0000 0000 0000 0000 (n/a) 
_dwUserFlags = 0000 0000 (n/a) 
_time[0] = CreateTime = 0000 0000 0000 0000 (n/a) 
_time[1] = ModifyTime = 0000 0000 0000 0000 (n/a)
_startSect = 0000 0000 
_ulSize = 0000 1000 
_dptPropType = 0000 (n/a)

这将使您找到代码的相关部分,将其解压缩并获取页码.当然,这是我没时间去做的难事,但应该为您设定正确的方向.

Which will let you find the relevant section of code, unpack it and get the page number. Of course this is the hard bit that I just don't have time for, but should set you in the right direction.

M $并不容易!

这篇关于如何在Linux上的Word文档中获取页数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆