Passing a string stored in memory to pdftotext, antiword, catdoc, etc.

Question

Is it possible to call CLI tools such as pdftotext, antiword, or catdoc (text extractor scripts) and pass them a string instead of a file?

Currently, I read PDF files by calling pdftotext with child_process.spawn. I spawn a new process and store the result in a new variable. Everything works fine.

I'd like to pass the binary returned by fs.readFile instead of the file itself:

fs.readFile('./my.pdf', (error, binary) => {
    // Call pdftotext with child_process.spawn passing the binary.
    let event = child_process.spawn('pdftotext', [
        // Args here!
    ]);
});

How can I do that?

Recommended answer

It's definitely possible, provided the command can handle piped input.

spawn returns a ChildProcess object, and you can pass the in-memory string (or binary) to the command by writing to that object's stdin. One way is to wrap the data in a Readable stream first and then pipe that stream into the CLI's stdin.
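
For data that is already in memory, such as the Buffer handed to the fs.readFile callback in the question, the idea can be sketched as follows. This is only an illustration: runWithInput is a hypothetical helper name, and it assumes a Node.js version that provides Readable.from and a command that reads from stdin and writes to stdout.

```javascript
const { spawn } = require('child_process')
const { Readable } = require('stream')

// Hypothetical helper (not part of any library): feed an in-memory Buffer or
// string to a command's stdin and resolve with everything it wrote to stdout.
function runWithInput(command, args, input) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args)
    const chunks = []

    child.on('error', reject)           // e.g. the command is not installed
    child.stdin.on('error', () => {})   // ignore EPIPE if the child exits early
    child.stdout.on('data', chunk => chunks.push(chunk))
    child.on('close', code => {
      if (code === 0) resolve(Buffer.concat(chunks))
      else reject(new Error(`${command} exited with code ${code}`))
    })

    // Wrap the data in a one-chunk Readable stream and pipe it to stdin;
    // pipe() also closes stdin once the data has been written.
    Readable.from([input]).pipe(child.stdin)
  })
}
```

Inside the fs.readFile callback from the question, this could be called as runWithInput('pdftotext', ['-', '-'], binary), and the resolved Buffer decoded with toString().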

The following example downloads a PDF file, pipes its content to pdftotext, and then shows the first few characters of the result.

const source = 'http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf'
const http = require('http')
const spawn = require('child_process').spawn

download(source).then(pdftotext)
.then(result => console.log(result.slice(0, 77)))

function download(url) {
  return new Promise(resolve => http.get(url, resolve))
}

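function pdftotext(binaryStream) {
  // '-' '-' tells pdftotext to read the PDF from stdin and write text to stdout
  const command = spawn('pdftotext', ['-', '-'])
  binaryStream.pipe(command.stdin)

  return new Promise((resolve, reject) => {
    const chunks = []
    command.on('error', reject) // e.g. pdftotext is not installed
    // collect raw Buffers and decode once, so multi-byte characters
    // split across chunks are not corrupted
    command.stdout.on('data', chunk => chunks.push(chunk))
    command.stdout.on('end', () => resolve(Buffer.concat(chunks).toString()))
  })
}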

For CLIs that have no option to read from stdin, you can use named pipes.

Edit: added another example, this time with named pipes.

Once the named pipes are created, you can use them like files. The following example creates temporary named pipes to send the input and receive the output, then shows the first few characters of the result.

const fs = require('fs')
const spawn = require('child_process').spawn

pipeCommand({
  name: 'wvText',
  input: fs.createReadStream('document.doc'),
}).then(result => console.log(result.slice(0, 77)))

function createPipe(name) {
  return new Promise(resolve =>
    spawn('mkfifo', [name]).on('exit', () => resolve()))
}

function pipeCommand({name, input}) {
  const inpipe = 'input.pipe'
  const outpipe = 'output.pipe'
  return Promise.all([inpipe, outpipe].map(createPipe)).then(() => {
    // collect raw Buffers and decode once at the end
    const chunks = []
    fs.createReadStream(outpipe)
    .on('data', chunk => chunks.push(chunk))
    .on('error', console.log)

    const command = spawn(name, [inpipe, outpipe]).on('error', console.log)
    input.pipe(fs.createWriteStream(inpipe).on('error', console.log))
    return new Promise(resolve =>
      command.on('exit', () => {
        // remove the temporary pipes; fs.unlink without a callback throws in current Node.js
        [inpipe, outpipe].forEach(pipe => fs.unlinkSync(pipe))
        resolve(Buffer.concat(chunks).toString())
      }))
  })
}
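
The example above uses fixed pipe names in the current directory, so two concurrent calls would collide. A minimal sketch of one way around that is to create each pipe inside a fresh temporary directory. The helper name createTempPipe and the 'pipes-' directory prefix are made up for illustration, and the sketch assumes a POSIX system with the mkfifo command on the PATH.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')
const { execFile } = require('child_process')

// Hypothetical helper: create a named pipe with a unique path inside a new
// temporary directory and resolve with that path.
function createTempPipe(label) {
  return new Promise((resolve, reject) => {
    fs.mkdtemp(path.join(os.tmpdir(), 'pipes-'), (error, dir) => {
      if (error) return reject(error)
      const pipePath = path.join(dir, label)
      // execFile reports a non-zero exit code from mkfifo as an error
      execFile('mkfifo', [pipePath], mkfifoError =>
        mkfifoError ? reject(mkfifoError) : resolve(pipePath))
    })
  })
}
```

The resolved path can be used wherever the fixed names 'input.pipe' and 'output.pipe' appear above; the caller is then responsible for removing both the pipe and its directory afterwards.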
