How to extract particular text from a PDF using PHP

Problem description

I need to store each candidate's name and ID in a MySQL table. I have already extracted the text using pdfparser:

<?php

// Include Composer autoloader if not already done.
require 'vendor/autoload.php';

// Parse pdf file and build necessary objects.
$parser = new  \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('C:\Desktop\Data\ApplicationForm.pdf');

$text = $pdf->getText();
echo $text;

?>

Right now it only prints the extracted text. I need to pull the name and ID out of that output (the page shown when the program above runs). Looking at that page with the browser's view-source, the ID I need appears on:

tr 1115*15 td.line-number 31*15 and td.line-content:1084*15, line number value = 12

and the name appears on:

tr 1115*15 td.line-number 31*15 and td.line-content:1084*15, line number value = 13

I am lost at this point, as I don't know how to get at this information. Please help.

I have multiple PDFs, and the information I need is in the same place in all of them (by same place I mean line number value = 13, tr 1115*15 td.line-number 31*15 and td.line-content:1084*15). I just want a way to solve this problem.

If anything is unclear, I will clarify or improve the question.

Solution

I needed to extract each candidate's name and ID from a PDF. After extracting the text with pdfparser, I had PHP serve the resulting page as a downloadable text file:

<?php
$filename = 'filename.txt';
// Send the output as a downloadable plain-text file instead of rendering it.
header('Content-Disposition: attachment; filename="' . $filename . '"');
header('Content-Type: text/plain');

// Include Composer autoloader if not already done.
include 'C:\Users\Downloads\pdfparser-master (1)\pdfparser-master\vendor\autoload.php';

// Parse pdf file and build necessary objects.
$parser = new  \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('C:\Users\Desktop\Data\ApplicationForm (3).pdf');

$text = $pdf->getText();
echo $text;


?>

I did this because the information I need is on lines 12 and 13 of the view-source page, and the same holds for all the PDFs I need to process. After downloading the page as a text file, I used the code below to extract the text I needed from the downloaded file and store it in the database:

<?php

// Read the downloaded file into an array of lines without trailing newlines.
$source = file('filename.txt', FILE_IGNORE_NEW_LINES);

// file() indices start at 0, so index 12 is the 13th line of the file.
$number = trim($source[12]);
$name   = trim($source[13]);

// URL-encode the name so spaces and special characters survive in the links.
$gslink   = 'https://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=google+scholar+' . urlencode($name);
$dblplink = 'https://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=dblp+' . urlencode($name);

$servername = '127.0.0.1';
$username   = 'root';
$password   = '';
$dbname     = 'mydb';

// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
    die('Connection failed: ' . $conn->connect_error);
}

// A prepared statement keeps quotes in the extracted text from breaking the query.
$stmt = $conn->prepare('INSERT INTO faculty (candidate_no, candidate_name, gs_link, dblp_link) VALUES (?, ?, ?, ?)');
$stmt->bind_param('ssss', $number, $name, $gslink, $dblplink);
if ($stmt->execute()) {
    echo 'New record created successfully';
} else {
    echo 'Error: ' . $stmt->error;
}

$stmt->close();
$conn->close();
?>
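
The download step may not be strictly necessary: the string returned by getText() can probably be split into lines in the same script. The following is only a minimal sketch of that idea, not the approach used in the answer. The helper name lineAt is hypothetical (it is not part of pdfparser), and the sample text is made up to stand in for real getText() output.

```php
<?php

/**
 * Return the line at a 1-based position in a block of text,
 * or null when the text has fewer lines than that.
 * Hypothetical helper for illustration; not part of pdfparser.
 */
function lineAt(string $text, int $lineNumber): ?string
{
    // \R matches any newline sequence (\n, \r\n or \r).
    $lines = preg_split('/\R/', $text);
    if ($lineNumber < 1 || $lineNumber > count($lines)) {
        return null;
    }
    // Convert the 1-based line number to a 0-based array index.
    return trim($lines[$lineNumber - 1]);
}

// Made-up stand-in for the string returned by $pdf->getText().
$text = "Application Form\nCandidate No\nA-1024\nJane Doe\n";

$number = lineAt($text, 3);
$name   = lineAt($text, 4);

echo $number, ' ', $name, PHP_EOL;
```

With a real PDF, the call would be something like lineAt($pdf->getText(), 12). The line numbers in the extracted string may not match the ones seen in view-source, so they are worth checking against a few sample files first.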
