How do I efficiently extract meaningful data from a PDF with a set structure?

Views: 72 Posted: 2019/6/12 14:51:52 Tags: C# C#4.0 PDF itextsharp

Problem description

Hi,
I've been working on an application to help manage the departments of a Health Management Organization.
It's my first commercial piece of software, and I'm excited about wrapping it up this week if I can get this issue sorted out.
Here's the problem...
The MIS department receives a PDF 4 times a year. This PDF contains 2 pieces of information.

a) A list of all hospitals registered under the organisation.

b) A list of enrollees registered under each hospital.

I was tasked with writing a program that retrieves all hospitals in the PDF, registers them in the application's database, then retrieves all enrollees and registers them in the database under their respective hospitals (this is managed using a foreign key relationship in the database).

I used Regex to write a solution that registers all the hospitals, and save for some delay when parsing the PDF (which is 4000 pages long), it works perfectly.

The problem is that my solution to register the enrollees is not as efficient as it should be: about 2 out of 10 enrollees don't get registered due to inefficiencies in my code.

And when I transfer the already partially working solution to the client's server where it will finally reside, I get an error which says "Source Code Could Not Be Found". But when I run it in debug mode to check what the problem might be, it extracts the enrollee details as expected. So I'm very confused about that.

If I can get help with a) the "Source Code Could Not Be Found" error or b) why my code works on my development machine but not on the server, I would be very grateful.

I'll include my code. I would include a snapshot of the PDF as well, but I doubt Stack lets you add attachments to questions.

Thanks.

private void extractEnrolleesFromPDF(string enrolleeExtraction, string hospital)
        {
            int start;
            int end;
            string substring;
            
            try
            {
                MatchCollection policyNumbers = Regex.Matches(enrolleeExtraction, @"(\*)(\d{8})(\*)");

                foreach (var policyNumber in policyNumbers)
                {
                    Match match = Regex.Match(enrolleeExtraction, "\\" + policyNumber.ToString());
                    if (match.Success)
                    {
                        //Store the first occurrence of the enrollee's policy number
                        start = match.Index;

                        Match match2 = Regex.Match(enrolleeExtraction.Substring(start + 10), @"(\*)");
                        if (match2.Success)
                        {
                            end = match2.Index + 9;

                            substring = enrolleeExtraction.Substring(start, end);

                            enrolleePolicyNumber.Add(substring);
                        }                        
                    }
                }
                //Extract enrollee data and insert into the database

                ArrayList individualEnrolees = new ArrayList();

                int numberOfEnrollees = enrolleePolicyNumber.Count;
                bool principal = false;
                string fName;
                string lName;
                DateTime dob;
                string sex;
                string hospitalCode = hospital.Substring(1, 7);
                for (int i = 0; i < numberOfEnrollees; i++)
                {
                    string enrolleePolNumber;
                    Match policyNumber = Regex.Match(enrolleePolicyNumber[i].ToString(), @"((\*)(\d{8})(\*))");
                    if (policyNumber.Success)
                    {
                        enrolleePolNumber = policyNumber.Value;
                    }
                    MatchCollection enrolleeRecords = Regex.Matches(enrolleePolicyNumber[i].ToString(), @"(\d{1})(\s)(\D*)(\d{2})/(\d{2})/(\d{4})");

                    //Empty the array list each time to avoid going over the same records over and over again
                    individualEnrolees.Clear();

                    foreach (var record in enrolleeRecords)
                    {
                        individualEnrolees.Add(record);
                    }

                    //The way our search works at the moment is that it uses the pattern *-------* at the beginning and end to
                    //mark where an enrollee's records begin and end. The problem now is that the last record does not have
                    //that pattern at the end. So we need to find a way to retrieve the last record and add it to the collection we parse
                    //for the enrollee data.
                    try
                    {
                        Match lastPolicyNumberInHospital = Regex.Match(enrolleeExtraction, @"(\*)(\d{8})(\*)", RegexOptions.RightToLeft);

                        string lastRecord = enrolleeExtraction.Substring(lastPolicyNumberInHospital.Index);

                        enrolleePolicyNumber.Add(lastRecord);
                    }
                    catch (Exception ex)
                    {
                        MessageBox.Show("Failed to extract last record: " + ex.Message);
                    }

                    foreach (var record in individualEnrolees)
                    {
                        string princ;

                        string[] splitEnrolleeData = record.ToString().Split(' ');

                        //int splitSectionCount counts how many sections our split enrollee data has
                        int splitSectionCount = splitEnrolleeData.Count();

                        //if we have five sections then we expect the Principal or Spouse record to be
                        //on index 1
                        if (splitSectionCount == 5)
                        {
                            princ = splitEnrolleeData[1].ToString();
                            if (princ == "Principal")
                            {
                                principal = true;
                            }
                            else
                            {
                                principal = false;
                            }
                        }
                        //if we have four sections then we expect the Principal or Spouse record to be
                        //on index 0.
                        //i.e. Merged with the serial number so we check to see if it contains
                        //the string "Principal" or "Spouse"
                        else if (splitSectionCount == 4)
                        {
                            if (splitEnrolleeData[0].ToString().Contains("0"))
                            {
                                principal = true;
                            }
                            else if (!splitEnrolleeData[0].ToString().Contains("0"))
                            {
                                principal = false;
                            }
                        }
                        //TO-DO: Eliminate this comment block if the else-if above works properly
                        //princ = splitEnrolleeData[1].ToString();
                        //if (princ == "Principal")
                        //{
                        //    principal = true;
                        //}
                        //else
                        //{
                        //    principal = false;
                        //}

                        enrolleePolNumber = policyNumber.Value.Substring(1, policyNumber.Value.Length - 2);

                        //if we have 5 sections as expected carry on and register the enrollee as usual
                        //if not, if we have 4 do something else
                        //this is because some enrollees in the NHIS PDF aren't split properly, returning
                        //4 items instead of 5
                        if (splitSectionCount == 5)
                        {
                            lName = splitEnrolleeData[2].ToString();
                            fName = splitEnrolleeData[3].ToString();
                            dob = Convert.ToDateTime(splitEnrolleeData[4].ToString());
                            hosp = getHospitalID(hospitalCode);
                            if (principal == true)
                            {
                                if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
                                {
                                    registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                            else if (principal == false)
                            {
                                if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
                                {
                                    registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                        }
                        else if (splitSectionCount == 4)
                        {
                            lName = splitEnrolleeData[1].ToString();
                            fName = splitEnrolleeData[2].ToString();
                            dob = Convert.ToDateTime(splitEnrolleeData[3].ToString());
                            hosp = getHospitalID(hospitalCode);
                            if (principal == true)
                            {
                                if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
                                {
                                    registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                            else if (principal == false)
                            {
                                if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
                                {
                                    
                                        registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());

                                    //else if (!parentExists(enrolleePolNumber))
                                    //{
                                        
                                    //}
                                }
                            }
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                MetroFramework.MetroMessageBox.Show(this, "Error retrieving substring: " + ex.Message);
            }

        }
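
For comparison, here is a minimal sketch of the approach the comments in the listing describe: cutting the extracted text into one block per policy-number marker, with the last block running to the end of the text so it needs no closing marker. It is only an illustration under stated assumptions, not a drop-in fix. The class and method names (EnrolleeBlockSketch, SplitByPolicyNumber, ExtractBirthDates), the sample string, and the dd/MM/yyyy date order are all made up for the example; the two regular expressions are the ones from the listing. Two things in the listing may be worth checking against it: the last record is added to enrolleePolicyNumber inside the loop, after numberOfEnrollees has already been read, so that loop does not appear to reach it; and Convert.ToDateTime parses with the current culture of the machine, which can differ between a development machine and a server.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EnrolleeBlockSketch
{
    // Same marker as in the question: an asterisk, eight digits, an asterisk.
    private static readonly Regex PolicyMarker = new Regex(@"\*(\d{8})\*");

    // Same record shape as the question's pattern; the date is captured as a named group.
    private static readonly Regex RecordPattern = new Regex(@"\d\s\D*(?<dob>\d{2}/\d{2}/\d{4})");

    // Returns (policy number, block text) pairs. Each block runs from its own marker
    // to the next marker, or to the end of the text for the last marker.
    public static List<KeyValuePair<string, string>> SplitByPolicyNumber(string text)
    {
        var blocks = new List<KeyValuePair<string, string>>();
        MatchCollection markers = PolicyMarker.Matches(text);
        for (int i = 0; i < markers.Count; i++)
        {
            int start = markers[i].Index;
            int end = (i + 1 < markers.Count) ? markers[i + 1].Index : text.Length;
            string policyNumber = markers[i].Groups[1].Value;
            blocks.Add(new KeyValuePair<string, string>(policyNumber, text.Substring(start, end - start)));
        }
        return blocks;
    }

    // Parses every date of birth in a block with a fixed format and the invariant culture,
    // so the result does not depend on the regional settings of the machine.
    // ASSUMPTION: dates are day/month/year; adjust the format string if the PDF differs.
    public static List<DateTime> ExtractBirthDates(string block)
    {
        var dates = new List<DateTime>();
        foreach (Match record in RecordPattern.Matches(block))
        {
            DateTime dob;
            if (DateTime.TryParseExact(record.Groups["dob"].Value, "dd/MM/yyyy",
                    CultureInfo.InvariantCulture, DateTimeStyles.None, out dob))
            {
                dates.Add(dob);
            }
        }
        return dates;
    }

    public static void Main()
    {
        // Hypothetical sample text; the real PDF layout may differ.
        string sample = "*12345678* 0 Principal DOE JOHN 25/12/1980 1 Spouse DOE JANE 03/04/1982 "
                      + "*87654321* 0 Principal ROE SAM 31/01/1975";

        foreach (var block in SplitByPolicyNumber(sample))
        {
            Console.WriteLine(block.Key + ": " + ExtractBirthDates(block.Value).Count + " record(s)");
        }
    }
}
```

Because the block boundaries come from the indices of the matches themselves, a policy number that occurs more than once in the text is still sliced at its own position rather than at its first occurrence.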
