How do I efficiently extract meaningful data from a PDF with a set structure?


Problem description



Hi,
I've been working on an application to help manage the departments of a Health Management Organization.
It's my first commercial piece of software, and I'm excited about wrapping it up this week if I can get this issue sorted out.
Here's the problem...
The MIS department receives a PDF four times a year. This PDF contains two pieces of information:

a) A list of all hospitals registered under the organisation.

b) A list of enrollees registered under each hospital.

I was tasked with writing a program that retrieves all hospitals in the PDF and registers them in the application's database, then retrieves all enrollees and registers them in the database under their respective hospitals (this is managed using a foreign key relationship in the database).

I used Regex to write a solution that registers all the hospitals, and save for some delay when parsing the PDF (which is 4000 pages long), it works perfectly.

The problem is that my solution for registering the enrollees is not as efficient as it should be: about 2 out of 10 enrollees don't get registered due to inefficiencies in my code.

And when I transfer the already partially working solution to the client's server, where it will finally reside, I get an error which says "Source Code Could Not Be Found". But when I run it in debug mode to check what the problem might be, it extracts the enrollee details as expected. So I'm very confused about that.

If I can get help with a) the "Source Code Could Not Be Found" error or b) why my code works on my development machine but not on the server, I would be very grateful.

I'll include my code. I would also include a snapshot of the PDF, but I doubt Stack lets you attach files to questions.

Thanks.

private void extractEnrolleesFromPDF(string enrolleeExtraction, string hospital)
        {
            int start;
            int end;
            string substring;
            
            try
            {
                MatchCollection policyNumbers = Regex.Matches(enrolleeExtraction, @"(\*)(\d{8})(\*)");

                foreach (var policyNumber in policyNumbers)
                {
                    Match match = Regex.Match(enrolleeExtraction, "\\" + policyNumber.ToString());
                    if (match.Success)
                    {
                        //Store the index of the first occurrence of the enrollee's policy number
                        start = match.Index;

                        Match match2 = Regex.Match(enrolleeExtraction.Substring(start + 10), @"(\*)");
                        if (match2.Success)
                        {
                            end = match2.Index + 9;

                            substring = enrolleeExtraction.Substring(start, end);

                            enrolleePolicyNumber.Add(substring);
                        }                        
                    }
                }
                //Extract enrollee data and insert it into the database

                ArrayList individualEnrolees = new ArrayList();

                int numberOfEnrollees = enrolleePolicyNumber.Count;
                bool principal = false;
                string fName;
                string lName;
                DateTime dob;
                string sex;
                string hospitalCode = hospital.Substring(1, 7);
                for (int i = 0; i < numberOfEnrollees; i++)
                {
                    string enrolleePolNumber;
                    Match policyNumber = Regex.Match(enrolleePolicyNumber[i].ToString(), @"((\*)(\d{8})(\*))");
                    if (policyNumber.Success)
                    {
                        enrolleePolNumber = policyNumber.Value;
                    }
                    MatchCollection enrolleeRecords = Regex.Matches(enrolleePolicyNumber[i].ToString(), @"(\d{1})(\s)(\D*)(\d{2})/(\d{2})/(\d{4})");

                    //Empty the array list each time to avoid going over the same records over and over again
                    individualEnrolees.Clear();

                    foreach (var record in enrolleeRecords)
                    {
                        individualEnrolees.Add(record);
                    }

                    //The way our search works at the moment is that it uses the pattern *-------* at the beginning and end to
                    //mark where an enrollee's records begin and end. The problem now is that the last record does not have
                    //that pattern at the end. So we need to find a way to retrieve the last record and add it to the collection we parse
                    //for the enrollee data.
                    try
                    {
                        Match lastPolicyNumberInHospital = Regex.Match(enrolleeExtraction, @"(\*)(\d{8})(\*)", RegexOptions.RightToLeft);

                        string lastRecord = enrolleeExtraction.Substring(lastPolicyNumberInHospital.Index);

                        enrolleePolicyNumber.Add(lastRecord);
                    }
                    catch (Exception ex)
                    {
                        MessageBox.Show("Failed to extract last record: " + ex.Message);
                    }

                    foreach (var record in individualEnrolees)
                    {
                        string princ;

                        string[] splitEnrolleeData = record.ToString().Split(' ');

                        //splitSectionCount is the number of sections the split enrollee data contains
                        int splitSectionCount = splitEnrolleeData.Count();

                        //if we have five sections then we expect the Principal or Spouse field to be
                        //at index 1
                        if (splitSectionCount == 5)
                        {
                            princ = splitEnrolleeData[1].ToString();
                            if (princ == "Principal")
                            {
                                principal = true;
                            }
                            else
                            {
                                principal = false;
                            }
                        }
                        //if we have four sections then we expect the Principal or Spouse field to be
                        //at index 0,
                        //i.e. merged with the serial number. The check below treats a field
                        //containing "0" as the principal
                        else if (splitSectionCount == 4)
                        {
                            if (splitEnrolleeData[0].ToString().Contains("0"))
                            {
                                principal = true;
                            }
                            else if (!splitEnrolleeData[0].ToString().Contains("0"))
                            {
                                principal = false;
                            }
                        }
                        //TODO: Remove this commented-out block if the else-if above works properly
                        //princ = splitEnrolleeData[1].ToString();
                        //if (princ == "Principal")
                        //{
                        //    principal = true;
                        //}
                        //else
                        //{
                        //    principal = false;
                        //}

                        enrolleePolNumber = policyNumber.Value.Substring(1, policyNumber.Value.Length - 2);

                        //if we have 5 sections as expected, carry on and register the enrollee as usual;
                        //if we have 4, read the fields from shifted positions instead.
                        //this is because some enrollees in the NHIS PDF aren't split properly, returning
                        //4 items instead of 5
                        if (splitSectionCount == 5)
                        {
                            lName = splitEnrolleeData[2].ToString();
                            fName = splitEnrolleeData[3].ToString();
                            dob = Convert.ToDateTime(splitEnrolleeData[4].ToString());
                            hosp = getHospitalID(hospitalCode);
                            if (principal == true)
                            {
                                if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
                                {
                                    registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                            else if (principal == false)
                            {
                                if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
                                {
                                    registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                        }
                        else if (splitSectionCount == 4)
                        {
                            lName = splitEnrolleeData[1].ToString();
                            fName = splitEnrolleeData[2].ToString();
                            dob = Convert.ToDateTime(splitEnrolleeData[3].ToString());
                            hosp = getHospitalID(hospitalCode);
                            if (principal == true)
                            {
                                if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
                                {
                                    registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                            else if (principal == false)
                            {
                                if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
                                {
                                    
                                        registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());

                                    //else if (!parentExists(enrolleePolNumber))
                                    //{
                                        
                                    //}
                                }
                            }
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                MetroFramework.MetroMessageBox.Show(this, "Error retrieving substring: " + ex.Message);
            }

        }
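
The comments in the code above describe using the asterisk-delimited policy number as the marker for where each enrollee's records begin and end, and note that the last enrollee in a hospital has no closing marker. A minimal sketch of one way to handle that, assuming each enrollee's block runs from its own marker to the next marker or to the end of the hospital's text, is to slice between consecutive regex matches instead of searching for the next asterisk. The names `EnrolleeSegmenter` and `SplitByPolicyNumber` are hypothetical and not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EnrolleeSegmenter
{
    // Same marker shape as in the question: asterisk, eight digits, asterisk.
    private static readonly Regex PolicyMarker = new Regex(@"\*(\d{8})\*");

    // Returns one (policy number, block text) pair per marker. Each block runs
    // from the end of its marker to the start of the next marker; the last
    // block runs to the end of the text, so it needs no closing marker.
    public static List<KeyValuePair<string, string>> SplitByPolicyNumber(string hospitalText)
    {
        var blocks = new List<KeyValuePair<string, string>>();
        MatchCollection markers = PolicyMarker.Matches(hospitalText);

        for (int i = 0; i < markers.Count; i++)
        {
            Match current = markers[i];
            int bodyStart = current.Index + current.Length;
            int bodyEnd = (i + 1 < markers.Count) ? markers[i + 1].Index : hospitalText.Length;

            string body = hospitalText.Substring(bodyStart, bodyEnd - bodyStart);
            blocks.Add(new KeyValuePair<string, string>(current.Groups[1].Value, body));
        }

        return blocks;
    }
}
```

Because every block's position comes from the match itself, there is no second search for the policy number and no hand-computed offset, and the final enrollee is picked up in the same loop as the others.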

Recommended answer

It's impressive that you can get 80% of those records. Unfortunately, even though it is in PDF format, the data itself is not well formed when it goes in. If you are a consultant, there isn't much you can do, but if you are a regular employee you can throw out some suggestions.

On the programming side, I would track all of the failed records and look for a correlation in why they failed, such as the hospital they came from, the data type, or the length. This would help narrow it down further. I would also beg them to submit the document in CSV, Excel, or XML format, because I am guessing it comes from some sort of Office application, and someone thinks it looks nice as a PDF and is small enough to be sent through email.
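
As a rough sketch of that kind of failure tracking, the snippet below classifies a raw record by the reason it cannot be used and counts failures per reason. Everything named here (`FailedRecord`, `EnrolleeAudit`, the reason labels) is hypothetical, and the `dd/MM/yyyy` date format is an assumption about the PDF rather than something the question states. Dates are parsed with `DateTime.TryParseExact` and the invariant culture because `Convert.ToDateTime(string)` uses the current culture, so the same text can parse differently on machines with different regional settings. Whether that explains the difference between the development machine and the server here is not known.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical container for a record that could not be registered.
public sealed class FailedRecord
{
    public string HospitalCode { get; set; }
    public string PolicyNumber { get; set; }
    public string RawText { get; set; }
    public string Reason { get; set; }
}

public static class EnrolleeAudit
{
    // Assumption: dates in the PDF are day/month/year. An explicit format and
    // the invariant culture make the result independent of regional settings.
    public static bool TryParseDob(string text, out DateTime dob)
    {
        return DateTime.TryParseExact(text.Trim(), "dd/MM/yyyy",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out dob);
    }

    // Returns null when the record has a usable shape, otherwise a reason label.
    // The 4-or-5 field rule mirrors the two cases handled in the question's code.
    public static string Validate(string rawRecord)
    {
        string[] parts = rawRecord.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 4 && parts.Length != 5)
        {
            return "FieldCount";
        }

        DateTime dob;
        if (!TryParseDob(parts[parts.Length - 1], out dob))
        {
            return "DateFormat";
        }

        return null;
    }

    // Counts failures per reason so that recurring causes stand out.
    public static Dictionary<string, int> CountByReason(IEnumerable<FailedRecord> failures)
    {
        return failures
            .GroupBy(f => f.Reason)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

The same grouping could be done on `HospitalCode` instead of `Reason` to see whether the failures cluster around particular hospitals. Keeping `RawText` on each failure makes it possible to show the people who compile the document exactly which entries could not be read.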

Lastly, if you can show which records are failing and why, I'd find out who compiled the document and ask them to poka-yoke the input forms. Poka-yoke is a Japanese term used in Lean Manufacturing for making a process foolproof (mistake-proofing). Typically office forms, such as those from a doctor's office, are the worst, as they are hand-entered and difficult to understand; thus you end up with garbage data.

