Extract GPS coordinates from a .docx file with Python

Question

I have a tedious task to do and need some help from Python. Please see this Word document.

I need to extract the text and GPS coordinates from each row. There are currently over 100 coordinates across 10 docx files. My "hefty" Python knowledge has gotten me this far:

from docx import Document
import re

main_file = Document("D:/DOCUMENTS/Google_Link/1  Category I/1  Category I.docx")
table = main_file.tables[1]  # this is the same for every document

data = []
keys = None

for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # The first row holds the column headers.
    if i == 0:
        keys = tuple(text)
        continue

    row_data = tuple(text)
    data.append(row_data)

# Reference IDs look like "C1-20701-17-1".
regexReference = re.compile(r"(C.-)\w+")
colReference = [item[1] for item in data]

listReference = filter(regexReference.match, colReference)

for i in listReference:
    print(i.encode('UTF-8'))

With this I can print the 16 reference IDs from column 2. Please guide me toward printing something like this:

C1-20701-17-1

some site, some region

The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires 
some repair/maintenance works including electrical wiring and electrical 
lights and appliances like ceiling fans supplies. Detail specification of 
the works are attached

x = 91°38'28.2"E
y = 22°40'34.3"N

These XY locations and descriptions will be used afterwards to create KML files that are attached to each document. I'd prefer a variable for each part of the section above (reference ID, location, description, x and y) so that I can automate that step as well.
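For context, KML writes coordinates as decimal degrees in longitude,latitude order, so the degree/minute/second strings above would need converting before they go into a KML file. The following is only a minimal sketch of that conversion: `dms_to_decimal` and `DMS_PATTERN` are hypothetical names, not part of the original post, and the pattern assumes every coordinate follows the `22°40'34.3"N` form shown in the sample.

```python
import re

# Hypothetical helper (not from the original post). Assumes coordinates are
# always written as degrees°minutes'seconds"hemisphere, e.g. 22°40'34.3"N.
DMS_PATTERN = re.compile(r"""(\d+)°\s*(\d+)'\s*(\d+(?:\.\d+)?)"\s*([NSEW])""")


def dms_to_decimal(dms):
    """Convert one DMS string into signed decimal degrees."""
    match = DMS_PATTERN.search(dms)
    if match is None:
        raise ValueError("not a DMS coordinate: %r" % dms)
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60.0 + float(seconds) / 3600.0
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value


x = dms_to_decimal('91°38\'28.2"E')  # longitude
y = dms_to_decimal('22°40\'34.3"N')  # latitude

# KML expects "longitude,latitude" inside a <coordinates> element.
kml_coordinates = "{:.6f},{:.6f}".format(x, y)
print(kml_coordinates)
```

Keeping `x` as longitude and `y` as latitude matches the labels requested above and the order KML expects.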

Demo docx

Recommended answer

I don't know whether this will work on files that follow a different pattern (note that I'm using Python 2.7.11):

# -*- coding: utf-8 -*-
from docx import Document
import sys
import os
import re

reload(sys)
sys.setdefaultencoding('utf8')

for root, dirs, files in os.walk("."):
    for name in files:
        doc_file = os.path.join(root, name)
        if doc_file.endswith('docx'):
            main_file = Document(doc_file)
            table = main_file.tables[1]  # this is same for every document

            data = []
            keys = None

            for i, row in enumerate(table.rows):
                text = (cell.text for cell in row.cells)

                if i == 0:
                    keys = tuple(text)
                    continue

                row_data = tuple(text)
                data.append(row_data)

            regexReference = re.compile("(C.-[0-9-]+)")
            regexCoordinate = re.compile(r'(N-(.{,12})([0-9]|\')|[0-9].{,12}N)[;, ]+(E-(.{,12})([0-9]|\')|[0-9].{,12}E)')

            result = []
            for item in data:
                tmp = dict()
                matchReference = regexReference.search(item[1])
                matchCoordinate = regexCoordinate.search(unicode(item[2]))
                if matchReference:
                    tmp['reference'] = matchReference.group()
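                if matchCoordinate:
                    # group(1) is the N (latitude) part and group(4) the E
                    # (longitude) part, so here x holds latitude and y holds
                    # longitude, the reverse of the labels in the question.
                    tmp['x'] = matchCoordinate.group(1)
                    tmp['y'] = matchCoordinate.group(4)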
                tmp['description'] = unicode(item[2])
                tmp['location'] = unicode(item[3])
                result.append(tmp)

            for rs in result:
                if 'reference' in rs:
                    for k, v in rs.iteritems():
                        print('{} = {}'.format(k, v))
                    print

# Output:
# --------------------------------
# y = 91°38'28.2"E
# x = 22°40'34.3"N
# description = The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached.
# reference = C1-20701-17-1
# location = xxxxx Site, c Region
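The code above targets Python 2.7; `reload(sys)`, `sys.setdefaultencoding`, `unicode` and `dict.iteritems` are not available under Python 3. As a rough sketch of how the per-row parsing might look there, the example below works on plain strings so it can run without a .docx file. `parse_row` and the `COORDINATE` pattern are hypothetical, not the answer's own code: the pattern is a simplification that assumes latitude comes first and both values use the degree/minute/second form of the sample row. It also assigns x to the easting and y to the northing, as the question asked.

```python
import re

# Same reference pattern as in the answer above.
REFERENCE = re.compile(r"C.-[0-9-]+")
# Simplified coordinate pattern (an assumption): latitude first, then
# longitude, both written as degrees°minutes'seconds"hemisphere.
COORDINATE = re.compile(
    r"""(?P<lat>\d+°\d+'[\d.]+"[NS])[;, ]+(?P<lon>\d+°\d+'[\d.]+"[EW])"""
)


def parse_row(reference_cell, description_cell, location_cell):
    """Return a dict with reference, location, description, x and y."""
    record = {
        "reference": None,
        "location": location_cell.strip(),
        "description": description_cell.strip(),
        "x": None,
        "y": None,
    }
    ref = REFERENCE.search(reference_cell)
    if ref:
        record["reference"] = ref.group()
    coord = COORDINATE.search(description_cell)
    if coord:
        record["x"] = coord.group("lon")  # easting, as the question requested
        record["y"] = coord.group("lat")  # northing
    return record


row = parse_row(
    "C1-20701-17-1",
    'The existing CMC Office at Bariyodhala (22°40\'34.3"N; 91°38\'28.2"E) '
    "requires some repair/maintenance works.",
    "some site, some region",
)
for key, value in row.items():
    print("{} = {}".format(key, value))
```

With python-docx, the three arguments would come from the text of cells 1, 2 and 3 of each table row, following the column layout the answer assumes.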
