Beautiful Soup - `findAll` not capturing all tags in SVG (`ElementTree` does)

Problem description

I was attempting to generate a choropleth map by modifying an SVG map depicting all counties in the US. The basic approach is captured by Flowing Data. Since SVG is basically just XML, the approach leverages the BeautifulSoup parser.

The thing is, the parser does not capture all path elements in the SVG file. The following captured only 149 paths (out of over 3000):

# Open SVG file
with open(shp_dir + 'USA_Counties_with_FIPS_and_names.svg', 'r') as f:
    svg = f.read()

# Parse SVG
soup = BeautifulSoup(svg, selfClosingTags=['defs', 'sodipodi:namedview'])

# Identify counties
paths = soup.findAll('path')

len(paths)

I know, however, that many more exist, both from inspecting the file by hand and from the fact that ElementTree methods capture 3,143 paths with the following routine:

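# Assumed import; the question does not show how ET was imported
import xml.etree.ElementTree as ET

# Parse SVG
tree = ET.parse(shp_dir + 'USA_Counties_with_FIPS_and_names.svg')

# Get the root <svg> element
root = tree.getroot()

# Compile list of path IDs from the root's direct children
ids = []
for child in root:
    # Tags carry a namespace prefix, hence the substring test
    if 'path' in child.tag:
        ids.append(child.attrib['id'])

len(ids)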
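
The substring test `'path' in child.tag` is needed because ElementTree stores each tag together with its namespace, so an SVG path appears as `{http://www.w3.org/2000/svg}path` rather than `path`. Below is a minimal sketch of a stricter, namespace-aware count. It uses a tiny made-up SVG string in place of the real map; the ids and the `count_paths` helper are illustrative assumptions, not part of the original file or post.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical stand-in for the county map:
# two top-level paths and one path nested inside a group.
SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path id="01001" d="M0 0L1 1"/>'
    '<path id="State_Lines" d="M0 0L2 2"/>'
    '<g><path id="01003" d="M1 1L2 2"/></g>'
    '</svg>'
)


def count_paths(root):
    """Return (number of direct-child paths, number of paths anywhere in the tree)."""
    path_tag = "{%s}path" % SVG_NS
    direct = [child for child in root if child.tag == path_tag]
    anywhere = list(root.iter(path_tag))
    return len(direct), len(anywhere)


root = ET.fromstring(SAMPLE_SVG)
direct_count, total_count = count_paths(root)
print(direct_count, total_count)
```

Iterating over `root` only visits its direct children, whereas `root.iter(...)` walks the whole tree. For a file whose paths all sit at the top level, the two counts agree.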

I have not yet figured out how to write the modified ElementTree object back to disk without the result being all messed up.

# Define style template string (the fill color is appended per path)
style = ('font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;'
         'stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;'
         'stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:')

# For each direct child of the root...
for child in root:
    # ...if it is a path...
    if 'path' in child.tag:
        try:
            # ...update the style to the new string with a county-specific color...
            child.attrib['style'] = style + col_map[child.attrib['id']]
        except KeyError:
            # ...and if its id is not in col_map (no ACS data), fill it with gray
            child.attrib['style'] = style + '#d0d0d0' + '\n'

# Write modified SVG to disk
tree.write(shp_dir + 'mhv_by_cty.svg')

The modification/write routine above yields this monstrosity (image not reproduced here):

My primary question is this: why did BeautifulSoup fail to capture all of the path tags? Second, why would the image modified with the ElementTree objects have all of that extracurricular activity going on? Any advice would be greatly appreciated.

Solution

alexce's answer is correct for your first question. As far as your second question is concerned:

"why would the image modified with the ElementTree objects have all of that extracurricular activity going on?"

the answer is pretty simple - not every <path> element draws a county. Specifically, there are two elements, one with id="State_Lines" and one with id="separator", that should be eliminated. You didn't supply your dataset of colors, so I just used a random hex color generator (adapted from here) for each county, then used lxml to parse the .svg's XML and iterate through each <path> element, skipping the ones I mentioned above:

from lxml import etree as ET
import random

def random_color():
    r = lambda: random.randint(0,255)
    return '#%02X%02X%02X' % (r(),r(),r())

new_style = 'font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:'

tree = ET.parse('USA_Counties_with_FIPS_and_names.svg')
root = tree.getroot()
for child in root:
    if 'path' in child.tag and child.attrib['id'] not in ["separator", "State_Lines"]:
        child.attrib['style'] = new_style + random_color()

tree.write('counties_new.svg')

resulting in this nice image (not reproduced here).
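
The same idea can be sketched with only the standard library's `xml.etree.ElementTree`, using a color lookup with a gray fallback in the spirit of the question's `col_map`. This is a sketch under assumptions rather than the answerer's code: the inline SVG, the ids, the `col_map` contents, the shortened style string, and the `recolor` helper are all made up for illustration. `ET.register_namespace` is called so that the serialized output keeps plain `<svg>` and `<path>` tags instead of `ns0:` prefixes.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the SVG namespace as the unprefixed default when serializing.
ET.register_namespace("", SVG_NS)

# Non-county paths named in the answer above.
SKIP_IDS = {"State_Lines", "separator"}
# Shortened style string, for illustration only; the fill color is appended.
STYLE_PREFIX = "stroke:#FFFFFF;stroke-width:0.1;fill:"

# Hypothetical stand-in for the county map.
SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path id="01001" d="M0 0L1 1"/>'
    '<path id="01003" d="M1 1L2 2"/>'
    '<path id="State_Lines" style="fill:none;stroke:#000000" d="M0 0L2 2"/>'
    '<path id="separator" style="fill:none;stroke:#000000" d="M0 2L2 0"/>'
    '</svg>'
)


def recolor(root, col_map, default="#d0d0d0"):
    """Restyle every county path; leave the paths listed in SKIP_IDS untouched."""
    for path in root.iter("{%s}path" % SVG_NS):
        path_id = path.get("id")
        if path_id in SKIP_IDS:
            continue
        # Counties missing from col_map fall back to the default gray.
        path.set("style", STYLE_PREFIX + col_map.get(path_id, default))
    return root


root = recolor(ET.fromstring(SAMPLE_SVG), {"01001": "#FF0000"})
styles = {p.get("id"): p.get("style") for p in root.iter("{%s}path" % SVG_NS)}
output = ET.tostring(root, encoding="unicode")
print(output)
```

Using `dict.get` with a default replaces the `try`/`except` from the question, and checking the id before styling keeps the state outlines and the separator from being filled.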
