Tabulizer package in R: how to scrape tables after a specific title


Problem description


How can I scrape tables preceded by some title text from a PDF? I am experimenting with the tabulizer package. Here is an example of getting a table from a specific page (the Polish "Map of Public Health Needs"):

library(tabulizer)
library(tidyverse)
options(java.parameters = "-Xmx8000m")

location <- "http://www.mpz.mz.gov.pl/wp-content/uploads/sites/4/2019/01/mpz_choroby_ukladu_kostno_miesniowego_woj_dolnoslaskie.pdf"

(out <- extract_tables(location, pages = 8, encoding = "UTF-8", method = "stream", outdir = getwd())[[4]] %>%
  as_tibble())

This gets me one table from a specific page. But I will have plenty of such PDFs to scrape from the site http://www.mpz.mz.gov.pl/mapy-dla-30-grup-chorob-2018/ and its subpages, which hold many links, one per illness and province of Poland; I collect those links with rvest (see the rough sketch below). From each PDF I need to scrape the table that appears after a specific title string, e.g.

"Tabela 1.2.2: Struktura zapadalnosci rejestrowanej w zależności od płci, miejsca zamieszkania oraz grupy wiekowej - Choroby układowe tkanki łącznej"

I need to detect "Tabela (...) Struktura zapadalnosci (...)", because the table may not be on the same page in every PDF. Many thanks in advance for any directions and ideas.
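The link collection itself will be something along these lines (a rough sketch only; the "a" selector and the ".pdf" filter are placeholders that still need to be adjusted to the real page structure):

library(rvest)
library(tidyverse)

# Collect candidate PDF links from the listing page and its subpages.
# NOTE: the "a" selector and the "\\.pdf$" filter are placeholders.
base_url <- "http://www.mpz.mz.gov.pl/mapy-dla-30-grup-chorob-2018/"

subpages <- read_html(base_url) %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  na.omit() %>%
  unique()

pdf_links <- map(subpages, possibly(function(u) {
  read_html(u) %>%
    html_nodes("a") %>%
    html_attr("href")
}, otherwise = character(0))) %>%
  unlist() %>%
  str_subset("\\.pdf$") %>%
  unique()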

EDIT: After asking the question, I managed to find the pages where the table might be, though perhaps very inefficiently:

library(pdfsearch)

# Find candidate pages: pages whose text contains both keywords
pages <-
  keyword_search(
    location,
    keyword = c(
      'Tabela',
      'Struktura zapadalnosci rejestrowanej'
    ),
    path = TRUE,
    surround_lines = FALSE
  ) %>%
  group_by(page_num) %>%
  # collapse all keyword hits on a page into a single string ...
  mutate(keyword = paste0(keyword, collapse = ";")) %>%
  # ... and keep only pages where both keywords occur
  filter(
    str_detect(keyword, "Tabela") &
      str_detect(keyword, "Struktura zapadalnosci rejestrowanej")
  ) %>%
  pull(page_num) %>%
  unique()

Solution

I can help you with your basic problem, but there is one catch (see the end). I use pdftools instead of pdfsearch, but in this case it does basically the same thing (finding the pages that contain the table). To save time, I download the PDF only once at the start:

options(java.parameters = "-Xmx8000m") # needs to be set before loading tabulizer
library(tabulizer)
library(tidyverse)

location <- "http://www.mpz.mz.gov.pl/wp-content/uploads/sites/4/2019/01/mpz_choroby_ukladu_kostno_miesniowego_woj_dolnoslaskie.pdf"
download.file(location, "test.pdf", mode = "wb")

Now convert the PDF into a data.frame with one line of text per row:

raw <- pdftools::pdf_data("test.pdf") # one tibble of words (with coordinates) per page
pages <- lapply(seq_along(raw), function(p) {
  if (nrow(raw[[p]]) > 0) {
    raw[[p]]$page <- p # remember which page each word came from
    raw[[p]]
  }
}) %>% 
  bind_rows() %>% 
  group_by(y, page) %>%                               # words with the same y form one text line
  summarise(text = paste(text, collapse = " ")) %>%   # glue the words of a line back together
  arrange(page, y)

This data.frame is searchable, and we keep only the lines that match your key phrase:

tables <- pages %>% 
  filter(grepl("Tabela .* Struktura zapadalnosci", text))

There are 8 lines that match the key phrase, and we extract tables only from those pages. Furthermore, the function inside the lapply loop keeps only the largest matrix (rows plus columns) returned for each page. If there are two tables on one page that might be a problem, but in general it works well to use only the 'best guess' that tabulizer made when finding the table structure.

tables_list <- lapply(tables$page, function(p) {
  cat(p, "\n")
  out <- extract_tables("test.pdf", 
                        pages = p,
                        encoding = "UTF-8", 
                        method = "stream", 
                        output = "matrix")
  out <- as_tibble(out[[which.max(sapply(out, nrow) + sapply(out, ncol))]]) # keep the biggest table
  attr(out, "caption") <- tables$text[tables$page %in% p]
  return(out)
})

The object tables_list now contains a list of data.frames, each a converted table:

> tables_list[[1]]
# A tibble: 16 x 8
   V1                  V2    V3    V4    V5    V6    V7    V8   
   <chr>               <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 dolnośląskie        77,05 74,65 4,04  10,59 13,37 27,87 44,14
 2 kujawsko-pomorskie  78,12 65,93 4,29  14,96 14,82 27,01 38,92
 3 lubelskie           76,50 56,83 2,67  14,83 17,00 29,00 36,50
 4 lubuskie            79,10 76,23 4,92  12,70 12,70 30,74 38,93
 5 łódzkie             74,37 67,77 6,45  13,84 15,09 30,03 34,59
 6 małopolskie         72,71 55,35 6,99  14,63 12,01 25,87 40,50
 7 mazowieckie         76,31 68,52 5,89  12,11 12,30 27,03 42,67
 8 opolskie            79,55 54,65 4,83  10,04 17,47 26,02 41,64
 9 podkarpackie        75,10 47,32 7,57  14,86 18,29 25,31 33,98
10 podlaskie           74,18 68,00 5,82  10,55 17,09 32,36 34,18
11 pomorskie           76,57 74,96 5,71  12,74 13,76 26,65 41,14
12 śląskie             73,51 81,15 4,89  14,96 14,43 26,64 39,08
13 świętokrzyskie      74,45 56,51 4,91  14,00 14,74 27,27 39,07
14 warmińsko-mazurskie 75,91 63,22 5,62  13,59 18,48 29,53 32,79
15 wielkopolskie       72,66 62,71 3,62  14,37 14,77 29,45 37,79
16 zachodniopomorskie  74,26 73,21 8,44  13,71 11,60 24,89 41,35

I also added the (first line of the) caption of each table as an attribute to the data.frame:

> attr(tables_list[[1]], "caption")
[1] "Tabela 1.2.2: Struktura zapadalnosci rejestrowanej w zależności od płci, miejsca zamieszkania oraz grupy"

Compare this to the PDF: it seems this worked well, except that the column names are gone. I'm not sure whether there is a way to retain them, but that was not part of your question, so maybe you already have a solution?
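If you do want the column names back, one rough idea (untested, and the number of header lines is a guess) is to reuse the pages data.frame from above and grab the lines of text that sit directly below the caption, which usually hold the headers:

# Take the text lines directly below the caption on the first table's page.
# How many lines the header spans (here: 2) is an assumption that would
# need to be checked for each table.
header_lines <- pages %>%
  ungroup() %>%
  filter(page == tables$page[1], y > tables$y[1]) %>%
  arrange(y) %>%
  slice_head(n = 2) %>%
  pull(text)

# The resulting strings still need to be split into individual column names
# before assigning them, e.g. names(tables_list[[1]]) <- ...  (hypothetical)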
