Last updated: 2020-05-03

Checks: 7 0

Knit directory: 033_purrr_learning/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200501) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9848eb4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  data/gap_copy.rds
    Untracked:  data/gap_mod.rds
    Untracked:  data/gapminder_raw.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/04_map-over-multiple-input-objects.Rmd) and HTML (docs/04_map-over-multiple-input-objects.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 9848eb4 ogorodriguez 2020-05-03 Section 4 examples

Iteration with more than one object

map() can be used to iterate over more than one element of a list or over more than one column in a data frame. For that the purrr package has functions to work with iterations over two objects, the map2() function and the pmap()function if we need to iterate with 2 or more.

In this section we will work with an example that is going to create a list of plots to compare two of the columns of the gapminder data set: life expectancy and GDP for each continent/year combination.

The arguments of the map2() are the two objects to iterate over and the function that will combine them both.

map2(.x = object1, .y = object2, .f = name_of_function(.x, .y))

For this specific example I will define a vector list of continents and a paired vector or list of years to iterate through.

An important observation on how the iteration works. For example.

if .x = c(“Americas”, “Asia”) is my continent vector and, .y = c(1952, 2007) is my year vector; the iteration is not going to over all of the combinations possible with .x and .y. Meaning, it is not going to start pairing something like (Americas, 1952) and then (Americas, 2007), and then (Asia, 1952) and then (Asia, 2007).

The correct iteration will be first (America, 1952) then (Asia, 2007) only.

It goes over the list row by corresponding row, it seems.

In this case we will get our two element from the gapminder data set identified. One for the year and one for the continent.

gap_mod <- read_rds(here::here("data", "gap_mod.rds"))

continent_year <- gap_mod %>% distinct(continent, year)
continent_year
#> # A tibble: 60 x 2
#>    continent  year
#>    <fct>     <dbl>
#>  1 Asia       1952
#>  2 Asia       1957
#>  3 Asia       1962
#>  4 Asia       1967
#>  5 Asia       1972
#>  6 Asia       1977
#>  7 Asia       1982
#>  8 Asia       1987
#>  9 Asia       1992
#> 10 Asia       1997
#> # ... with 50 more rows
# extract the continent and year as separate vectors
continents <- continent_year %>% 
  pull(continent) %>% 
  as.character()

years <- continent_year %>% 
  pull(year)

To be consistent with the basic arguments, we will pass the previous vectors to .x and .y variables.

Before jumping into getting the corresponding graphs for all countries, we will try the code with one continent and one year to see the life expectancy of all the countries in that continent that given year.

.x <- continents[1]
.y <- years[1]

gap_mod %>% 
  filter(continent == .x,
         year == .y) %>% 
  ggplot() +
    geom_point(aes(x = gdpPercap, y = lifeExp)) + 
    ggtitle(glue::glue(.x, " ", .y))

This seems to work. I will use the log version of the gdpPercap variable to better see the differences between the countris with lower GDP per capita.

.x <- continents[1]
.y <- years[1]

gap_mod %>% 
  filter(continent == .x,
         year == .y) %>% 
  ggplot() +
    geom_point(aes(x = log(gdpPercap), y = lifeExp)) + 
    ggtitle(glue::glue(.x, " ", .y))

I will pass this code into a map2() function to gett all countries.

plot_list <- map2(.x = continents,
                  .y = years,
                  .f = ~ (
                    gap_mod %>% 
                      filter(continent == .x,
                             year == .y) %>% 
                      ggplot() +
                        geom_point(aes(x = log(gdpPercap), 
                                       y = lifeExp)) + 
                        ggtitle(glue::glue(.x, " ", .y))
                  ))

The result is the creation of a list of graphs saved in the variable plot_list. We can start pulling graphs by calling them from their corresponding arguments.

# GDP per capita in America in 1967
plot_list[[40]]

# GDP per capita in Europe in 1997
plot_list[[22]]

List columns and nested data frames

Tibbles are data frames where the columns can also be lists (not only vectors as in base R.)

A tibble can nested, which means that some columns can contain grouped objects such as other data frames o lists.

The next example will nest the gapminder dataset by continent.

gap_copy <- read_rds(here::here("data", "gap_copy.rds"))
gap_nest <- gap_copy %>% 
  group_by(continent) %>% 
  nest()

gap_nest
#> # A tibble: 5 x 2
#> # Groups:   continent [5]
#>   continent data              
#>   <chr>     <list>            
#> 1 Asia      <tibble [396 x 5]>
#> 2 Europe    <tibble [360 x 5]>
#> 3 Africa    <tibble [624 x 5]>
#> 4 Americas  <tibble [300 x 5]>
#> 5 Oceania   <tibble [24 x 5]>

In the previous result the first is the variable we grouped by, continent, and the second column is the rest of the data fram corresponding to that group.

We can see the fifth entry displayed in detail here (Ocenia):

gap_nest$data[[5]]
#> # A tibble: 24 x 5
#>    country    year      pop lifeExp gdpPercap
#>    <chr>     <dbl>    <dbl>   <dbl>     <dbl>
#>  1 Australia  1952  8691212    69.1    10040.
#>  2 Australia  1957  9712569    70.3    10950.
#>  3 Australia  1962 10794968    70.9    12217.
#>  4 Australia  1967 11872264    71.1    14526.
#>  5 Australia  1972 13177000    71.9    16789.
#>  6 Australia  1977 14074100    73.5    18334.
#>  7 Australia  1982 15184200    74.7    19477.
#>  8 Australia  1987 16257249    76.3    21889.
#>  9 Australia  1992 17481977    77.6    23425.
#> 10 Australia  1997 18565243    78.8    26998.
#> # ... with 14 more rows

The same can be achieved using purrr’s pluck() function.

# extract the fifth entry of the data column in gap_nest
gap_nest %>% 
  pluck("data", 5)
#> # A tibble: 24 x 5
#>    country    year      pop lifeExp gdpPercap
#>    <chr>     <dbl>    <dbl>   <dbl>     <dbl>
#>  1 Australia  1952  8691212    69.1    10040.
#>  2 Australia  1957  9712569    70.3    10950.
#>  3 Australia  1962 10794968    70.9    12217.
#>  4 Australia  1967 11872264    71.1    14526.
#>  5 Australia  1972 13177000    71.9    16789.
#>  6 Australia  1977 14074100    73.5    18334.
#>  7 Australia  1982 15184200    74.7    19477.
#>  8 Australia  1987 16257249    76.3    21889.
#>  9 Australia  1992 17481977    77.6    23425.
#> 10 Australia  1997 18565243    78.8    26998.
#> # ... with 14 more rows

Nesting data frames can offer the possibility to use dplyr on more complex objects, not simply mutate() and the like. mutate() work when the columns are simply vectors.

In order for mutate() to work with nested columns we need to use the map() function.

test_list <- tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = map(list_col, ~ sum(.x)))
test_list %>% 
  pluck("list_sum", 3)
#> [1] 31

Ir we want all sums as a list we can check:

test_list %>% 
  pull(list_sum) 
#> [[1]]
#> [1] 13
#> 
#> [[2]]
#> [1] 5
#> 
#> [[3]]
#> [1] 31

If we wanted the result to be a vector, we can pass map_dbl() instead.

test_dbl <- tibble(list_col = list(c(1, 5, 7), 
                       5, 
                       c(10, 10, 11))) %>%
  mutate(list_sum = map_dbl(list_col, sum))
test_dbl
#> # A tibble: 3 x 2
#>   list_col  list_sum
#>   <list>       <dbl>
#> 1 <dbl [3]>       13
#> 2 <dbl [1]>        5
#> 3 <dbl [3]>       31

Nesting the gap minder data. The LifeExp example

The idea is to calculate the average life expectancy within each continent and add it as a new column using mutate().

If we use mutate() the traditional way, this will not work:

gap_nest %>% 
  mutate(avg_life_exp = mean(data$lifeExp))
#> # A tibble: 5 x 3
#> # Groups:   continent [5]
#>   continent data               avg_life_exp
#>   <chr>     <list>                    <dbl>
#> 1 Asia      <tibble [396 x 5]>           NA
#> 2 Europe    <tibble [360 x 5]>           NA
#> 3 Africa    <tibble [624 x 5]>           NA
#> 4 Americas  <tibble [300 x 5]>           NA
#> 5 Oceania   <tibble [24 x 5]>            NA

The code will not extract the lifeExp column for each data frame. Applying mutate to a data collumn will result in such errors since the data column in gap_nest is a list of data frames. The way to access data in list columns is via map().

We will try doing the mean lifeExp of a single continent and then we will expand it to the rest.

.x will be an individual nested data frame of gap_nest, say the first one, “Asia.” The following code defines it.

# the first entry of the nested column "data" in gap_nest, which is Asia
.x <- gap_nest %>% 
  pluck("data", 1)

.x
#> # A tibble: 396 x 5
#>    country      year      pop lifeExp gdpPercap
#>    <chr>       <dbl>    <dbl>   <dbl>     <dbl>
#>  1 Afghanistan  1952  8425333    28.8      779.
#>  2 Afghanistan  1957  9240934    30.3      821.
#>  3 Afghanistan  1962 10267083    32.0      853.
#>  4 Afghanistan  1967 11537966    34.0      836.
#>  5 Afghanistan  1972 13079460    36.1      740.
#>  6 Afghanistan  1977 14880372    38.4      786.
#>  7 Afghanistan  1982 12881816    39.9      978.
#>  8 Afghanistan  1987 13867957    40.8      852.
#>  9 Afghanistan  1992 16317921    41.7      649.
#> 10 Afghanistan  1997 22227415    41.8      635.
#> # ... with 386 more rows

To calculate the mean life expectancy then will be.

mean(.x$lifeExp)
#> [1] 60.0649

Now we need to copy this into the tilde-dot anonymous function for the rest. We will show this as a double column.

gap_nest %>% 
  mutate(avg_lifeExp = map_dbl(data, ~ mean(.x$lifeExp)))
#> # A tibble: 5 x 3
#> # Groups:   continent [5]
#>   continent data               avg_lifeExp
#>   <chr>     <list>                   <dbl>
#> 1 Asia      <tibble [396 x 5]>        60.1
#> 2 Europe    <tibble [360 x 5]>        71.9
#> 3 Africa    <tibble [624 x 5]>        48.9
#> 4 Americas  <tibble [300 x 5]>        64.7
#> 5 Oceania   <tibble [24 x 5]>         74.3

Without the nested columns…

gap_nest %>% 
  mutate(avg_lifeExp = map_dbl(data, ~ mean(.x$lifeExp))) %>% 
  select(continent, avg_lifeExp)
#> # A tibble: 5 x 2
#> # Groups:   continent [5]
#>   continent avg_lifeExp
#>   <chr>           <dbl>
#> 1 Asia             60.1
#> 2 Europe           71.9
#> 3 Africa           48.9
#> 4 Americas         64.7
#> 5 Oceania          74.3

Using summarise() will remove the nested column…

gap_nest %>% 
  summarise(avg_lifeExp2 = map_dbl(data, ~ mean(.x$lifeExp)))
#> # A tibble: 5 x 2
#>   continent avg_lifeExp2
#>   <chr>            <dbl>
#> 1 Africa            48.9
#> 2 Americas          64.7
#> 3 Asia              60.1
#> 4 Europe            71.9
#> 5 Oceania           74.3

How to fit a model separately for each continent.

Fitting models using nested columns in data frames allows to run such models all within a single data frame and then extracting values from the model as we need to test, evaluate, compare, and predict.

Building a simple linear model we will see how lifeExp can be modeled by population and gdpPercap.

First, we will fit a linear model for each continent and save its result as a nested column.

Starting from a basic unit of one continent as .x, again I will choose “Asia”, the linear model function will be..

lm(lifeExp ~ pop + gdpPercap + year, data = .x)

or simply lm(lifeExp ~ ., data = .x)

Testing.

lm(lifeExp ~ ., data = .x) %>% broom::tidy()
#> # A tibble: 36 x 5
#>    term                   estimate std.error statistic   p.value
#>    <chr>                     <dbl>     <dbl>     <dbl>     <dbl>
#>  1 (Intercept)              -842.      20.1     -42.0  1.08e-140
#>  2 countryBahrain             28.2      1.32     21.3  7.91e- 66
#>  3 countryBangladesh          11.9      1.29      9.22 2.52e- 18
#>  4 countryCambodia            10.5      1.27      8.23 3.35e- 15
#>  5 countryChina               18.2      2.85      6.40 4.96e- 10
#>  6 countryHong Kong China     36.1      1.31     27.5  2.13e- 90
#>  7 countryIndia               11.3      2.25      5.01 8.52e-  7
#>  8 countryIndonesia           16.0      1.32     12.1  1.48e- 28
#>  9 countryIran                21.0      1.28     16.4  1.82e- 45
#> 10 countryIraq                19.1      1.28     14.9  1.36e- 39
#> # ... with 26 more rows

It does. Now let’s fit this into our nested df.

gap_lm <- gap_nest %>% 
  mutate(lm_lifeExp = map(data, ~ (lm(lifeExp ~ pop + gdpPercap + year, data = .x))))

gap_lm
#> # A tibble: 5 x 3
#> # Groups:   continent [5]
#>   continent data               lm_lifeExp
#>   <chr>     <list>             <list>    
#> 1 Asia      <tibble [396 x 5]> <lm>      
#> 2 Europe    <tibble [360 x 5]> <lm>      
#> 3 Africa    <tibble [624 x 5]> <lm>      
#> 4 Americas  <tibble [300 x 5]> <lm>      
#> 5 Oceania   <tibble [24 x 5]>  <lm>

To find the values of the model for Ocenia we use then…

gap_lm %>% 
  pluck("lm_lifeExp", 5)
#> 
#> Call:
#> lm(formula = lifeExp ~ pop + gdpPercap + year, data = .x)
#> 
#> Coefficients:
#> (Intercept)          pop    gdpPercap         year  
#>  -2.097e+02    8.365e-09    2.027e-04    1.415e-01

Imagine we want to predict now the response of the data stored in the data column using the corresponding linear model.

gap_pred <- gap_lm %>%
  mutate(pred = map2(lm_lifeExp, data, function(.lm, .data) predict(.lm, .data)))

gap_pred
#> # A tibble: 5 x 4
#> # Groups:   continent [5]
#>   continent data               lm_lifeExp pred       
#>   <chr>     <list>             <list>     <list>     
#> 1 Asia      <tibble [396 x 5]> <lm>       <dbl [396]>
#> 2 Europe    <tibble [360 x 5]> <lm>       <dbl [360]>
#> 3 Africa    <tibble [624 x 5]> <lm>       <dbl [624]>
#> 4 Americas  <tibble [300 x 5]> <lm>       <dbl [300]>
#> 5 Oceania   <tibble [24 x 5]>  <lm>       <dbl [24]>

We can then calculate the correlation between the predicted response and the actual lifeExp value to see if our model is good fit. We can use map2_dbl() since the correlation is one single value and it can be shown in the column.

gap_cor <- gap_pred %>% 
  mutate(cor = map2_dbl(pred, data, function(.pred, .data) cor(.pred, .data$lifeExp)))

gap_cor
#> # A tibble: 5 x 5
#> # Groups:   continent [5]
#>   continent data               lm_lifeExp pred          cor
#>   <chr>     <list>             <list>     <list>      <dbl>
#> 1 Asia      <tibble [396 x 5]> <lm>       <dbl [396]> 0.723
#> 2 Europe    <tibble [360 x 5]> <lm>       <dbl [360]> 0.834
#> 3 Africa    <tibble [624 x 5]> <lm>       <dbl [624]> 0.645
#> 4 Americas  <tibble [300 x 5]> <lm>       <dbl [300]> 0.779
#> 5 Oceania   <tibble [24 x 5]>  <lm>       <dbl [24]>  0.987

Advanced example.


sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5     purrr_0.3.3    
#> [5] readr_1.3.1     tidyr_1.0.2     tibble_3.0.0    tidyverse_1.3.0
#> [9] ggplot2_3.3.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.0.0 xfun_0.12        haven_2.2.0      lattice_0.20-40 
#>  [5] colorspace_1.4-1 vctrs_0.2.4      generics_0.0.2   htmltools_0.4.0 
#>  [9] yaml_2.2.1       utf8_1.1.4       rlang_0.4.5      later_1.0.0     
#> [13] pillar_1.4.3     glue_1.4.0       withr_2.1.2      DBI_1.1.0       
#> [17] dbplyr_1.4.2     readxl_1.3.1     modelr_0.1.6     lifecycle_0.2.0 
#> [21] cellranger_1.1.0 munsell_0.5.0    gtable_0.3.0     workflowr_1.6.2 
#> [25] rvest_0.3.5      evaluate_0.14    labeling_0.3     knitr_1.28      
#> [29] httpuv_1.5.2     fansi_0.4.0      broom_0.5.5      Rcpp_1.0.4.6    
#> [33] promises_1.1.0   scales_1.1.0     backports_1.1.6  jsonlite_1.6.1  
#> [37] farver_2.0.3     fs_1.4.1         hms_0.5.3        digest_0.6.25   
#> [41] stringi_1.4.6    grid_3.6.1       rprojroot_1.3-2  here_0.1        
#> [45] cli_2.0.2        tools_3.6.1      magrittr_1.5     crayon_1.3.4    
#> [49] whisker_0.4      pkgconfig_2.0.3  ellipsis_0.3.0   xml2_1.3.1      
#> [53] reprex_0.3.0     lubridate_1.7.8  rstudioapi_0.11  assertthat_0.2.1
#> [57] rmarkdown_2.1    httr_1.4.1       R6_2.4.1         nlme_3.1-144    
#> [61] git2r_0.26.1     compiler_3.6.1