map() Gapminder data example

Last updated: 2020-05-03


Knit directory: 033_purrr_learning/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20200501)

The command set.seed(20200501) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 9848eb4

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9848eb4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  data/gap_copy.rds
    Untracked:  data/gap_mod.rds
    Untracked:  data/gapminder_raw.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/03_the-map-gapminder-example.Rmd) and HTML (docs/03_the-map-gapminder-example.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	9848eb4	ogorodriguez	2020-05-03	Section 4 examples
html	147739b	ogorodriguez	2020-05-02	Build site.
Rmd	68350cf	ogorodriguez	2020-05-02	Gampminder Examples from simple and complex

The Gapminder data example

To explore the purrr package in depth, the author of the reference website uses the gapminder dataset, which records population, life expectancy, and GDP per capita by country and year.

The idea is to start with a simple example and build up to a more complex one.

The proposed workflow is to download the data and save it under a meaningful name that marks it as the raw data. I will use the suffix _raw, as opposed to the one used on the reference website. Then a copy of the data set is created so that any modification or data munging is done on that copy.

The raw data file is saved in the data/ folder:

library(tidyverse)

# Download the data directly from the internet
write_csv(read_csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv"), here::here("data", "gapminder_raw.csv"))

# Let's create the working copy
gap_copy <- read_csv(here::here("data", "gapminder_raw.csv"))
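The raw-then-copy workflow can be sketched on a small stand-in table. This is a minimal sketch rather than the original workflow: the toy data, the temporary folder, and the names toy_raw, raw_path, and toy_copy are assumptions made for illustration, and the raw file is only written when it does not exist yet.

```r
library(readr)
library(tibble)

# Hypothetical stand-in for the downloaded table (not the real gapminder data)
toy_raw <- tibble(country = c("A", "A", "B"),
                  year    = c(1952, 1957, 1952),
                  pop     = c(100, 120, 300))

# A temporary folder plays the role of the project's data/ folder
raw_path <- file.path(tempdir(), "toy_raw.csv")

# Write the raw file only once, so repeated runs do not overwrite it
if (!file.exists(raw_path)) {
  write_csv(toy_raw, raw_path)
}

# All later modifications are made to this working copy, never to the raw file
toy_copy <- read_csv(raw_path, col_types = cols())
```

Guarding the write with file.exists() is one possible design choice: it keeps the raw file untouched across repeated knits and avoids repeating a download.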

Now that we have our working copy of the gapminder data set, we can look at its dimensions.

gap_copy %>% dim()
#> [1] 1704    6

The file has 1704 rows and 6 columns.

Let’s take a glimpse of the data set:

gap_copy %>% 
  glimpse()
#> Rows: 1,704
#> Columns: 6
#> $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan...
#> $ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
#> $ pop       <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
#> $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "...
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...

Two columns are of type character (which I think should be factors) and the other four are numeric.

The skim() function from the skimr package gives a fuller summary of each variable, including the number of unique values in the character columns.

gap_copy %>% 
  skimr::skim()

Data summary
Name	Piped data
Number of rows	1704
Number of columns	6
_______________________
Column type frequency:
character	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	4	24	0	142	0
continent	0	1	4	8	0	5	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	1979.50	17.27	1952.00	1965.75	1979.50	1993.25	2007.0	▇▅▅▅▇
pop	1	29601212.33	106157896.75	60011.00	2793664.00	7023595.50	19585221.75	1318683096.0	▇▁▁▁▁
lifeExp	1	59.47	12.92	23.60	48.20	60.71	70.85	82.6	▁▆▇▇▇
gdpPercap	1	7215.33	9857.45	241.17	1202.06	3531.85	9325.46	113523.1	▇▁▁▁▁

The gapminder data set is widely used and carefully maintained. There are no NAs. It is tidy data, since every column corresponds to a variable and every row corresponds to an observation.

gap_copy %>% 
  head(10)
#> # A tibble: 10 x 6
#>    country      year      pop continent lifeExp gdpPercap
#>    <chr>       <dbl>    <dbl> <chr>       <dbl>     <dbl>
#>  1 Afghanistan  1952  8425333 Asia         28.8      779.
#>  2 Afghanistan  1957  9240934 Asia         30.3      821.
#>  3 Afghanistan  1962 10267083 Asia         32.0      853.
#>  4 Afghanistan  1967 11537966 Asia         34.0      836.
#>  5 Afghanistan  1972 13079460 Asia         36.1      740.
#>  6 Afghanistan  1977 14880372 Asia         38.4      786.
#>  7 Afghanistan  1982 12881816 Asia         39.9      978.
#>  8 Afghanistan  1987 13867957 Asia         40.8      852.
#>  9 Afghanistan  1992 16317921 Asia         41.7      649.
#> 10 Afghanistan  1997 22227415 Asia         41.8      635.

Because a data frame is a list of columns, map() iterates over every column when it is applied to one.
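The column-wise behaviour can be checked on a small table. This is a minimal sketch; the toy tibble and the name first_values are assumptions for illustration and stand in for gap_copy.

```r
library(purrr)
library(tibble)

# Hypothetical toy table standing in for gap_copy
toy <- tibble(country = c("A", "A", "B"),
              year    = c(1952, 1957, 1952))

# A data frame is a list with one element per column
is.list(toy)
#> [1] TRUE

# map() therefore calls the function once per column and returns a list
first_values <- map(toy, ~ .x[1])
names(first_values)
#> [1] "country" "year"
```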

Identifying the types of each column

A simple example is to extract the type (class) of each column into a single object, such as a vector or a data frame.

# as a data frame
gap_copy %>% 
  map_df(~ tibble(class = class(.x)))
#> # A tibble: 6 x 1
#>   class    
#>   <chr>    
#> 1 character
#> 2 numeric  
#> 3 numeric  
#> 4 character
#> 5 numeric  
#> 6 numeric

Some columns are of type character, but they should be factors. We can convert them with modify_if().

# Convert only the character columns into factors
gap_mod <- modify_if(.x = gap_copy,
                     .p = is.character,
                     .f = as_factor)

gap_mod %>% 
  glimpse()
#> Rows: 1,704
#> Columns: 6
#> $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
#> $ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
#> $ pop       <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
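A small comparison may help show why modify_if() suits this conversion. This is a minimal sketch on a hypothetical toy table (toy, toy_mod, and toy_list are illustrative names): modify_if() keeps the input's type, while map_if() applies the same conversion but returns a plain list.

```r
library(purrr)
library(tibble)
library(forcats)

# Hypothetical toy table standing in for gap_copy
toy <- tibble(country = c("B", "A", "B"),
              pop     = c(300, 100, 320))

# modify_if() returns an object of the same type as its input: still a tibble
toy_mod <- modify_if(toy, is.character, as_factor)
is_tibble(toy_mod)
#> [1] TRUE

# map_if() applies the same conversion but returns a plain list
toy_list <- map_if(toy, is.character, as_factor)
is.data.frame(toy_list)
#> [1] FALSE

# as_factor() keeps levels in order of first appearance,
# whereas base as.factor() sorts them alphabetically
levels(toy_mod$country)
#> [1] "B" "A"
```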

Now let’s extract the column types again, this time from the modified copy.

gap_mod %>% 
  map_chr(class)
#>   country      year       pop continent   lifeExp gdpPercap 
#>  "factor" "numeric" "numeric"  "factor" "numeric" "numeric"
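One caveat with map_chr(class) is that class() can return more than one value for some columns (an ordered factor, for example), in which case map_chr() raises an error. The following is a minimal sketch of one possible workaround, collapsing the classes into a single string. The toy table and the helper name class_label are assumptions for illustration.

```r
library(purrr)
library(tibble)

# Hypothetical toy table with an ordered factor column
toy <- tibble(
  level = factor(c("low", "high"), levels = c("low", "high"), ordered = TRUE),
  value = c(1.5, 2.5)
)

# class() returns two values for an ordered factor
class(toy$level)
#> [1] "ordered" "factor"

# Hypothetical helper: collapse all classes into one string per column
class_label <- function(x) paste(class(x), collapse = "/")

# Each column now yields exactly one string, as map_chr() requires
col_classes <- map_chr(toy, class_label)
col_classes
```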

Getting the distinct values of each column

To count the distinct values in each column, we can use map_dbl() with n_distinct(), since the result is a count and therefore numeric.

gap_mod %>% 
  map_dbl(n_distinct)
#>   country      year       pop continent   lifeExp gdpPercap 
#>       142        12      1704         5      1626      1704
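Since n_distinct() returns an integer, map_int() is an alternative to map_dbl() here. This is a minimal sketch on a hypothetical toy table; the names toy, counts_int, and counts_dbl are illustrative.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical toy table standing in for gap_mod
toy <- tibble(country = c("A", "A", "B"),
              year    = c(1952, 1957, 1952))

# n_distinct() returns an integer, so map_int() is the stricter fit
counts_int <- map_int(toy, n_distinct)

# map_dbl() also works because an integer converts safely to a double
counts_dbl <- map_dbl(toy, n_distinct)

typeof(counts_int)
#> [1] "integer"
typeof(counts_dbl)
#> [1] "double"
```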

These results are vectors. It would be easier to read, and more convenient for further processing, if this information were presented as a data frame. I tried to do that in the class() example but was unable to.

This complicates things a bit. To get a data frame with meaningful column names as the result of a map() operation, we need to pass an anonymous function that builds a tibble for each column.

For example, the following function calculates the number of distinct entries and the type of the current column.

gap_mod %>% 
  map_df(~ tibble(n_distinct = n_distinct(.x),
                  class = class(.x)))
#> # A tibble: 6 x 2
#>   n_distinct class  
#>        <int> <chr>  
#> 1        142 factor 
#> 2         12 numeric
#> 3       1704 numeric
#> 4          5 factor 
#> 5       1626 numeric
#> 6       1704 numeric

Next, we want to add the name of each column as another column in the previous data frame. I tried doing that with the colnames() and names() functions but got an undesired result. To solve this, the .id argument of map_df() has to be used. The .id argument adds the name of the element being iterated over as a column in the output.

gap_mod %>% 
  map_df(~ tibble(n_distinct = n_distinct(.x),
                  class = class(.x)),
         .id = "variable")
#> # A tibble: 6 x 3
#>   variable  n_distinct class  
#>   <chr>          <int> <chr>  
#> 1 country          142 factor 
#> 2 year              12 numeric
#> 3 pop             1704 numeric
#> 4 continent          5 factor 
#> 5 lifeExp         1626 numeric
#> 6 gdpPercap       1704 numeric
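For readers on a newer purrr, the same table can be built with map() followed by list_rbind(), where names_to plays the role of .id. This is a minimal sketch that assumes purrr 1.0.0 or later (the session recorded on this page uses purrr 0.3.3, where list_rbind() is not available). The toy table and the name summary_tbl are illustrative.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical toy table standing in for gap_mod
toy <- tibble(country = c("A", "A", "B"),
              year    = c(1952, 1957, 1952))

# Build one single-row tibble per column, then stack them;
# names_to stores the column names, as .id does in map_df()
summary_tbl <- toy %>%
  map(~ tibble(n_distinct = n_distinct(.x),
               class = class(.x))) %>%
  list_rbind(names_to = "variable")
```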

Working from the simple to the complex

One aid to understanding how map() works is to first figure out what output we want from a single piece of the data. Once we obtain the desired result, we can insert that code into our map() function to apply it to the whole data frame.

For example, I am going to extract the first column of the gapminder data frame. I am going to name it .x so that it is convenient to copy and paste the code into my map_df() later. I will be using the gap_copy data set.

.x <- gap_copy %>% pluck(1)

Let’s preview it

.x %>% head()
#> [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
#> [6] "Afghanistan"

Now, I want to create a data frame that shows the number of distinct values in this column together with its class.

tibble(n_distinct = n_distinct(.x),
       class = class(.x))
#> # A tibble: 1 x 2
#>   n_distinct class    
#>        <int> <chr>    
#> 1        142 character

The result is the desired one. The idea now is to plug the previous code into map_df() to iterate over the whole gap_copy data frame.

gap_copy %>% 
  map_df(~ tibble(n_distinct = n_distinct(.x),
       class = class(.x)),
       .id = "variable")
#> # A tibble: 6 x 3
#>   variable  n_distinct class    
#>   <chr>          <int> <chr>    
#> 1 country          142 character
#> 2 year              12 numeric  
#> 3 pop             1704 numeric  
#> 4 continent          5 character
#> 5 lifeExp         1626 numeric  
#> 6 gdpPercap       1704 numeric
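The same simple-to-complex approach can also be written with a named function instead of a variable called .x. This is a minimal sketch, not the method used above: the helper name summarise_column and the toy table are assumptions for illustration.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical helper wrapping the single-column code
summarise_column <- function(x) {
  tibble(n_distinct = n_distinct(x),
         class = class(x))
}

# Hypothetical toy table standing in for gap_copy
toy <- tibble(country = c("A", "A", "B"),
              year    = c(1952, 1957, 1952))

# Step 1: check the helper on one column
one_column <- summarise_column(pluck(toy, 1))

# Step 2: apply the same helper to every column
all_columns <- map_df(toy, summarise_column, .id = "variable")
```

A named helper can be tested on its own and reused, at the cost of one more object in the environment.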

Now we will save the new datasets in our system.

write_rds(gap_copy, here::here("data", "gap_copy.rds"))
write_rds(gap_mod, here::here("data", "gap_mod.rds"))
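One reason to save in .rds format is that it preserves column types such as factors, which a CSV round trip does not. This is a minimal sketch on a hypothetical toy table written to a temporary folder; the names toy_mod, rds_path, csv_path, from_rds, and from_csv are illustrative.

```r
library(readr)
library(tibble)

# Hypothetical toy table with a factor column, standing in for gap_mod
toy_mod <- tibble(country = factor(c("B", "A", "B"), levels = c("B", "A")),
                  pop     = c(300, 100, 320))

rds_path <- file.path(tempdir(), "toy_mod.rds")
csv_path <- file.path(tempdir(), "toy_mod.csv")

write_rds(toy_mod, rds_path)
write_csv(toy_mod, csv_path)

from_rds <- read_rds(rds_path)
from_csv <- read_csv(csv_path, col_types = cols())

# The .rds round trip keeps the factor and its level order
levels(from_rds$country)
#> [1] "B" "A"

# The CSV round trip reads the column back as character
class(from_csv$country)
#> [1] "character"
```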

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5     purrr_0.3.3    
#> [5] readr_1.3.1     tidyr_1.0.2     tibble_3.0.0    tidyverse_1.3.0
#> [9] ggplot2_3.3.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6     lubridate_1.7.8  here_0.1         lattice_0.20-40 
#>  [5] assertthat_0.2.1 rprojroot_1.3-2  digest_0.6.25    utf8_1.1.4      
#>  [9] R6_2.4.1         cellranger_1.1.0 repr_1.1.0       backports_1.1.6 
#> [13] reprex_0.3.0     evaluate_0.14    highr_0.8        httr_1.4.1      
#> [17] pillar_1.4.3     rlang_0.4.5      curl_4.3         readxl_1.3.1    
#> [21] rstudioapi_0.11  whisker_0.4      rmarkdown_2.1    munsell_0.5.0   
#> [25] broom_0.5.5      compiler_3.6.1   httpuv_1.5.2     modelr_0.1.6    
#> [29] xfun_0.12        base64enc_0.1-3  pkgconfig_2.0.3  htmltools_0.4.0 
#> [33] tidyselect_1.0.0 workflowr_1.6.2  fansi_0.4.0      crayon_1.3.4    
#> [37] dbplyr_1.4.2     withr_2.1.2      later_1.0.0      grid_3.6.1      
#> [41] nlme_3.1-144     jsonlite_1.6.1   gtable_0.3.0     lifecycle_0.2.0 
#> [45] DBI_1.1.0        git2r_0.26.1     magrittr_1.5     scales_1.1.0    
#> [49] cli_2.0.2        stringi_1.4.6    fs_1.4.1         promises_1.1.0  
#> [53] skimr_2.1        xml2_1.3.1       ellipsis_0.3.0   generics_0.0.2  
#> [57] vctrs_0.2.4      tools_3.6.1      glue_1.4.0       hms_0.5.3       
#> [61] yaml_2.2.1       colorspace_1.4-1 rvest_0.3.5      knitr_1.28      
#> [65] haven_2.2.0

ogorodriguez

2020-05-01
