Last updated: 2020-05-03
Knit directory: 033_purrr_learning/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The command set.seed(20200501) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
The results in this page were generated with repository version 9848eb4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Untracked files:
Untracked: data/gap_copy.rds
Untracked: data/gap_mod.rds
Untracked: data/gapminder_raw.csv
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/03_the-map-gapminder-example.Rmd) and HTML (docs/03_the-map-gapminder-example.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 9848eb4 | ogorodriguez | 2020-05-03 | Section 4 examples |
html | 147739b | ogorodriguez | 2020-05-02 | Build site. |
Rmd | 68350cf | ogorodriguez | 2020-05-02 | Gampminder Examples from simple and complex |
To explore the functionality of the purrr package in depth, the author used the gapminder data set, which collects demographic and economic information (population, life expectancy and GDP per capita) for 142 countries.
The idea is to start with a simple example and then move on to a more complex one.
The proposed workflow is to download the data and save it under a meaningful name that marks it as the raw data. I will use the suffix _raw, as opposed to the one used in the reference website. Then a working copy of the data set is created, so that any modification or data munging is done on that copy.
The raw data file will be saved into the /data folder.
# Load the tidyverse, which provides readr, purrr, dplyr, tibble and forcats
library(tidyverse)
# Download the data directly from the internet and store it as the raw file
write_csv(read_csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv"), here::here("data", "gapminder_raw.csv"))
# Create the working copy from the stored raw file
gap_copy <- read_csv(here::here("data", "gapminder_raw.csv"))
Now that we have our working copy of the gapminder data set, we can look at a basic description of it.
gap_copy %>% dim()
#> [1] 1704 6
The file has 1704 rows and 6 columns.
Let’s take a glimpse of the data set
gap_copy %>%
glimpse()
#> Rows: 1,704
#> Columns: 6
#> $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan...
#> $ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
#> $ pop <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
#> $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "...
#> $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
We can see that some columns are of type character (which I think should be factors) and the rest are numeric.
The skim() function from the skimr package will help us see more detail on each variable, including the character ones.
gap_copy %>%
skimr::skim()
Summary item | Value |
---|---|
Name | Piped data |
Number of rows | 1704 |
Number of columns | 6 |
Column type frequency: character | 2 |
Column type frequency: numeric | 4 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 4 | 24 | 0 | 142 | 0 |
continent | 0 | 1 | 4 | 8 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 1979.50 | 17.27 | 1952.00 | 1965.75 | 1979.50 | 1993.25 | 2007.0 | ▇▅▅▅▇ |
pop | 0 | 1 | 29601212.33 | 106157896.75 | 60011.00 | 2793664.00 | 7023595.50 | 19585221.75 | 1318683096.0 | ▇▁▁▁▁ |
lifeExp | 0 | 1 | 59.47 | 12.92 | 23.60 | 48.20 | 60.71 | 70.85 | 82.6 | ▁▆▇▇▇ |
gdpPercap | 0 | 1 | 7215.33 | 9857.45 | 241.17 | 1202.06 | 3531.85 | 9325.46 | 113523.1 | ▇▁▁▁▁ |
The gapminder data set is widely known and carefully maintained. There are no NAs (n_missing is 0 for every column). It is tidy data, since every column corresponds to a variable and every row corresponds to an observation.
gap_copy %>%
head(10)
#> # A tibble: 10 x 6
#> country year pop continent lifeExp gdpPercap
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 Afghanistan 1952 8425333 Asia 28.8 779.
#> 2 Afghanistan 1957 9240934 Asia 30.3 821.
#> 3 Afghanistan 1962 10267083 Asia 32.0 853.
#> 4 Afghanistan 1967 11537966 Asia 34.0 836.
#> 5 Afghanistan 1972 13079460 Asia 36.1 740.
#> 6 Afghanistan 1977 14880372 Asia 38.4 786.
#> 7 Afghanistan 1982 12881816 Asia 39.9 978.
#> 8 Afghanistan 1987 13867957 Asia 40.8 852.
#> 9 Afghanistan 1992 16317921 Asia 41.7 649.
#> 10 Afghanistan 1997 22227415 Asia 41.8 635.
A data frame is a list of columns, so when map() is applied to one, the function iterates over every column.
One simple example is to extract the type or class of each column into a single object (a vector or a data frame, for example).
# as a data frame
gap_copy %>%
map_df(~ tibble(class = class(.x)))
#> # A tibble: 6 x 1
#> class
#> <chr>
#> 1 character
#> 2 numeric
#> 3 numeric
#> 4 character
#> 5 numeric
#> 6 numeric
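As a small illustration of why map() walks over columns, the sketch below compares map(), which always returns a list, with map_chr(), which returns a character vector. It uses a made-up three-row tibble: toy_df and its values are hypothetical stand-ins, not part of the gapminder data.

```r
library(purrr)
library(tibble)

# Hypothetical toy data standing in for gap_copy
toy_df <- tibble(
  country = c("A", "B", "C"),
  year    = c(1952, 1957, 1962),
  pop     = c(10, 20, 30)
)

# map() returns a list with one element per column
class_list <- map(toy_df, class)

# map_chr() returns a named character vector instead
class_vec <- map_chr(toy_df, class)

print(class_list)
print(class_vec)
```

Note that map_chr() only works here because every column has a single class. A column with several classes (a date-time, for instance) would make it fail, which is one reason to wrap the result in a tibble instead.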
Some columns are of type character, but they should be factors. We can do the conversion with modify_if(), the conditional variant of modify().
# Convert only the character columns into factors
gap_mod <- modify_if(.x = gap_copy,
                     .p = is.character,
                     .f = as_factor)
gap_mod %>%
glimpse()
#> Rows: 1,704
#> Columns: 6
#> $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
#> $ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
#> $ pop <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
#> $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
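The choice of modify_if() over map_if() matters for the shape of the result. The following sketch, again on a hypothetical toy tibble (toy_df is an assumption for illustration only), shows that modify_if() returns the same kind of object it was given, while map_if() returns a plain list.

```r
library(purrr)
library(tibble)
library(forcats)

# Hypothetical toy data with one character and one numeric column
toy_df <- tibble(
  continent = c("Asia", "Europe", "Asia"),
  lifeExp   = c(28.8, 55.2, 30.3)
)

# modify_if() keeps the container type: a tibble in, a tibble out
as_tibble_result <- modify_if(toy_df, is.character, as_factor)

# map_if() applies the same conversion but returns a plain list
as_list_result <- map_if(toy_df, is.character, as_factor)

print(class(as_tibble_result))
print(class(as_list_result))
print(levels(as_tibble_result$continent))
```

as_factor() from forcats creates the levels in order of first appearance, whereas base factor() sorts them alphabetically; for this data either would serve.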
Now let’s extract the column types again, this time from the modified copy.
gap_mod %>%
  map_chr(class)
#> country year pop continent lifeExp gdpPercap
#> "factor" "numeric" "numeric" "factor" "numeric" "numeric"
To count the distinct values in each column, we can use map_dbl() with n_distinct(), since a count is numeric and map_dbl() returns a numeric vector.
gap_mod %>%
map_dbl(n_distinct)
#> country year pop continent lifeExp gdpPercap
#> 142 12 1704 5 1626 1704
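Since n_distinct() returns an integer, map_int() is an equally valid choice here. The sketch below, on a hypothetical toy tibble (toy_df is an assumption for illustration), shows that the two variants give the same counts and differ only in the storage type of the result.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical toy data
toy_df <- tibble(
  continent = c("Asia", "Europe", "Asia"),
  year      = c(1952, 1952, 1957)
)

# Same counts, stored as double or as integer
counts_dbl <- map_dbl(toy_df, n_distinct)
counts_int <- map_int(toy_df, n_distinct)

print(typeof(counts_dbl))
print(typeof(counts_int))
```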
These results are vectors. The same information would be easier to read, and more convenient to work with further, if it were presented as a data frame. I tried doing that in the class() example but was unable to.
Getting there is a bit more involved. It seems that to obtain a data frame from a map() operation (one with meaningful column names), we need to pass an anonymous function that builds a one-row tibble for each column.
For example, the following anonymous function computes the number of distinct entries and the class of the current column.
gap_mod %>%
  map_df(~ tibble(n_distinct = n_distinct(.x),
                  class = class(.x)))
#> # A tibble: 6 x 2
#> n_distinct class
#> <int> <chr>
#> 1 142 factor
#> 2 12 numeric
#> 3 1704 numeric
#> 4 5 factor
#> 5 1626 numeric
#> 6 1704 numeric
Next, we want to add the name of each column as another column in the previous data frame. I tried doing that with the colnames() or names() functions, but got an undesired result. To solve this, the .id argument of map_df() has to be used. The .id argument stores the name of the element being iterated over in a column of the output.
gap_mod %>%
  map_df(~ tibble(n_distinct = n_distinct(.x),
                  class = class(.x)),
         .id = "variable")
#> # A tibble: 6 x 3
#> variable n_distinct class
#> <chr> <int> <chr>
#> 1 country 142 factor
#> 2 year 12 numeric
#> 3 pop 1704 numeric
#> 4 continent 5 factor
#> 5 lifeExp 1626 numeric
#> 6 gdpPercap 1704 numeric
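For comparison, here is a sketch of a second way to get at the column names: imap_dfr(), which hands the function both the element (.x) and its name (.y). This is an alternative shown for illustration, not the approach used in this analysis, and toy_df is a hypothetical stand-in for the gapminder data.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical toy data
toy_df <- tibble(
  continent = c("Asia", "Europe", "Asia"),
  year      = c(1952, 1952, 1957)
)

# Approach from the text: .id stores each element's name in a new column
with_id <- map_df(toy_df,
                  ~ tibble(n_distinct = n_distinct(.x), class = class(.x)),
                  .id = "variable")

# Alternative: imap_dfr() passes the element as .x and its name as .y
with_imap <- imap_dfr(toy_df,
                      ~ tibble(variable = .y,
                               n_distinct = n_distinct(.x),
                               class = class(.x)))

print(with_id)
print(with_imap)
```

The .id version is shorter; the imap_dfr() version is handy when the name is also needed inside the function, for example to build a label.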
One aid to understanding how map() works is to first figure out the output we want using a single piece of the data. Once we obtain the desired result, we can insert that expression into our map() function so that it works over the whole data frame.
For example, I am going to extract the first column of the gapminder data frame. I am going to name it .x so that the code can later be copied and pasted into map_df() unchanged. I will be using the gap_copy data set.
.x <- gap_copy %>% pluck(1)
Let’s preview it
.x %>% head()
#> [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
#> [6] "Afghanistan"
Now, I want to create a data frame that shows the number of distinct values in this column together with its class.
tibble(n_distinct = n_distinct(.x),
       class = class(.x))
#> # A tibble: 1 x 2
#> n_distinct class
#> <int> <chr>
#> 1 142 character
The result is the desired one. The idea now is to plug the previous expression into map_df() to iterate over the whole gap_copy data frame.
gap_copy %>%
  map_df(~ tibble(n_distinct = n_distinct(.x),
                  class = class(.x)),
         .id = "variable")
#> # A tibble: 6 x 3
#> variable n_distinct class
#> <chr> <int> <chr>
#> 1 country 142 character
#> 2 year 12 numeric
#> 3 pop 1704 numeric
#> 4 continent 5 character
#> 5 lifeExp 1626 numeric
#> 6 gdpPercap 1704 numeric
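The same prototype-then-generalise workflow can also be sketched with a named function instead of an object called .x. The helper summarise_column() and the toy_df data below are hypothetical names introduced only for this illustration.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical toy data
toy_df <- tibble(
  continent = c("Asia", "Europe", "Asia"),
  year      = c(1952, 1952, 1957)
)

# Hypothetical helper: the prototype expression wrapped in a named function
summarise_column <- function(col) {
  tibble(n_distinct = n_distinct(col), class = class(col))
}

# Step 1: check the helper on a single column
one_column <- summarise_column(pluck(toy_df, 1))

# Step 2: apply the same helper to every column
all_columns <- map_df(toy_df, summarise_column, .id = "variable")

print(one_column)
print(all_columns)
```

A named function avoids leaving an object called .x in the global environment, at the cost of not being able to paste the prototype into the formula unchanged.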
Now we will save the new datasets in our system.
write_rds(gap_copy, here::here("data", "gap_copy.rds"))
write_rds(gap_mod, here::here("data", "gap_mod.rds"))
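One practical reason to save these objects as .rds rather than .csv (an assumption about the motivation; the text does not state it) is that .rds keeps R-specific column types such as factors. The sketch below demonstrates this with a hypothetical toy tibble and temporary files in place of the project's data folder.

```r
library(readr)
library(tibble)

# Hypothetical toy data with a factor column, standing in for gap_mod
toy_mod <- tibble(
  continent = factor(c("Asia", "Europe", "Asia")),
  lifeExp   = c(28.8, 55.2, 30.3)
)

# Temporary files stand in for the project's data/ folder
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

write_rds(toy_mod, rds_path)
write_csv(toy_mod, csv_path)

# Reading back: the factor survives the .rds round trip only
from_rds <- read_rds(rds_path)
from_csv <- read_csv(csv_path)

print(class(from_rds$continent))
print(class(from_csv$continent))
```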
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
#> [5] LC_TIME=Spanish_Spain.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5 purrr_0.3.3
#> [5] readr_1.3.1 tidyr_1.0.2 tibble_3.0.0 tidyverse_1.3.0
#> [9] ggplot2_3.3.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 lubridate_1.7.8 here_0.1 lattice_0.20-40
#> [5] assertthat_0.2.1 rprojroot_1.3-2 digest_0.6.25 utf8_1.1.4
#> [9] R6_2.4.1 cellranger_1.1.0 repr_1.1.0 backports_1.1.6
#> [13] reprex_0.3.0 evaluate_0.14 highr_0.8 httr_1.4.1
#> [17] pillar_1.4.3 rlang_0.4.5 curl_4.3 readxl_1.3.1
#> [21] rstudioapi_0.11 whisker_0.4 rmarkdown_2.1 munsell_0.5.0
#> [25] broom_0.5.5 compiler_3.6.1 httpuv_1.5.2 modelr_0.1.6
#> [29] xfun_0.12 base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.4.0
#> [33] tidyselect_1.0.0 workflowr_1.6.2 fansi_0.4.0 crayon_1.3.4
#> [37] dbplyr_1.4.2 withr_2.1.2 later_1.0.0 grid_3.6.1
#> [41] nlme_3.1-144 jsonlite_1.6.1 gtable_0.3.0 lifecycle_0.2.0
#> [45] DBI_1.1.0 git2r_0.26.1 magrittr_1.5 scales_1.1.0
#> [49] cli_2.0.2 stringi_1.4.6 fs_1.4.1 promises_1.1.0
#> [53] skimr_2.1 xml2_1.3.1 ellipsis_0.3.0 generics_0.0.2
#> [57] vctrs_0.2.4 tools_3.6.1 glue_1.4.0 hms_0.5.3
#> [61] yaml_2.2.1 colorspace_1.4-1 rvest_0.3.5 knitr_1.28
#> [65] haven_2.2.0