8 Advanced Operations

In this chapter, we will cover a few more advanced, yet incredibly useful data tidying operations like grouping, joining, binding, and pivoting. Along the way, we will also make extensive use of dplyr functions learned in the previous chapter.

8.1 Grouping

Often, you will need to apply dplyr’s various operations like mutate(), summarize(), or slicing function not across the entire dataset but in groups. This is an important technique across data science, whether it’s data cleaning to exploration to visualization to modeling.

By default, data frames are not grouped when created or imported. You can create a grouping structure with the group_by() function. The basic syntax is df %>% group_by(col1, col2, ...) where col1, col2, … are variables whose values are used to determine groups. You can group by just 1 variable, 2 variables, or as many variables as needed. Rows with the same values in the chosen columns will be grouped together.

After grouping, operations that normally run across all rows now run across each group. Here’s a few simple examples using the familiar penguins dataset to start:

# import tidyverse, import magrittr (for extra %T>% pipe),
# tweak some readr/ggplot options (optional), and load dataset
library(tidyverse)
library(magrittr)
options(readr.show_col_types = FALSE)
source("https://bwu62.github.io/stat240-revamp/ggplot_theme_options.R")
penguins <- read_csv("https://bwu62.github.io/stat240-revamp/data/penguins.csv")
# group by species and get mean/median/sd body mass + sample size of each group
penguins %>%
  group_by(species) %>%
  summarize(
    mean_mass   = mean(body_mass_g),
    median_mass = median(body_mass_g),
    sd_mass     = sd(body_mass_g),
    n           = n()
  )
# A tibble: 3 × 5
  species   mean_mass median_mass sd_mass     n
  <chr>         <dbl>       <dbl>   <dbl> <int>
1 Adelie        3706.        3700    459.   146
2 Chinstrap     3733.        3700    384.    68
3 Gentoo        5092.        5050    501.   119
# we can also group by multiple, e.g. group by spcies + sex
penguins %>%
  group_by(species, sex) %>%
  summarize(
    mean_mass   = mean(body_mass_g),
    median_mass = median(body_mass_g),
    sd_mass     = sd(body_mass_g),
    n           = n()
  )
# A tibble: 6 × 6
# Groups:   species [3]
  species   sex    mean_mass median_mass sd_mass     n
  <chr>     <chr>      <dbl>       <dbl>   <dbl> <int>
1 Adelie    female     3369.        3400    269.    73
2 Adelie    male       4043.        4000    347.    73
3 Chinstrap female     3527.        3550    285.    34
4 Chinstrap male       3939.        3950    362.    34
5 Gentoo    female     4680.        4700    282.    58
6 Gentoo    male       5485.        5500    313.    61
# we can of course also mutate within groups
# e.g. convert bill length in mm to number of SDs
#      away from species group mean
# to better show the result, I'm forcing it to print all rows in order
# but hiding the output in a collapsible box for style
penguins %>%
  select(species, sex, bill_length_mm) %>%
  group_by(species) %>%
  mutate(
    n = n(),
    bill_length_std = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm)
  ) %>%
  arrange(species, bill_length_std) %>%
  print(n = Inf)
# A tibble: 333 × 5
# Groups:   species [3]
    species   sex    bill_length_mm     n bill_length_std
    <chr>     <chr>           <dbl> <int>           <dbl>
  1 Adelie    female           32.1   146        -2.53   
  2 Adelie    female           33.1   146        -2.15   
  3 Adelie    female           33.5   146        -2.00   
  4 Adelie    female           34     146        -1.81   
  5 Adelie    female           34.4   146        -1.66   
  6 Adelie    female           34.5   146        -1.62   
  7 Adelie    male             34.6   146        -1.59   
  8 Adelie    female           34.6   146        -1.59   
  9 Adelie    female           35     146        -1.44   
 10 Adelie    female           35     146        -1.44   
 11 Adelie    male             35.1   146        -1.40   
 12 Adelie    female           35.2   146        -1.36   
 13 Adelie    female           35.3   146        -1.32   
 14 Adelie    female           35.5   146        -1.25   
 15 Adelie    female           35.5   146        -1.25   
 16 Adelie    female           35.6   146        -1.21   
 17 Adelie    female           35.7   146        -1.17   
 18 Adelie    female           35.7   146        -1.17   
 19 Adelie    female           35.7   146        -1.17   
 20 Adelie    female           35.9   146        -1.10   
 21 Adelie    female           35.9   146        -1.10   
 22 Adelie    female           36     146        -1.06   
 23 Adelie    female           36     146        -1.06   
 24 Adelie    female           36     146        -1.06   
 25 Adelie    female           36     146        -1.06   
 26 Adelie    female           36.2   146        -0.985  
 27 Adelie    female           36.2   146        -0.985  
 28 Adelie    female           36.2   146        -0.985  
 29 Adelie    male             36.3   146        -0.948  
 30 Adelie    female           36.4   146        -0.910  
 31 Adelie    female           36.4   146        -0.910  
 32 Adelie    female           36.5   146        -0.873  
 33 Adelie    female           36.5   146        -0.873  
 34 Adelie    female           36.6   146        -0.835  
 35 Adelie    female           36.6   146        -0.835  
 36 Adelie    female           36.7   146        -0.798  
 37 Adelie    female           36.7   146        -0.798  
 38 Adelie    female           36.8   146        -0.760  
 39 Adelie    female           36.9   146        -0.723  
 40 Adelie    female           37     146        -0.685  
 41 Adelie    female           37     146        -0.685  
 42 Adelie    male             37.2   146        -0.610  
 43 Adelie    male             37.2   146        -0.610  
 44 Adelie    female           37.3   146        -0.572  
 45 Adelie    male             37.3   146        -0.572  
 46 Adelie    female           37.3   146        -0.572  
 47 Adelie    male             37.5   146        -0.497  
 48 Adelie    female           37.6   146        -0.460  
 49 Adelie    male             37.6   146        -0.460  
 50 Adelie    female           37.6   146        -0.460  
 51 Adelie    male             37.7   146        -0.422  
 52 Adelie    female           37.7   146        -0.422  
 53 Adelie    male             37.7   146        -0.422  
 54 Adelie    female           37.8   146        -0.385  
 55 Adelie    male             37.8   146        -0.385  
 56 Adelie    male             37.8   146        -0.385  
 57 Adelie    female           37.9   146        -0.347  
 58 Adelie    female           37.9   146        -0.347  
 59 Adelie    female           38.1   146        -0.272  
 60 Adelie    female           38.1   146        -0.272  
 61 Adelie    female           38.1   146        -0.272  
 62 Adelie    female           38.1   146        -0.272  
 63 Adelie    male             38.2   146        -0.234  
 64 Adelie    male             38.2   146        -0.234  
 65 Adelie    male             38.3   146        -0.197  
 66 Adelie    female           38.5   146        -0.122  
 67 Adelie    male             38.6   146        -0.0841 
 68 Adelie    female           38.6   146        -0.0841 
 69 Adelie    female           38.6   146        -0.0841 
 70 Adelie    female           38.7   146        -0.0466 
 71 Adelie    male             38.8   146        -0.00900
 72 Adelie    male             38.8   146        -0.00900
 73 Adelie    female           38.8   146        -0.00900
 74 Adelie    female           38.9   146         0.0286 
 75 Adelie    female           38.9   146         0.0286 
 76 Adelie    female           39     146         0.0661 
 77 Adelie    female           39     146         0.0661 
 78 Adelie    male             39     146         0.0661 
 79 Adelie    male             39.1   146         0.104  
 80 Adelie    male             39.2   146         0.141  
 81 Adelie    male             39.2   146         0.141  
 82 Adelie    male             39.2   146         0.141  
 83 Adelie    male             39.3   146         0.179  
 84 Adelie    female           39.5   146         0.254  
 85 Adelie    female           39.5   146         0.254  
 86 Adelie    female           39.5   146         0.254  
 87 Adelie    male             39.6   146         0.291  
 88 Adelie    female           39.6   146         0.291  
 89 Adelie    female           39.6   146         0.291  
 90 Adelie    male             39.6   146         0.291  
 91 Adelie    female           39.6   146         0.291  
 92 Adelie    male             39.7   146         0.329  
 93 Adelie    male             39.7   146         0.329  
 94 Adelie    female           39.7   146         0.329  
 95 Adelie    male             39.7   146         0.329  
 96 Adelie    male             39.8   146         0.367  
 97 Adelie    male             40.1   146         0.479  
 98 Adelie    female           40.2   146         0.517  
 99 Adelie    male             40.2   146         0.517  
100 Adelie    female           40.2   146         0.517  
101 Adelie    female           40.3   146         0.554  
102 Adelie    male             40.3   146         0.554  
103 Adelie    female           40.5   146         0.629  
104 Adelie    male             40.5   146         0.629  
105 Adelie    male             40.6   146         0.667  
106 Adelie    male             40.6   146         0.667  
107 Adelie    male             40.6   146         0.667  
108 Adelie    male             40.6   146         0.667  
109 Adelie    male             40.7   146         0.705  
110 Adelie    male             40.8   146         0.742  
111 Adelie    male             40.8   146         0.742  
112 Adelie    male             40.9   146         0.780  
113 Adelie    female           40.9   146         0.780  
114 Adelie    male             41     146         0.817  
115 Adelie    female           41.1   146         0.855  
116 Adelie    male             41.1   146         0.855  
117 Adelie    male             41.1   146         0.855  
118 Adelie    male             41.1   146         0.855  
119 Adelie    male             41.1   146         0.855  
120 Adelie    male             41.1   146         0.855  
121 Adelie    male             41.1   146         0.855  
122 Adelie    male             41.3   146         0.930  
123 Adelie    male             41.3   146         0.930  
124 Adelie    male             41.4   146         0.967  
125 Adelie    male             41.4   146         0.967  
126 Adelie    male             41.5   146         1.01   
127 Adelie    male             41.5   146         1.01   
128 Adelie    male             41.6   146         1.04   
129 Adelie    male             41.8   146         1.12   
130 Adelie    male             42     146         1.19   
131 Adelie    male             42.1   146         1.23   
132 Adelie    female           42.2   146         1.27   
133 Adelie    male             42.2   146         1.27   
134 Adelie    male             42.3   146         1.31   
135 Adelie    male             42.5   146         1.38   
136 Adelie    male             42.7   146         1.46   
137 Adelie    male             42.8   146         1.49   
138 Adelie    male             42.9   146         1.53   
139 Adelie    male             43.1   146         1.61   
140 Adelie    male             43.2   146         1.64   
141 Adelie    male             43.2   146         1.64   
142 Adelie    male             44.1   146         1.98   
143 Adelie    male             44.1   146         1.98   
144 Adelie    male             45.6   146         2.54   
145 Adelie    male             45.8   146         2.62   
146 Adelie    male             46     146         2.70   
147 Chinstrap female           40.9    68        -2.38   
148 Chinstrap female           42.4    68        -1.93   
149 Chinstrap female           42.5    68        -1.90   
150 Chinstrap female           42.5    68        -1.90   
151 Chinstrap female           43.2    68        -1.69   
152 Chinstrap female           43.5    68        -1.60   
153 Chinstrap female           45.2    68        -1.09   
154 Chinstrap female           45.2    68        -1.09   
155 Chinstrap female           45.4    68        -1.03   
156 Chinstrap female           45.5    68        -0.998  
157 Chinstrap female           45.6    68        -0.968  
158 Chinstrap female           45.7    68        -0.938  
159 Chinstrap female           45.7    68        -0.938  
160 Chinstrap female           45.9    68        -0.879  
161 Chinstrap female           46      68        -0.849  
162 Chinstrap female           46.1    68        -0.819  
163 Chinstrap female           46.2    68        -0.789  
164 Chinstrap female           46.4    68        -0.729  
165 Chinstrap female           46.4    68        -0.729  
166 Chinstrap female           46.5    68        -0.699  
167 Chinstrap female           46.6    68        -0.669  
168 Chinstrap female           46.7    68        -0.639  
169 Chinstrap female           46.8    68        -0.609  
170 Chinstrap female           46.9    68        -0.579  
171 Chinstrap female           47      68        -0.549  
172 Chinstrap female           47.5    68        -0.399  
173 Chinstrap female           47.6    68        -0.369  
174 Chinstrap female           48.1    68        -0.220  
175 Chinstrap male             48.5    68        -0.100  
176 Chinstrap male             49      68         0.0498 
177 Chinstrap male             49      68         0.0498 
178 Chinstrap male             49.2    68         0.110  
179 Chinstrap male             49.3    68         0.140  
180 Chinstrap male             49.5    68         0.199  
181 Chinstrap male             49.6    68         0.229  
182 Chinstrap male             49.7    68         0.259  
183 Chinstrap female           49.8    68         0.289  
184 Chinstrap male             50      68         0.349  
185 Chinstrap female           50.1    68         0.379  
186 Chinstrap male             50.2    68         0.409  
187 Chinstrap female           50.2    68         0.409  
188 Chinstrap male             50.3    68         0.439  
189 Chinstrap male             50.5    68         0.499  
190 Chinstrap female           50.5    68         0.499  
191 Chinstrap male             50.6    68         0.529  
192 Chinstrap male             50.7    68         0.559  
193 Chinstrap male             50.8    68         0.589  
194 Chinstrap male             50.8    68         0.589  
195 Chinstrap male             50.9    68         0.619  
196 Chinstrap female           50.9    68         0.619  
197 Chinstrap male             51      68         0.649  
198 Chinstrap male             51.3    68         0.739  
199 Chinstrap male             51.3    68         0.739  
200 Chinstrap male             51.3    68         0.739  
201 Chinstrap male             51.4    68         0.768  
202 Chinstrap male             51.5    68         0.798  
203 Chinstrap male             51.7    68         0.858  
204 Chinstrap male             51.9    68         0.918  
205 Chinstrap male             52      68         0.948  
206 Chinstrap male             52      68         0.948  
207 Chinstrap male             52      68         0.948  
208 Chinstrap male             52.2    68         1.01   
209 Chinstrap male             52.7    68         1.16   
210 Chinstrap male             52.8    68         1.19   
211 Chinstrap male             53.5    68         1.40   
212 Chinstrap male             54.2    68         1.61   
213 Chinstrap male             55.8    68         2.09   
214 Chinstrap female           58      68         2.74   
215 Gentoo    female           40.9   119        -2.15   
216 Gentoo    female           41.7   119        -1.89   
217 Gentoo    female           42     119        -1.79   
218 Gentoo    female           42.6   119        -1.60   
219 Gentoo    female           42.7   119        -1.57   
220 Gentoo    female           42.8   119        -1.54   
221 Gentoo    female           42.9   119        -1.50   
222 Gentoo    female           43.2   119        -1.41   
223 Gentoo    female           43.3   119        -1.37   
224 Gentoo    female           43.3   119        -1.37   
225 Gentoo    female           43.4   119        -1.34   
226 Gentoo    female           43.5   119        -1.31   
227 Gentoo    female           43.5   119        -1.31   
228 Gentoo    female           43.6   119        -1.28   
229 Gentoo    female           43.8   119        -1.21   
230 Gentoo    female           44     119        -1.15   
231 Gentoo    male             44.4   119        -1.02   
232 Gentoo    female           44.5   119        -0.988  
233 Gentoo    female           44.9   119        -0.859  
234 Gentoo    female           44.9   119        -0.859  
235 Gentoo    male             45     119        -0.827  
236 Gentoo    female           45.1   119        -0.795  
237 Gentoo    female           45.1   119        -0.795  
238 Gentoo    female           45.1   119        -0.795  
239 Gentoo    male             45.2   119        -0.762  
240 Gentoo    female           45.2   119        -0.762  
241 Gentoo    male             45.2   119        -0.762  
242 Gentoo    female           45.2   119        -0.762  
243 Gentoo    female           45.3   119        -0.730  
244 Gentoo    female           45.3   119        -0.730  
245 Gentoo    female           45.4   119        -0.698  
246 Gentoo    female           45.5   119        -0.666  
247 Gentoo    female           45.5   119        -0.666  
248 Gentoo    male             45.5   119        -0.666  
249 Gentoo    female           45.5   119        -0.666  
250 Gentoo    female           45.7   119        -0.601  
251 Gentoo    female           45.8   119        -0.569  
252 Gentoo    female           45.8   119        -0.569  
253 Gentoo    female           46.1   119        -0.473  
254 Gentoo    male             46.1   119        -0.473  
255 Gentoo    female           46.2   119        -0.440  
256 Gentoo    male             46.2   119        -0.440  
257 Gentoo    female           46.2   119        -0.440  
258 Gentoo    male             46.3   119        -0.408  
259 Gentoo    male             46.4   119        -0.376  
260 Gentoo    female           46.4   119        -0.376  
261 Gentoo    female           46.5   119        -0.344  
262 Gentoo    female           46.5   119        -0.344  
263 Gentoo    female           46.5   119        -0.344  
264 Gentoo    female           46.5   119        -0.344  
265 Gentoo    female           46.6   119        -0.312  
266 Gentoo    male             46.7   119        -0.279  
267 Gentoo    male             46.8   119        -0.247  
268 Gentoo    male             46.8   119        -0.247  
269 Gentoo    female           46.8   119        -0.247  
270 Gentoo    female           46.9   119        -0.215  
271 Gentoo    female           47.2   119        -0.118  
272 Gentoo    female           47.2   119        -0.118  
273 Gentoo    male             47.3   119        -0.0863 
274 Gentoo    female           47.4   119        -0.0541 
275 Gentoo    female           47.5   119        -0.0219 
276 Gentoo    female           47.5   119        -0.0219 
277 Gentoo    female           47.5   119        -0.0219 
278 Gentoo    male             47.6   119         0.0103 
279 Gentoo    female           47.7   119         0.0425 
280 Gentoo    male             47.8   119         0.0747 
281 Gentoo    male             48.1   119         0.171  
282 Gentoo    female           48.2   119         0.203  
283 Gentoo    male             48.2   119         0.203  
284 Gentoo    male             48.4   119         0.268  
285 Gentoo    male             48.4   119         0.268  
286 Gentoo    female           48.4   119         0.268  
287 Gentoo    male             48.5   119         0.300  
288 Gentoo    female           48.5   119         0.300  
289 Gentoo    male             48.6   119         0.332  
290 Gentoo    female           48.7   119         0.364  
291 Gentoo    male             48.7   119         0.364  
292 Gentoo    male             48.7   119         0.364  
293 Gentoo    male             48.8   119         0.397  
294 Gentoo    male             49     119         0.461  
295 Gentoo    female           49.1   119         0.493  
296 Gentoo    female           49.1   119         0.493  
297 Gentoo    male             49.1   119         0.493  
298 Gentoo    male             49.2   119         0.525  
299 Gentoo    male             49.3   119         0.558  
300 Gentoo    male             49.4   119         0.590  
301 Gentoo    male             49.5   119         0.622  
302 Gentoo    male             49.5   119         0.622  
303 Gentoo    male             49.6   119         0.654  
304 Gentoo    male             49.6   119         0.654  
305 Gentoo    male             49.8   119         0.719  
306 Gentoo    male             49.8   119         0.719  
307 Gentoo    male             49.9   119         0.751  
308 Gentoo    male             50     119         0.783  
309 Gentoo    male             50     119         0.783  
310 Gentoo    male             50     119         0.783  
311 Gentoo    male             50     119         0.783  
312 Gentoo    male             50.1   119         0.815  
313 Gentoo    male             50.2   119         0.847  
314 Gentoo    male             50.4   119         0.912  
315 Gentoo    male             50.4   119         0.912  
316 Gentoo    male             50.5   119         0.944  
317 Gentoo    male             50.5   119         0.944  
318 Gentoo    female           50.5   119         0.944  
319 Gentoo    male             50.7   119         1.01   
320 Gentoo    male             50.8   119         1.04   
321 Gentoo    male             50.8   119         1.04   
322 Gentoo    male             51.1   119         1.14   
323 Gentoo    male             51.1   119         1.14   
324 Gentoo    male             51.3   119         1.20   
325 Gentoo    male             51.5   119         1.27   
326 Gentoo    male             52.1   119         1.46   
327 Gentoo    male             52.2   119         1.49   
328 Gentoo    male             52.5   119         1.59   
329 Gentoo    male             53.4   119         1.88   
330 Gentoo    male             54.3   119         2.17   
331 Gentoo    male             55.1   119         2.42   
332 Gentoo    male             55.9   119         2.68   
333 Gentoo    male             59.6   119         3.87   
# get the largest 3 penguins by bill depth from each species
penguins %>%
  group_by(species) %>%
  slice_max(bill_depth_mm, n = 3)
# A tibble: 9 × 8
# Groups:   species [3]
  species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex  
  <chr>     <chr>             <dbl>         <dbl>             <dbl>       <dbl> <chr>
1 Adelie    Torgers…           46            21.5               194        4200 male 
2 Adelie    Torgers…           38.6          21.2               191        3800 male 
3 Adelie    Dream              42.3          21.2               191        4150 male 
4 Chinstrap Dream              54.2          20.8               201        4300 male 
5 Chinstrap Dream              52            20.7               210        4800 male 
6 Chinstrap Dream              51.7          20.3               194        3775 male 
7 Gentoo    Biscoe             44.4          17.3               219        5250 male 
8 Gentoo    Biscoe             50.8          17.3               228        5600 male 
9 Gentoo    Biscoe             52.2          17.1               228        5400 male 
# ℹ 1 more variable: year <dbl>

8.1.1 Regrouping

Sometimes one group_by() may not be enough to get what you need; you may need to re-group_by() by something else to finish the job. For example, suppose you want to see what percent of each species came from different islands. This requires two uses of group_by():

# first group by species + island and summarize to get size of each group,
# then regroup by species and MUTATE (not summarize) totals for each species,
# then divide these to get proportion of each species from each island
penguins %>%
  group_by(species, island) %>%
  summarize(n = n()) %>%
  group_by(species) %>%
  mutate(
    species_total = sum(n),
    pct_of_species_from_island = n / species_total * 100
  )
# A tibble: 5 × 5
# Groups:   species [3]
  species   island        n species_total pct_of_species_from_island
  <chr>     <chr>     <int>         <int>                      <dbl>
1 Adelie    Biscoe       44           146                       30.1
2 Adelie    Dream        55           146                       37.7
3 Adelie    Torgersen    47           146                       32.2
4 Chinstrap Dream        68            68                      100  
5 Gentoo    Biscoe      119           119                      100  

This combination of df %>% group_by(...) %>% summarize(n = n()) is so common, we have this shortcut for it: df %>% count(...). We can demonstrate this in another example involving regrouping. Suppose we want to know what percent of each species were male/female:

# again, first group by species + sex and get size using the shortcut count(),
# then regroup by species and MUTATE totals for each species,
# then divide to get proportions of each species that were male/female
penguins %>%
  count(species, sex) %>%
  group_by(species) %>%
  mutate(
    species_total = sum(n),
    pct_of_species_each_sex = n / species_total * 100
  )
# A tibble: 6 × 5
# Groups:   species [3]
  species   sex        n species_total pct_of_species_each_sex
  <chr>     <chr>  <int>         <int>                   <dbl>
1 Adelie    female    73           146                    50  
2 Adelie    male      73           146                    50  
3 Chinstrap female    34            68                    50  
4 Chinstrap male      34            68                    50  
5 Gentoo    female    58           119                    48.7
6 Gentoo    male      61           119                    51.3

Groups are also useful for prepping data frames for plotting. For example, here’s a chunk that produces a bar plot showing how mean body mass changes by species and sex:

# get mean body mass by species + sex and plot
# the %T>% is a special pipe called a Tee pipe
# it's a shortcut for piping something into 2 different operations,
# useful for example when you want to print a data frame, then also plot it
# see https://magrittr.tidyverse.org/reference/tee.html for more
penguins %>% 
  group_by(species, sex) %>% 
  summarize(mean_mass = mean(body_mass_g)) %T>% print %>% 
  ggplot(aes(x = species, y = mean_mass, fill = sex)) +
  geom_col(position = "dodge2") +
  labs(x = "Species", y = "Mean body mass (g)",
       title = "Mean body mass of Palmer penguins by species + sex")
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    mean_mass
  <chr>     <chr>      <dbl>
1 Adelie    female     3369.
2 Adelie    male       4043.
3 Chinstrap female     3527.
4 Chinstrap male       3939.
5 Gentoo    female     4680.
6 Gentoo    male       5485.

8.1.2 Ungrouping

Many operations output grouped data frames. For example, look closely at the output of the previous chunks and you’ll see # Groups: species [3] in most of them. This means any further operations you run will continue to execute in a grouped way.

You can remove the grouping structure with ungroup(). This allows you to revert to running operations on the entire data frame. Example:

# get count of each species + sex combination,
# but this time get its percentage out of ALL observations
penguins %>%
  count(species, sex) %>%
  ungroup() %>%
  mutate(pct_of_all = n / sum(n))
# A tibble: 6 × 4
  species   sex        n pct_of_all
  <chr>     <chr>  <int>      <dbl>
1 Adelie    female    73      0.219
2 Adelie    male      73      0.219
3 Chinstrap female    34      0.102
4 Chinstrap male      34      0.102
5 Gentoo    female    58      0.174
6 Gentoo    male      61      0.183

8.1.3 More practice! (fertility data)

Let’s give the penguins dataset a rest and practice dplyr and grouping a bit more with a different dataset. The following chunk imports fertility.csv, the cleaned global fertility data set from World Bank, giving the average number of births per woman for each year and country from 1960 to present. For the past few decades, global fertility has been sharply declining for most countries. Many countries are now below the replacement rate of 2.1, leading to widespread concerns of a population collapse in the latter part of the 21st century.

fertility <- read_csv("https://bwu62.github.io/stat240-revamp/data/fertility.csv")
fertility
# A tibble: 13,792 × 6
   code  country     region     income_group  year  rate
   <chr> <chr>       <chr>      <chr>        <dbl> <dbl>
 1 AFG   Afghanistan South Asia Low income    1960  7.28
 2 AFG   Afghanistan South Asia Low income    1961  7.28
 3 AFG   Afghanistan South Asia Low income    1962  7.29
 4 AFG   Afghanistan South Asia Low income    1963  7.30
 5 AFG   Afghanistan South Asia Low income    1964  7.30
 6 AFG   Afghanistan South Asia Low income    1965  7.30
 7 AFG   Afghanistan South Asia Low income    1966  7.32
 8 AFG   Afghanistan South Asia Low income    1967  7.34
 9 AFG   Afghanistan South Asia Low income    1968  7.36
10 AFG   Afghanistan South Asia Low income    1969  7.39
# ℹ 13,782 more rows

Each country has over 60 years of data in this dataset. We can see an overview which countries are represented in the dataset and what their listed region and income group are by temporarily dropping years, removing duplicates, and printing the full output below in a collapsible box:

fertility %>%
  select(country:income_group) %>%
  distinct() %>%
  print(n = Inf)
# A tibble: 215 × 3
    country                  region                     income_group       
    <chr>                    <chr>                      <chr>              
  1 Afghanistan              South Asia                 Low income         
  2 Albania                  Europe & Central Asia      Upper middle income
  3 Algeria                  Middle East & North Africa Upper middle income
  4 American Samoa           East Asia & Pacific        High income        
  5 Andorra                  Europe & Central Asia      High income        
  6 Angola                   Sub-Saharan Africa         Lower middle income
  7 Antigua & Barbuda        Latin America & Caribbean  High income        
  8 Argentina                Latin America & Caribbean  Upper middle income
  9 Armenia                  Europe & Central Asia      Upper middle income
 10 Aruba                    Latin America & Caribbean  High income        
 11 Australia                East Asia & Pacific        High income        
 12 Austria                  Europe & Central Asia      High income        
 13 Azerbaijan               Europe & Central Asia      Upper middle income
 14 Bahamas                  Latin America & Caribbean  High income        
 15 Bahrain                  Middle East & North Africa High income        
 16 Bangladesh               South Asia                 Lower middle income
 17 Barbados                 Latin America & Caribbean  High income        
 18 Belarus                  Europe & Central Asia      Upper middle income
 19 Belgium                  Europe & Central Asia      High income        
 20 Belize                   Latin America & Caribbean  Upper middle income
 21 Benin                    Sub-Saharan Africa         Lower middle income
 22 Bermuda                  North America              High income        
 23 Bhutan                   South Asia                 Lower middle income
 24 Bolivia                  Latin America & Caribbean  Lower middle income
 25 Bosnia & Herzegovina     Europe & Central Asia      Upper middle income
 26 Botswana                 Sub-Saharan Africa         Upper middle income
 27 Brazil                   Latin America & Caribbean  Upper middle income
 28 Brunei                   East Asia & Pacific        High income        
 29 Bulgaria                 Europe & Central Asia      High income        
 30 Burkina Faso             Sub-Saharan Africa         Low income         
 31 Burundi                  Sub-Saharan Africa         Low income         
 32 Cambodia                 East Asia & Pacific        Lower middle income
 33 Cameroon                 Sub-Saharan Africa         Lower middle income
 34 Canada                   North America              High income        
 35 Cape Verde               Sub-Saharan Africa         Lower middle income
 36 Cayman Islands           Latin America & Caribbean  High income        
 37 Central African Republic Sub-Saharan Africa         Low income         
 38 Chad                     Sub-Saharan Africa         Low income         
 39 Channel Islands          Europe & Central Asia      High income        
 40 Chile                    Latin America & Caribbean  High income        
 41 China                    East Asia & Pacific        Upper middle income
 42 Colombia                 Latin America & Caribbean  Upper middle income
 43 Comoros                  Sub-Saharan Africa         Lower middle income
 44 Congo, Dem. Rep.         Sub-Saharan Africa         Low income         
 45 Congo, Rep.              Sub-Saharan Africa         Lower middle income
 46 Costa Rica               Latin America & Caribbean  Upper middle income
 47 Croatia                  Europe & Central Asia      High income        
 48 Cuba                     Latin America & Caribbean  Upper middle income
 49 Curaçao                  Latin America & Caribbean  High income        
 50 Cyprus                   Europe & Central Asia      High income        
 51 Czechia                  Europe & Central Asia      High income        
 52 Denmark                  Europe & Central Asia      High income        
 53 Djibouti                 Middle East & North Africa Lower middle income
 54 Dominica                 Latin America & Caribbean  Upper middle income
 55 Dominican Republic       Latin America & Caribbean  Upper middle income
 56 East Timor               East Asia & Pacific        Lower middle income
 57 Ecuador                  Latin America & Caribbean  Upper middle income
 58 Egypt                    Middle East & North Africa Lower middle income
 59 El Salvador              Latin America & Caribbean  Upper middle income
 60 Equatorial Guinea        Sub-Saharan Africa         Upper middle income
 61 Eritrea                  Sub-Saharan Africa         Low income         
 62 Estonia                  Europe & Central Asia      High income        
 63 Eswatini                 Sub-Saharan Africa         Lower middle income
 64 Ethiopia                 Sub-Saharan Africa         Low income         
 65 Faroe Islands            Europe & Central Asia      High income        
 66 Fiji                     East Asia & Pacific        Upper middle income
 67 Finland                  Europe & Central Asia      High income        
 68 France                   Europe & Central Asia      High income        
 69 French Polynesia         East Asia & Pacific        High income        
 70 Gabon                    Sub-Saharan Africa         Upper middle income
 71 Gambia                   Sub-Saharan Africa         Low income         
 72 Georgia                  Europe & Central Asia      Upper middle income
 73 Germany                  Europe & Central Asia      High income        
 74 Ghana                    Sub-Saharan Africa         Lower middle income
 75 Gibraltar                Europe & Central Asia      High income        
 76 Greece                   Europe & Central Asia      High income        
 77 Greenland                Europe & Central Asia      High income        
 78 Grenada                  Latin America & Caribbean  Upper middle income
 79 Guam                     East Asia & Pacific        High income        
 80 Guatemala                Latin America & Caribbean  Upper middle income
 81 Guinea                   Sub-Saharan Africa         Lower middle income
 82 Guinea-Bissau            Sub-Saharan Africa         Low income         
 83 Guyana                   Latin America & Caribbean  High income        
 84 Haiti                    Latin America & Caribbean  Lower middle income
 85 Honduras                 Latin America & Caribbean  Lower middle income
 86 Hong Kong                East Asia & Pacific        High income        
 87 Hungary                  Europe & Central Asia      High income        
 88 Iceland                  Europe & Central Asia      High income        
 89 India                    South Asia                 Lower middle income
 90 Indonesia                East Asia & Pacific        Upper middle income
 91 Iran                     Middle East & North Africa Upper middle income
 92 Iraq                     Middle East & North Africa Upper middle income
 93 Ireland                  Europe & Central Asia      High income        
 94 Isle of Man              Europe & Central Asia      High income        
 95 Israel                   Middle East & North Africa High income        
 96 Italy                    Europe & Central Asia      High income        
 97 Ivory Coast              Sub-Saharan Africa         Lower middle income
 98 Jamaica                  Latin America & Caribbean  Upper middle income
 99 Japan                    East Asia & Pacific        High income        
100 Jordan                   Middle East & North Africa Lower middle income
101 Kazakhstan               Europe & Central Asia      Upper middle income
102 Kenya                    Sub-Saharan Africa         Lower middle income
103 Kiribati                 East Asia & Pacific        Lower middle income
104 Kosovo                   Europe & Central Asia      Upper middle income
105 Kuwait                   Middle East & North Africa High income        
106 Kyrgyzstan               Europe & Central Asia      Lower middle income
107 Laos                     East Asia & Pacific        Lower middle income
108 Latvia                   Europe & Central Asia      High income        
109 Lebanon                  Middle East & North Africa Lower middle income
110 Lesotho                  Sub-Saharan Africa         Lower middle income
111 Liberia                  Sub-Saharan Africa         Low income         
112 Libya                    Middle East & North Africa Upper middle income
113 Liechtenstein            Europe & Central Asia      High income        
114 Lithuania                Europe & Central Asia      High income        
115 Luxembourg               Europe & Central Asia      High income        
116 Macao                    East Asia & Pacific        High income        
117 Madagascar               Sub-Saharan Africa         Low income         
118 Malawi                   Sub-Saharan Africa         Low income         
119 Malaysia                 East Asia & Pacific        Upper middle income
120 Maldives                 South Asia                 Upper middle income
121 Mali                     Sub-Saharan Africa         Low income         
122 Malta                    Middle East & North Africa High income        
123 Marshall Islands         East Asia & Pacific        Upper middle income
124 Mauritania               Sub-Saharan Africa         Lower middle income
125 Mauritius                Sub-Saharan Africa         Upper middle income
126 Mexico                   Latin America & Caribbean  Upper middle income
127 Micronesia               East Asia & Pacific        Lower middle income
128 Moldova                  Europe & Central Asia      Upper middle income
129 Monaco                   Europe & Central Asia      High income        
130 Mongolia                 East Asia & Pacific        Upper middle income
131 Montenegro               Europe & Central Asia      Upper middle income
132 Morocco                  Middle East & North Africa Lower middle income
133 Mozambique               Sub-Saharan Africa         Low income         
134 Myanmar                  East Asia & Pacific        Lower middle income
135 Namibia                  Sub-Saharan Africa         Upper middle income
136 Nauru                    East Asia & Pacific        High income        
137 Nepal                    South Asia                 Lower middle income
138 Netherlands              Europe & Central Asia      High income        
139 New Caledonia            East Asia & Pacific        High income        
140 New Zealand              East Asia & Pacific        High income        
141 Nicaragua                Latin America & Caribbean  Lower middle income
142 Niger                    Sub-Saharan Africa         Low income         
143 Nigeria                  Sub-Saharan Africa         Lower middle income
144 North Korea              East Asia & Pacific        Low income         
145 North Macedonia          Europe & Central Asia      Upper middle income
146 Northern Mariana Islands East Asia & Pacific        High income        
147 Norway                   Europe & Central Asia      High income        
148 Oman                     Middle East & North Africa High income        
149 Pakistan                 South Asia                 Lower middle income
150 Palau                    East Asia & Pacific        High income        
151 Panama                   Latin America & Caribbean  High income        
152 Papua New Guinea         East Asia & Pacific        Lower middle income
153 Paraguay                 Latin America & Caribbean  Upper middle income
154 Peru                     Latin America & Caribbean  Upper middle income
155 Philippines              East Asia & Pacific        Lower middle income
156 Poland                   Europe & Central Asia      High income        
157 Portugal                 Europe & Central Asia      High income        
158 Puerto Rico              Latin America & Caribbean  High income        
159 Qatar                    Middle East & North Africa High income        
160 Romania                  Europe & Central Asia      High income        
161 Russia                   Europe & Central Asia      High income        
162 Rwanda                   Sub-Saharan Africa         Low income         
163 Samoa                    East Asia & Pacific        Lower middle income
164 San Marino               Europe & Central Asia      High income        
165 Saudi Arabia             Middle East & North Africa High income        
166 Senegal                  Sub-Saharan Africa         Lower middle income
167 Serbia                   Europe & Central Asia      Upper middle income
168 Seychelles               Sub-Saharan Africa         High income        
169 Sierra Leone             Sub-Saharan Africa         Low income         
170 Singapore                East Asia & Pacific        High income        
171 Sint Maarten             Latin America & Caribbean  High income        
172 Slovakia                 Europe & Central Asia      High income        
173 Slovenia                 Europe & Central Asia      High income        
174 Solomon Islands          East Asia & Pacific        Lower middle income
175 Somalia                  Sub-Saharan Africa         Low income         
176 South Africa             Sub-Saharan Africa         Upper middle income
177 South Korea              East Asia & Pacific        High income        
178 South Sudan              Sub-Saharan Africa         Low income         
179 Spain                    Europe & Central Asia      High income        
180 Sri Lanka                South Asia                 Lower middle income
181 St. Kitts & Nevis        Latin America & Caribbean  High income        
182 St. Lucia                Latin America & Caribbean  Upper middle income
183 St. Martin               Latin America & Caribbean  High income        
184 St. Vincent & Grenadines Latin America & Caribbean  Upper middle income
185 Sudan                    Sub-Saharan Africa         Low income         
186 Suriname                 Latin America & Caribbean  Upper middle income
187 Sweden                   Europe & Central Asia      High income        
188 Switzerland              Europe & Central Asia      High income        
189 Syria                    Middle East & North Africa Low income         
190 São Tomé & Principe      Sub-Saharan Africa         Lower middle income
191 Tajikistan               Europe & Central Asia      Lower middle income
192 Tanzania                 Sub-Saharan Africa         Lower middle income
193 Thailand                 East Asia & Pacific        Upper middle income
194 Togo                     Sub-Saharan Africa         Low income         
195 Tonga                    East Asia & Pacific        Upper middle income
196 Trinidad & Tobago        Latin America & Caribbean  High income        
197 Tunisia                  Middle East & North Africa Lower middle income
198 Turkey                   Europe & Central Asia      Upper middle income
199 Turkmenistan             Europe & Central Asia      Upper middle income
200 Turks & Caicos Islands   Latin America & Caribbean  High income        
201 Tuvalu                   East Asia & Pacific        Upper middle income
202 Uganda                   Sub-Saharan Africa         Low income         
203 Ukraine                  Europe & Central Asia      Upper middle income
204 United Arab Emirates     Middle East & North Africa High income        
205 United Kingdom           Europe & Central Asia      High income        
206 United States            North America              High income        
207 Uruguay                  Latin America & Caribbean  High income        
208 Uzbekistan               Europe & Central Asia      Lower middle income
209 Vanuatu                  East Asia & Pacific        Lower middle income
210 Vietnam                  East Asia & Pacific        Lower middle income
211 Virgin Islands           Latin America & Caribbean  High income        
212 West Bank & Gaza         Middle East & North Africa Lower middle income
213 Yemen                    Middle East & North Africa Low income         
214 Zambia                   Sub-Saharan Africa         Lower middle income
215 Zimbabwe                 Sub-Saharan Africa         Lower middle income

One additional small processing step we should do before continuing is convert income_group to an ordered factor (see section 3.9.5), which will be important later.

fertility <- fertility %>% mutate(
  income_group = factor(income_group, ordered = TRUE, levels = c(
    "Low", "Lower middle", "Upper middle", "High"))
)

We can begin by running a few summaries to explore the dataset. To start, here’s a chunk showing the number of countries and median income group for countries in each region:

# strangely, base R median doesn't work on ordered categories,
# but we can use Median from DescTools instead
fertility %>%
  select(country, region, income_group) %>%
  distinct() %>%
  group_by(region) %>%
  summarize(n = n(), median = DescTools::Median(income_group)) %>%
  arrange(desc(median))
# A tibble: 7 × 3
  region                         n median
  <chr>                      <int> <lgl> 
1 East Asia & Pacific           37 NA    
2 Europe & Central Asia         58 NA    
3 Latin America & Caribbean     40 NA    
4 Middle East & North Africa    21 NA    
5 North America                  3 NA    
6 South Asia                     8 NA    
7 Sub-Saharan Africa            48 NA    

Next, here’s a chunk showing the median fertility rate in each region for the most recent year of 2023, as well as the countries with the highest and lowest 2023 rates (and what the rates are) in each region:

# first filter to get the right year, then
# sort by region, rate (so min, max are the first, last in each group)
# then summarize to get median, and min/max country/rate
fertility %>%
  filter(year == max(year)) %>%
  arrange(region, rate) %>%
  group_by(region) %>%
  summarize(
    n           = n(),
    median      = median(rate),
    min_country = first(country),
    min         = first(rate),
    max_country = last(country),
    max         = last(rate)
  ) %>%
  arrange(median)
# A tibble: 7 × 7
  region                         n median min_country   min max_country     max
  <chr>                      <int>  <dbl> <chr>       <dbl> <chr>         <dbl>
1 North America                  3   1.40 Canada      1.26  United States  1.62
2 Europe & Central Asia         58   1.49 Ukraine     0.977 Uzbekistan     3.5 
3 Latin America & Caribbean     41   1.62 Puerto Rico 0.92  St. Martin     2.72
4 South Asia                     8   1.98 Bhutan      1.46  Afghanistan    4.84
5 East Asia & Pacific           37   2.13 Macao       0.586 Samoa          3.83
6 Middle East & North Africa    21   2.36 Malta       1.06  Yemen          4.59
7 Sub-Saharan Africa            48   4.00 Mauritius   1.39  Somalia        6.13

We can also show the latest rate for each country, as well as the change from 2000, just before the start of the 21st century:

# first filter to get the right years, then
# sort by country, year (so 2000, 2022 are first and last in each group)
# then summarize to get 2000 and 2022 rates, mutate to get change,
# then ungroup, distinct, and arrange to display a neat output
# again, collapsing output due to lengthy print out
# %T>% is used again to both save and print the results
fertility_change <- fertility %>%
  filter(year %in% c(2000, max(year))) %>%
  arrange(country, year) %>%
  group_by(country) %>%
  mutate(
    rate2000 = first(rate),
    rate2022 = last(rate),
    change   = rate2022 - rate2000
  ) %>%
  select(country, region, rate2000, change, rate2022) %>%
  ungroup() %>%
  distinct() %>%
  arrange(rate2022) %T>%
  print(n = Inf)
# A tibble: 215 × 5
    country                  region                     rate2000  change rate2022
    <chr>                    <chr>                         <dbl>   <dbl>    <dbl>
  1 Macao                    East Asia & Pacific           0.933 -0.347     0.586
  2 South Korea              East Asia & Pacific           1.48  -0.759     0.721
  3 Hong Kong                East Asia & Pacific           1.03  -0.281     0.751
  4 Puerto Rico              Latin America & Caribbean     2.05  -1.13      0.92 
  5 Singapore                East Asia & Pacific           1.6   -0.63      0.97 
  6 Ukraine                  Europe & Central Asia         1.11  -0.131     0.977
  7 China                    East Asia & Pacific           1.63  -0.629     0.999
  8 Malta                    Middle East & North Africa    1.68  -0.62      1.06 
  9 Andorra                  Europe & Central Asia         1.27  -0.191     1.08 
 10 Spain                    Europe & Central Asia         1.22  -0.100     1.12 
 11 San Marino               Europe & Central Asia         1.40  -0.255     1.15 
 12 Poland                   Europe & Central Asia         1.37  -0.212     1.16 
 13 Chile                    Latin America & Caribbean     2.05  -0.879     1.17 
 14 Lithuania                Europe & Central Asia         1.39  -0.21      1.18 
 15 Curaçao                  Latin America & Caribbean     2.34  -1.14      1.2  
 16 Italy                    Europe & Central Asia         1.26  -0.0600    1.2  
 17 Japan                    East Asia & Pacific           1.36  -0.160     1.2  
 18 United Arab Emirates     Middle East & North Africa    2.73  -1.53      1.2  
 19 Belarus                  Europe & Central Asia         1.32  -0.109     1.21 
 20 Thailand                 East Asia & Pacific           1.73  -0.522     1.21 
 21 Luxembourg               Europe & Central Asia         1.76  -0.51      1.25 
 22 Canada                   North America                 1.51  -0.25      1.26 
 23 Finland                  Europe & Central Asia         1.73  -0.47      1.26 
 24 Estonia                  Europe & Central Asia         1.36  -0.0500    1.31 
 25 Austria                  Europe & Central Asia         1.36  -0.0400    1.32 
 26 Greece                   Europe & Central Asia         1.25   0.0700    1.32 
 27 Costa Rica               Latin America & Caribbean     2.41  -1.08      1.33 
 28 Switzerland              Europe & Central Asia         1.5   -0.17      1.33 
 29 Albania                  Europe & Central Asia         2.22  -0.869     1.35 
 30 Jamaica                  Latin America & Caribbean     2.35  -0.987     1.36 
 31 Latvia                   Europe & Central Asia         1.25   0.110     1.36 
 32 Channel Islands          Europe & Central Asia         1.49  -0.120     1.37 
 33 Bahamas                  Latin America & Caribbean     2.06  -0.687     1.37 
 34 St. Lucia                Latin America & Caribbean     2.21  -0.828     1.38 
 35 Cyprus                   Europe & Central Asia         1.64  -0.249     1.39 
 36 Germany                  Europe & Central Asia         1.38   0.0100    1.39 
 37 Mauritius                Sub-Saharan Africa            1.99  -0.6       1.39 
 38 Norway                   Europe & Central Asia         1.85  -0.45      1.4  
 39 Bermuda                  North America                 1.74  -0.337     1.40 
 40 Uruguay                  Latin America & Caribbean     2.20  -0.787     1.41 
 41 Russia                   Europe & Central Asia         1.20   0.215     1.41 
 42 Netherlands              Europe & Central Asia         1.72  -0.29      1.43 
 43 Cuba                     Latin America & Caribbean     1.61  -0.169     1.44 
 44 Portugal                 Europe & Central Asia         1.55  -0.110     1.44 
 45 Sint Maarten             Latin America & Caribbean     1.98  -0.535     1.45 
 46 Czechia                  Europe & Central Asia         1.15   0.3       1.45 
 47 Liechtenstein            Europe & Central Asia         1.57  -0.120     1.45 
 48 Sweden                   Europe & Central Asia         1.54  -0.0900    1.45 
 49 Croatia                  Europe & Central Asia         1.39   0.0700    1.46 
 50 Bhutan                   South Asia                    3.38  -1.92      1.46 
 51 Turks & Caicos Islands   Latin America & Caribbean     2.41  -0.943     1.46 
 52 Belgium                  Europe & Central Asia         1.67  -0.2       1.47 
 53 Dominica                 Latin America & Caribbean     2.35  -0.862     1.48 
 54 Bosnia & Herzegovina     Europe & Central Asia         1.28   0.204     1.49 
 55 Grenada                  Latin America & Caribbean     2.44  -0.954     1.49 
 56 Slovakia                 Europe & Central Asia         1.3    0.19      1.49 
 57 Argentina                Latin America & Caribbean     2.59  -1.09      1.5  
 58 Australia                East Asia & Pacific           1.76  -0.256     1.5  
 59 Denmark                  Europe & Central Asia         1.77  -0.27      1.5  
 60 Ireland                  Europe & Central Asia         1.89  -0.39      1.5  
 61 North Macedonia          Europe & Central Asia         1.86  -0.363     1.5  
 62 French Polynesia         East Asia & Pacific           2.60  -1.10      1.50 
 63 St. Kitts & Nevis        Latin America & Caribbean     2.13  -0.621     1.50 
 64 Hungary                  Europe & Central Asia         1.32   0.19      1.51 
 65 Slovenia                 Europe & Central Asia         1.26   0.25      1.51 
 66 Turkey                   Europe & Central Asia         2.49  -0.977     1.51 
 67 Cape Verde               Sub-Saharan Africa            3.54  -2.02      1.52 
 68 Kuwait                   Middle East & North Africa    2.89  -1.36      1.52 
 69 Cayman Islands           Latin America & Caribbean     1.71  -0.176     1.53 
 70 Trinidad & Tobago        Latin America & Caribbean     1.73  -0.2       1.53 
 71 Kosovo                   Europe & Central Asia         2.70  -1.15      1.54 
 72 Isle of Man              Europe & Central Asia         1.87  -0.318     1.55 
 73 Azerbaijan               Europe & Central Asia         2     -0.45      1.55 
 74 Malaysia                 East Asia & Pacific           2.94  -1.39      1.55 
 75 New Zealand              East Asia & Pacific           1.98  -0.42      1.56 
 76 United Kingdom           Europe & Central Asia         1.64  -0.0800    1.56 
 77 Maldives                 South Asia                    2.69  -1.12      1.58 
 78 Antigua & Barbuda        Latin America & Caribbean     2.20  -0.618     1.58 
 79 Iceland                  Europe & Central Asia         2.08  -0.49      1.59 
 80 Aruba                    Latin America & Caribbean     1.84  -0.243     1.60 
 81 Serbia                   Europe & Central Asia         1.48   0.130     1.61 
 82 United States            North America                 2.06  -0.440     1.62 
 83 Brazil                   Latin America & Caribbean     2.25  -0.628     1.62 
 84 Colombia                 Latin America & Caribbean     2.57  -0.929     1.64 
 85 France                   Europe & Central Asia         1.89  -0.23      1.66 
 86 Iran                     Middle East & North Africa    2.00  -0.304     1.70 
 87 Barbados                 Latin America & Caribbean     1.93  -0.225     1.71 
 88 Romania                  Europe & Central Asia         1.31   0.4       1.71 
 89 Qatar                    Middle East & North Africa    3.07  -1.34      1.73 
 90 Moldova                  Europe & Central Asia         1.50   0.235     1.73 
 91 Montenegro               Europe & Central Asia         1.90  -0.163     1.74 
 92 Brunei                   East Asia & Pacific           2.40  -0.651     1.75 
 93 Greenland                Europe & Central Asia         2.33  -0.56      1.77 
 94 El Salvador              Latin America & Caribbean     3.08  -1.30      1.78 
 95 St. Vincent & Grenadines Latin America & Caribbean     2.31  -0.531     1.78 
 96 North Korea              East Asia & Pacific           1.94  -0.161     1.78 
 97 Bulgaria                 Europe & Central Asia         1.26   0.55      1.81 
 98 Georgia                  Europe & Central Asia         1.61   0.201     1.81 
 99 Ecuador                  Latin America & Caribbean     3.10  -1.28      1.82 
100 Bahrain                  Middle East & North Africa    2.80  -0.974     1.82 
101 Tunisia                  Middle East & North Africa    2.02  -0.193     1.83 
102 Faroe Islands            Europe & Central Asia         2.59  -0.728     1.86 
103 Gibraltar                Europe & Central Asia         2.18  -0.297     1.89 
104 Armenia                  Europe & Central Asia         1.3    0.6       1.9  
105 Palau                    East Asia & Pacific           1.82   0.0880    1.91 
106 Mexico                   Latin America & Caribbean     2.71  -0.804     1.91 
107 Vietnam                  East Asia & Pacific           2.03  -0.115     1.91 
108 Philippines              East Asia & Pacific           3.75  -1.84      1.92 
109 Sri Lanka                South Asia                    2.21  -0.239     1.97 
110 India                    South Asia                    3.35  -1.38      1.98 
111 New Caledonia            East Asia & Pacific           2.51  -0.529     1.98 
112 Peru                     Latin America & Caribbean     2.84  -0.865     1.98 
113 Virgin Islands           Latin America & Caribbean     1.84   0.141     1.98 
114 Nepal                    South Asia                    3.98  -2         1.98 
115 Belize                   Latin America & Caribbean     3.63  -1.63      2.01 
116 Seychelles               Sub-Saharan Africa            2.08  -0.0600    2.02 
117 Monaco                   Europe & Central Asia         2.08   0.0280    2.11 
118 Myanmar                  East Asia & Pacific           2.81  -0.698     2.12 
119 Panama                   Latin America & Caribbean     2.74  -0.625     2.12 
120 Indonesia                East Asia & Pacific           2.50  -0.376     2.13 
121 Bangladesh               South Asia                    3.28  -1.12      2.16 
122 South Africa             Sub-Saharan Africa            2.41  -0.197     2.22 
123 Nicaragua                Latin America & Caribbean     3.12  -0.902     2.22 
124 Morocco                  Middle East & North Africa    2.77  -0.544     2.23 
125 Lebanon                  Middle East & North Africa    2.6   -0.361     2.24 
126 Dominican Republic       Latin America & Caribbean     2.87  -0.626     2.24 
127 Suriname                 Latin America & Caribbean     3.05  -0.797     2.25 
128 Saudi Arabia             Middle East & North Africa    4.21  -1.93      2.28 
129 Fiji                     East Asia & Pacific           2.99  -0.711     2.28 
130 American Samoa           East Asia & Pacific           3.89  -1.60      2.29 
131 Guatemala                Latin America & Caribbean     4.58  -2.27      2.31 
132 Northern Mariana Islands East Asia & Pacific           1.64   0.708     2.35 
133 Libya                    Middle East & North Africa    2.91  -0.551     2.36 
134 Guyana                   Latin America & Caribbean     3.06  -0.645     2.41 
135 Laos                     East Asia & Pacific           4.43  -2.01      2.42 
136 Paraguay                 Latin America & Caribbean     3.51  -1.09      2.42 
137 Honduras                 Latin America & Caribbean     4.29  -1.79      2.5  
138 Oman                     Middle East & North Africa    3.85  -1.33      2.52 
139 Bolivia                  Latin America & Caribbean     3.99  -1.44      2.55 
140 Cambodia                 East Asia & Pacific           3.79  -1.21      2.58 
141 Djibouti                 Middle East & North Africa    4.60  -1.99      2.61 
142 Jordan                   Middle East & North Africa    3.86  -1.23      2.64 
143 Haiti                    Latin America & Caribbean     4.39  -1.73      2.66 
144 Turkmenistan             Europe & Central Asia         2.9   -0.21      2.69 
145 Lesotho                  Sub-Saharan Africa            3.60  -0.911     2.69 
146 Kyrgyzstan               Europe & Central Asia         2.4    0.300     2.7  
147 Mongolia                 East Asia & Pacific           2.2    0.5       2.7  
148 East Timor               East Asia & Pacific           5.93  -3.23      2.71 
149 Syria                    Middle East & North Africa    4.07  -1.36      2.71 
150 St. Martin               Latin America & Caribbean     2.88  -0.158     2.72 
151 Botswana                 Sub-Saharan Africa            3.29  -0.562     2.73 
152 Micronesia               East Asia & Pacific           4.22  -1.47      2.75 
153 Egypt                    Middle East & North Africa    3.50  -0.755     2.75 
154 Eswatini                 Sub-Saharan Africa            4.03  -1.28      2.75 
155 Algeria                  Middle East & North Africa    2.59   0.176     2.77 
156 Guam                     East Asia & Pacific           2.92  -0.134     2.78 
157 Israel                   Middle East & North Africa    2.95  -0.100     2.85 
158 Marshall Islands         East Asia & Pacific           4.70  -1.78      2.92 
159 Kazakhstan               Europe & Central Asia         1.90   1.11      3.01 
160 Tajikistan               Europe & Central Asia         3.49  -0.418     3.07 
161 Papua New Guinea         East Asia & Pacific           4.53  -1.43      3.10 
162 Tonga                    East Asia & Pacific           4.16  -1.03      3.13 
163 Kiribati                 East Asia & Pacific           4.07  -0.924     3.15 
164 Kenya                    Sub-Saharan Africa            5.14  -1.93      3.21 
165 Tuvalu                   East Asia & Pacific           3.84  -0.633     3.21 
166 Namibia                  Sub-Saharan Africa            3.99  -0.779     3.21 
167 Iraq                     Middle East & North Africa    4.92  -1.67      3.25 
168 West Bank & Gaza         Middle East & North Africa    5.46  -2.15      3.31 
169 Nauru                    East Asia & Pacific           3.54  -0.212     3.33 
170 Ghana                    Sub-Saharan Africa            4.83  -1.43      3.40 
171 Uzbekistan               Europe & Central Asia         2.70   0.804     3.5  
172 Solomon Islands          East Asia & Pacific           4.76  -1.2       3.56 
173 Vanuatu                  East Asia & Pacific           4.54  -0.941     3.60 
174 Pakistan                 South Asia                    5.35  -1.75      3.60 
175 São Tomé & Principe      Sub-Saharan Africa            5.16  -1.51      3.64 
176 Gabon                    Sub-Saharan Africa            4.47  -0.822     3.65 
177 Malawi                   Sub-Saharan Africa            6.00  -2.35      3.65 
178 Rwanda                   Sub-Saharan Africa            5.97  -2.27      3.70 
179 Eritrea                  Sub-Saharan Africa            5.40  -1.68      3.71 
180 Zimbabwe                 Sub-Saharan Africa            4.01  -0.285     3.72 
181 Sierra Leone             Sub-Saharan Africa            6.36  -2.56      3.79 
182 Senegal                  Sub-Saharan Africa            5.50  -1.68      3.82 
183 Samoa                    East Asia & Pacific           4.52  -0.694     3.83 
184 Guinea-Bissau            Sub-Saharan Africa            5.79  -1.95      3.84 
185 South Sudan              Sub-Saharan Africa            6.80  -2.94      3.86 
186 Comoros                  Sub-Saharan Africa            5.23  -1.35      3.88 
187 Liberia                  Sub-Saharan Africa            5.88  -1.92      3.95 
188 Madagascar               Sub-Saharan Africa            5.66  -1.69      3.97 
189 Ethiopia                 Sub-Saharan Africa            6.65  -2.66      3.99 
190 Gambia                   Sub-Saharan Africa            5.72  -1.71      4.01 
191 Equatorial Guinea        Sub-Saharan Africa            5.83  -1.75      4.08 
192 Zambia                   Sub-Saharan Africa            5.92  -1.82      4.10 
193 Congo, Rep.              Sub-Saharan Africa            4.72  -0.567     4.16 
194 Burkina Faso             Sub-Saharan Africa            6.52  -2.33      4.19 
195 Togo                     Sub-Saharan Africa            5.16  -0.973     4.19 
196 Guinea                   Sub-Saharan Africa            5.92  -1.70      4.22 
197 Ivory Coast              Sub-Saharan Africa            5.77  -1.49      4.28 
198 Uganda                   Sub-Saharan Africa            6.79  -2.50      4.28 
199 Cameroon                 Sub-Saharan Africa            5.51  -1.19      4.32 
200 Sudan                    Sub-Saharan Africa            5.60  -1.27      4.32 
201 Nigeria                  Sub-Saharan Africa            6.12  -1.64      4.48 
202 Benin                    Sub-Saharan Africa            5.94  -1.38      4.56 
203 Yemen                    Middle East & North Africa    6.32  -1.73      4.59 
204 Tanzania                 Sub-Saharan Africa            5.67  -1.06      4.61 
205 Mauritania               Sub-Saharan Africa            5.46  -0.759     4.70 
206 Mozambique               Sub-Saharan Africa            5.83  -1.07      4.76 
207 Afghanistan              South Asia                    7.57  -2.73      4.84 
208 Burundi                  Sub-Saharan Africa            6.87  -1.99      4.88 
209 Angola                   Sub-Saharan Africa            6.64  -1.52      5.12 
210 Mali                     Sub-Saharan Africa            6.89  -1.27      5.61 
211 Central African Republic Sub-Saharan Africa            5.85   0.161     6.01 
212 Congo, Dem. Rep.         Sub-Saharan Africa            6.70  -0.651     6.05 
213 Niger                    Sub-Saharan Africa            7.83  -1.77      6.06 
214 Chad                     Sub-Saharan Africa            7.25  -1.13      6.12 
215 Somalia                  Sub-Saharan Africa            7.64  -1.51      6.13 

Which countries had the biggest drop in fertility?

fertility_change %>% slice_min(change, n = 10)
# A tibble: 10 × 5
   country      region                    rate2000 change rate2022
   <chr>        <chr>                        <dbl>  <dbl>    <dbl>
 1 East Timor   East Asia & Pacific           5.93  -3.23     2.71
 2 South Sudan  Sub-Saharan Africa            6.80  -2.94     3.86
 3 Afghanistan  South Asia                    7.57  -2.73     4.84
 4 Ethiopia     Sub-Saharan Africa            6.65  -2.66     3.99
 5 Sierra Leone Sub-Saharan Africa            6.36  -2.56     3.79
 6 Uganda       Sub-Saharan Africa            6.79  -2.50     4.28
 7 Malawi       Sub-Saharan Africa            6.00  -2.35     3.65
 8 Burkina Faso Sub-Saharan Africa            6.52  -2.33     4.19
 9 Guatemala    Latin America & Caribbean     4.58  -2.27     2.31
10 Rwanda       Sub-Saharan Africa            5.97  -2.27     3.70

Did any countries have actually increased?

fertility_change %>% slice_max(change, n = 10)
# A tibble: 10 × 5
   country                  region                rate2000 change rate2022
   <chr>                    <chr>                    <dbl>  <dbl>    <dbl>
 1 Kazakhstan               Europe & Central Asia     1.90  1.11      3.01
 2 Uzbekistan               Europe & Central Asia     2.70  0.804     3.5 
 3 Northern Mariana Islands East Asia & Pacific       1.64  0.708     2.35
 4 Armenia                  Europe & Central Asia     1.3   0.6       1.9 
 5 Bulgaria                 Europe & Central Asia     1.26  0.55      1.81
 6 Mongolia                 East Asia & Pacific       2.2   0.5       2.7 
 7 Romania                  Europe & Central Asia     1.31  0.4       1.71
 8 Kyrgyzstan               Europe & Central Asia     2.4   0.300     2.7 
 9 Czechia                  Europe & Central Asia     1.15  0.3       1.45
10 Slovenia                 Europe & Central Asia     1.26  0.25      1.51

How many countries are now above vs below replacement rate? We can do this by grouping by if the rate is at least 2.1—did I mention you can use transformations inside group_by() as well?

# group by a new column called at_replacement indicating rate>=2.1
fertility %>%
  group_by(at_replacement = rate >= 2.1) %>%
  summarize(n = n()) %>%
  mutate(pct = 100 * n / sum(n))
# A tibble: 2 × 3
  at_replacement     n   pct
  <lgl>          <int> <dbl>
1 FALSE           3800  27.6
2 TRUE            9992  72.4
# we can repeat this grouping by region, this time using count()
# and just showing the percentage of countries at replacement
fertility %>%
  count(region, at_replacement = rate >= 2.1) %>%
  mutate(pct_at_rep = 100 * n / sum(n)) %>%
  filter(at_replacement) %>%
  select(-n, -at_replacement) %>%
  arrange(-pct_at_rep)
# A tibble: 7 × 2
  region                     pct_at_rep
  <chr>                           <dbl>
1 Sub-Saharan Africa             21.9  
2 Latin America & Caribbean      14.3  
3 East Asia & Pacific            13.3  
4 Europe & Central Asia          10.7  
5 Middle East & North Africa      8.48 
6 South Asia                      3.42 
7 North America                   0.290

Let’s make a few plots of the data as well. Here are 2 plots showing median fertility rate over time grouping by either region or income level:

# first get mean rate in each region + year, then pipe into line plot
# fct_reorder2() is used to reorder legend to be same order as end of lines
# see https://forcats.tidyverse.org/reference/fct_reorder.html for more details
fertility %>%
  group_by(region, year) %>%
  summarize(median_rate = median(rate)) %>%
  ggplot(aes(x = year, y = median_rate,
             linetype = fct_reorder2(region, year, median_rate),
             color = fct_reorder2(region, year, median_rate))) +
  geom_hline(yintercept = 2.1, linetype = "dotted") + geom_line(linewidth = 1) +
  labs(x = "Year", y = "Median fertility rate", linetype = "Region", color = "Region",
       title = "Median fertility by region from 1960-2022",
       subtitle = "(black dotted line at replacement rate of 2.1)") +
  scale_x_continuous(expand = c(0, 0), breaks = seq(1960, 2022, 10),
                     minor_breaks = seq(1960, 2022, 2)) +
  scale_y_continuous(expand = c(0, 0), limits = c(1.2, 7.2),
                     breaks = 2:7, minor_breaks = seq(1.2, 7.2, .2))
# first get mean rate in each income group + year, then pipe into line plot
fertility %>%
  group_by(income_group, year) %>%
  summarize(median_rate = median(rate)) %>%
  ggplot(aes(x = year, y = median_rate,
             linetype = income_group, color = income_group)) +
  geom_hline(yintercept = 2.1, linetype = "dotted") + geom_line(linewidth = 1) +
  labs(x = "Year", y = "Median fertility rate",
       linetype = "Income group", color = "Income group",
       title = "Median fertility by income group from 1960-2022",
       subtitle = "(black dotted line at replacement rate of 2.1)") +
  scale_x_continuous(expand = c(0, 0), breaks = seq(1960, 2022, 10),
                     minor_breaks = seq(1960, 2022, 2)) +
  scale_y_continuous(expand = c(0, 0), limits = c(1.4, 7.2),
                     breaks = 2:7, minor_breaks = seq(1.6, 7.2, .2))

8.1.3.1 Bonus: choropleth

Here’s a bonus plotly choropleth just for fun (you do NOT need to learn this). You can change the year and pan/zoom to see each specific country. Red color indicates countries below the replacement rate 2.1, white indicates at the replacement rate, and blue indicates above the replacement rate.

library(plotly)
plot_ly(fertility, type = "choropleth", locations = ~code, z = ~rate,
        text = ~country, frame = ~year, zmin = 0.7, zmax = 8.9, colorscale = list(
        c(0, "rgb(255,0,0)"), c(0.1707, "rgb(255,255,255)"), c(1, "rgb(0,0,255)"))) %>%
  layout(margin = list(l = 10, r = 10, b = 10, t = 35)) %>% animation_opts(frame = 100)

8.2 Merging

Let’s move on to another topic: merging data frames. Often, the information you need may be spread across several data frames from different samples or different sources, in which case you may need to merge data frames together.

There are generally 2 common ways data may need to be merged: binding rows, and joining columns.

  • Binding rows is done when two data frames share the same columns, but have different rows, and you want to combine all the rows together.
  • Joining columns is done when two data frames share the same rows, but have different columns, and you want to combine all the columns together.

These two are NOT the same, so it’s important to remember which is which.

8.2.1 Binding

Binding is the simpler of the two, so let’s start here. Returning briefly to the penguins data set, note there’s a year column which indicates when each sample was collected. This means there were 3 different studies conducted. Suppose each of the years was originally in its own data frame (I suspect this is likely to be true).

penguins2007 <- penguins %>% filter(year == 2007)
penguins2008 <- penguins %>% filter(year == 2008)
penguins2009 <- penguins %>% filter(year == 2009)
print(penguins2007, n = 5)
# A tibble: 103 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl> <chr> 
1 Adelie  Torgersen           39.1          18.7               181        3750 male  
2 Adelie  Torgersen           39.5          17.4               186        3800 female
3 Adelie  Torgersen           40.3          18                 195        3250 female
4 Adelie  Torgersen           36.7          19.3               193        3450 female
5 Adelie  Torgersen           39.3          20.6               190        3650 male  
# ℹ 98 more rows
# ℹ 1 more variable: year <dbl>
print(penguins2008, n = 5)
# A tibble: 113 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl> <chr> 
1 Adelie  Biscoe           39.6          17.7               186        3500 female
2 Adelie  Biscoe           40.1          18.9               188        4300 male  
3 Adelie  Biscoe           35            17.9               190        3450 female
4 Adelie  Biscoe           42            19.5               200        4050 male  
5 Adelie  Biscoe           34.5          18.1               187        2900 female
# ℹ 108 more rows
# ℹ 1 more variable: year <dbl>
print(penguins2009, n = 5)
# A tibble: 117 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl> <chr> 
1 Adelie  Biscoe           35            17.9               192        3725 female
2 Adelie  Biscoe           41            20                 203        4725 male  
3 Adelie  Biscoe           37.7          16                 183        3075 female
4 Adelie  Biscoe           37.8          20                 190        4250 male  
5 Adelie  Biscoe           37.9          18.6               193        2925 female
# ℹ 112 more rows
# ℹ 1 more variable: year <dbl>

We can see our 3 data frames penguins2007, penguins2008, penguins2009 share the exact same columns (same number of columns, same column names, same column types) but have different rows. Each data frame can be thought of as containing a fraction of the overall samples or subjects. In this case, we must row-bind the 3 data frames together.

The function bind_rows(df1, df2, ...) lets us do this. Just pass each data frame that needs binding in:

penguins_bound <- bind_rows(penguins2007, penguins2008, penguins2009)
penguins_bound
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex  
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl> <chr>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male 
 2 Adelie  Torgersen           39.5          17.4               186        3800 fema…
 3 Adelie  Torgersen           40.3          18                 195        3250 fema…
 4 Adelie  Torgersen           36.7          19.3               193        3450 fema…
 5 Adelie  Torgersen           39.3          20.6               190        3650 male 
 6 Adelie  Torgersen           38.9          17.8               181        3625 fema…
 7 Adelie  Torgersen           39.2          19.6               195        4675 male 
 8 Adelie  Torgersen           41.1          17.6               182        3200 fema…
 9 Adelie  Torgersen           38.6          21.2               191        3800 male 
10 Adelie  Torgersen           34.6          21.1               198        4400 male 
# ℹ 323 more rows
# ℹ 1 more variable: year <dbl>

It may be hard to see, since only the first 10 rows are printed here, but if you look at the row numbers, you can see we now have 103+113+117=333 rows after binding.

For best results, ensure input data frames have the exact same columns, with the same names and types!

8.2.2 Joining

Joining is slightly more complicated. This is for when all the rows are together, but the columns you need are spread out over several data frames. Often, this occurs when you obtain data from different sources and want to combine them to look for patterns, but it can also happen sometimes with multiple data frames gathered from the same source.

Joining is performed by first matching rows using a set of key variables before adding in the additional columns. The key variables should uniquely identify rows. Here’s a simple demo to start. Suppose you have the following data frames:

x <- tibble(
  A = c("a", "b", "c"),
  B = c(1, 2, 3)
)
y <- tibble(
  A = c("a", "b", "d"),
  C = ymd("2024.1.1") + 0:2
)
x
# A tibble: 3 × 2
  A         B
  <chr> <dbl>
1 a         1
2 b         2
3 c         3
y
# A tibble: 3 × 2
  A     C         
  <chr> <date>    
1 a     2024-01-01
2 b     2024-01-02
3 d     2024-01-03

Suppose you want to join x and y using column A as the “key”, i.e. “A” is the column that uniquely matches rows across data frames. There are 4 different joins you can use: inner_join(), full_join(), left_join(), right_join() , depending on which rows you want in the result:

  • inner_join(x, y) matches rows using key variable(s), then returns only rows in both x and y,
  • full_join(x, y) matches rows using key variable(s), then returns all rows in either x or y,
  • left_join(x, y) matches rows using key variable(s), then returns only rows in x,
  • right_join(x, y) matches rows using key variable(s), then returns only rows in y.

Below are demos of each join. Note each join has the exact same columns; the only difference is which rows are included in the output. Also note NAs are automatically used to fill in any missing values that arise from the merge.

# only return rows in both x and y
inner_join(x, y)
# A tibble: 2 × 3
  A         B C         
  <chr> <dbl> <date>    
1 a         1 2024-01-01
2 b         2 2024-01-02
# return rows in either x or y
full_join(x, y)
# A tibble: 4 × 3
  A         B C         
  <chr> <dbl> <date>    
1 a         1 2024-01-01
2 b         2 2024-01-02
3 c         3 NA        
4 d        NA 2024-01-03
# return only rows in x
left_join(x, y)
# A tibble: 3 × 3
  A         B C         
  <chr> <dbl> <date>    
1 a         1 2024-01-01
2 b         2 2024-01-02
3 c         3 NA        
# return only rows in y
right_join(x, y)
# A tibble: 3 × 3
  A         B C         
  <chr> <dbl> <date>    
1 a         1 2024-01-01
2 b         2 2024-01-02
3 d        NA 2024-01-03

The above 4 are called the “mutating joins” since they mutate (i.e. change) the data frame by adding additional columns. There are 2 other less commonly used joins called the “filterting joins” called semi_join() and anti_join() that do not actually add columns; they are only used to filter rows that match or don’t match another data frame. Examples:

# semi_join(x, y) will FILTER rows in x that have matches in y
# note there are NO new columns added to x
semi_join(x, y)
# A tibble: 2 × 2
  A         B
  <chr> <dbl>
1 a         1
2 b         2
# conversely, anti_join() will FILTER rows in x with NO match in y
# again, note NO new columns were added to x
anti_join(x, y)
# A tibble: 1 × 2
  A         B
  <chr> <dbl>
1 c         3

These examples only have 1 key column, but the same works if you have 2 or more key columns; rows are matched if they have ALL the same values in the key columns.

When joining, R will automatically look for columns with the same name to use as key columns. Therefore, to run smoothly, it’s highly recommended to ensure the following before joining:

  1. Key columns should have the same name and type.
  2. All other non-key columns should have different names.
  3. When picking a join, carefully consider which rows you will actually need later.
  4. Generally, key values should be unique for each row. If you need to have duplicate key values, watch the joins closely to ensure the join works the way you want.

Let’s see a real example of joins in action.

8.2.3 Example: cleaning fertility

The World Bank fertility data set we used earlier obviously didn’t start out perfectly clean and tidy. When you download the CSV file from the data page it comes as a zip of the following two (renamed) files: fertility_meta.csv and fertility_raw.csv. One of them contains metadata for each country, while the other contains the actual annual fertility rates.

fertility_meta <- read_csv(
  "https://bwu62.github.io/stat240-revamp/data/fertility_meta.csv")
fertility_raw <- read_csv(
  "https://bwu62.github.io/stat240-revamp/data/fertility_raw.csv")
fertility_meta
# A tibble: 265 × 5
   `Country Code` Region                     IncomeGroup       SpecialNotes TableName
   <chr>          <chr>                      <chr>             <chr>        <chr>    
 1 ABW            Latin America & Caribbean  High income        <NA>        Aruba    
 2 AFE            <NA>                       <NA>              "26 countri… Africa E…
 3 AFG            South Asia                 Low income        "The report… Afghanis…
 4 AFW            <NA>                       <NA>              "22 countri… Africa W…
 5 AGO            Sub-Saharan Africa         Lower middle inc… "The World … Angola   
 6 ALB            Europe & Central Asia      Upper middle inc…  <NA>        Albania  
 7 AND            Europe & Central Asia      High income        <NA>        Andorra  
 8 ARB            <NA>                       <NA>              "Arab World… Arab Wor…
 9 ARE            Middle East & North Africa High income        <NA>        United A…
10 ARG            Latin America & Caribbean  Upper middle inc… "The World … Argentina
# ℹ 255 more rows
fertility_raw
# A tibble: 266 × 68
   `Country Name`      `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
   <chr>               <chr>          <chr>            <chr>             <dbl>  <dbl>
 1 Aruba               ABW            Fertility rate,… SP.DYN.TFRT.IN     4.57   4.42
 2 Africa Eastern and… AFE            Fertility rate,… SP.DYN.TFRT.IN     6.65   6.67
 3 Afghanistan         AFG            Fertility rate,… SP.DYN.TFRT.IN     7.28   7.28
 4 Africa Western and… AFW            Fertility rate,… SP.DYN.TFRT.IN     6.47   6.48
 5 Angola              AGO            Fertility rate,… SP.DYN.TFRT.IN     6.71   6.79
 6 Albania             ALB            Fertility rate,… SP.DYN.TFRT.IN     6.38   6.27
 7 Andorra             AND            Fertility rate,… SP.DYN.TFRT.IN     2.54   2.54
 8 Arab World          ARB            Fertility rate,… SP.DYN.TFRT.IN     6.92   6.97
 9 United Arab Emirat… ARE            Fertility rate,… SP.DYN.TFRT.IN     6.50   6.49
10 Argentina           ARG            Fertility rate,… SP.DYN.TFRT.IN     3.14   3.12
# ℹ 256 more rows
# ℹ 62 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
#   `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, …

First, there are several columns like "SpecialNotes", "Indicator Name", and "Indicator Code" that we don’t need; we can remove those with select(). Country names are duplicated in both data frames, so we can pick one to use. Let’s pick "TableName" since the names are slightly nicer (check this yourself). The rest of the columns should probably be renamed with rename().

We can also change the "Low income", "Lower middle income", … values into just "Low", "Lower middle", … to slightly simplify the column. Shorter labels will also be easier to work with in filter() or in plots.

Note as well the existence of several NAs in region and income. Again, verify this yourself, but these are all further subregion groupings of countries (e.g. AFE and AFW are for Eastern/Southern and Western/Central African countries), so we can simply drop these rows.

# apply the processing steps above
fertility_meta2 <- fertility_meta %>%
  rename(code = "Country Code", country = "TableName", region = "Region", income_group = "IncomeGroup") %>%
  select(code, country, region, income_group) %>%
  mutate(income_group = sub(" income", "", income_group)) %>%
  filter(!is.na(income_group))
fertility_raw2 <- fertility_raw %>%
  select(2, "1960":last_col()) %>%
  rename(code = "Country Code")
print(fertility_meta2, n = 5)
# A tibble: 216 × 4
  code  country     region                    income_group
  <chr> <chr>       <chr>                     <chr>       
1 ABW   Aruba       Latin America & Caribbean High        
2 AFG   Afghanistan South Asia                Low         
3 AGO   Angola      Sub-Saharan Africa        Lower middle
4 ALB   Albania     Europe & Central Asia     Upper middle
5 AND   Andorra     Europe & Central Asia     High        
# ℹ 211 more rows
print(fertility_raw2, n = 5)
# A tibble: 266 × 65
  code  `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969` `1970`
  <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 ABW     4.57   4.42   4.26   4.11   3.94   3.80   3.62   3.45   3.28   3.11   2.97
2 AFE     6.65   6.67   6.69   6.71   6.72   6.74   6.77   6.78   6.78   6.78   6.79
3 AFG     7.28   7.28   7.29   7.30   7.30   7.30   7.32   7.34   7.36   7.39   7.4 
4 AFW     6.47   6.48   6.49   6.50   6.52   6.53   6.56   6.58   6.61   6.63   6.66
5 AGO     6.71   6.79   6.87   6.95   7.04   7.12   7.19   7.27   7.33   7.39   7.43
# ℹ 261 more rows
# ℹ 53 more variables: `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
#   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
#   `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, …

Now comes the key step! We’re going to use the "code" columns (which are the 3-letter ISO 3166-1 country codes) as the key columns and join the two data frames together. Let’s use inner_join() to keep only countries that appear in both, since we want to look at region and income group together with fertility rates.

Note these data frames already satisfy the recommended conditions, i.e. the key columns have the same name and type and uniquely identify each country, and all other columns have different names, so there should be no issues.

fertility_joined <- inner_join(fertility_meta2, fertility_raw2)
fertility_joined
# A tibble: 216 × 68
   code  country region income_group `1960` `1961` `1962` `1963` `1964` `1965` `1966`
   <chr> <chr>   <chr>  <chr>         <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 ABW   Aruba   Latin… High           4.57   4.42   4.26   4.11   3.94   3.80   3.62
 2 AFG   Afghan… South… Low            7.28   7.28   7.29   7.30   7.30   7.30   7.32
 3 AGO   Angola  Sub-S… Lower middle   6.71   6.79   6.87   6.95   7.04   7.12   7.19
 4 ALB   Albania Europ… Upper middle   6.38   6.27   6.11   5.93   5.71   5.47   5.32
 5 AND   Andorra Europ… High           2.54   2.54   2.55   2.60   2.69   2.72   2.76
 6 ARE   United… Middl… High           6.50   6.49   6.49   6.50   6.49   6.50   6.50
 7 ARG   Argent… Latin… Upper middle   3.14   3.12   3.13   3.11   3.08   3.00   2.97
 8 ARM   Armenia Europ… Upper middle   4.79   4.67   4.52   4.34   4.15   3.99   3.83
 9 ASM   Americ… East … High           6.58   6.58   6.60   6.69   6.78   6.73   6.62
10 ATG   Antigu… Latin… High           4.60   4.56   4.55   4.54   4.48   4.42   4.32
# ℹ 206 more rows
# ℹ 57 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …

This already looks pretty good! Our data frames joined nicely along the key "code" column, and now our region, income, and fertility data is in 1 single data frame instead of spread across 2 data frames.

However, you may have noticed we still have a problem: our annual fertility rates are spread out across 64 columns, with each year in its own column. This is not compatible with our tidyverse functions. How do we fix it?

8.3 Pivoting

This is where pivoting becomes relevant. Pivoting (also sometimes called reshaping) refers to the process of transforming data between different representations of it. Let’s start with a small example to better illustrate this point.

Suppose two people Alice and Bob are playing a game. They agree to play 3 rounds. Each round, each person tries to score points up to a maximum of 10. The person that wins the most rounds wins overall. Suppose these are the final scores:

# create demo scores data frame
# sample() is used here to randomly generate points
scores_long <- tibble(
  round  = rep(1:3, each = 2),
  player = rep(c("Alice", "Bob"), 3),
  points = c(9, 4, 7, 1, 2, 7)
)
scores_long
# A tibble: 6 × 3
  round player points
  <int> <chr>   <dbl>
1     1 Alice       9
2     1 Bob         4
3     2 Alice       7
4     2 Bob         1
5     3 Alice       2
6     3 Bob         7

However, the same data can also be represented in this alternative format:

scores_wide <- tibble(
  round = 1:3,
  Alice = c(9, 7, 2),
  Bob   = c(4, 1, 7)
)
scores_wide
# A tibble: 3 × 3
  round Alice   Bob
  <int> <dbl> <dbl>
1     1     9     4
2     2     7     1
3     3     2     7

Despite the difference in shape and dimensions, the data contained in these data frames is identical. These are often called the long and wide representations of data, so named because one tends to be longer (i.e. more rows) and the other wider (i.e. more columns). Generally, long formats are easier to use with R/tidyverse, whereas wide formats are easier for humans to read, but there are of course exceptions to everything.

Usually, for data to be considered “tidy” it needs to be in a longer format. Tidy data is data that satisfies the following:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.33

Data in this format will generally be easier to work with using tidyverse, and broadly speaking R, since this format is highly versatile and most functions have been designed to work best on this format.

Hopefully, it should be easy to see that scores_long above satisfies these properties. Each variable is indeed a column (and vice versa), each row an observation (vv.), etc. It should also be evident scores_wide is NOT tidy, since each row actually contains 2 scores, and each column is a person, not a variable (the distinction here is subtle but meaningful).

Tidyr’s pivot_longer() and pivot_wider() functions let you easily convert between these two representations. When used correctly, these two should be inverses of each other, i.e. you can take a data frame and pivot it longer, then pivot it wider (or vice versa) and get back the original data frame. There are many ways of using these functions (see help page) but here’s a brief guide to their most common and important usage:

  • df %>% pivot_longer(cols, names_to = "..", values_to = "..") turns a “wider” representation into a “longer” representation.
    • cols is a set of columns where the actual observations are “spread out over”. You can set this using numbers, names, ranges, or selector functions (just like for select()). In the scores_wide data frame, this would be the two columns Alice and Bob since this is where the points for each round are stored.
    • names_to sets the name for the new column containing the names of the cols. In the scores_wide data frame, it will be filled with repetitions of "Alice" and "Bob" corresponding to each round’s points.
    • values_to sets the name for the new column containing the values inside cols. In the scores_wide data frame, it will be filled with the actual points scored in each round.
    # example:
    scores_wide %>%
      pivot_longer(Alice:Bob, names_to = "player", values_to = "points")
    # A tibble: 6 × 3
      round player points
      <int> <chr>   <dbl>
    1     1 Alice       9
    2     1 Bob         4
    3     2 Alice       7
    4     2 Bob         1
    5     3 Alice       2
    6     3 Bob         7
  • df %>% pivot_wider(names_from = "..", values_from = "..") turns a “longer” represetation into a “wider” representation.
    • names_from sets the column of names that will each become their own column in the wider data frame. In the scores_long data frame, this would be the player column.
    • values_from sets the column of actual observations that will fill each of the newly created wider set of columns. In the scores_long data frame, this would be the points column.
    # example:
    scores_long %>%
      pivot_wider(names_from = "player", values_from = "points")
    # A tibble: 3 × 3
      round Alice   Bob
      <int> <dbl> <dbl>
    1     1     9     4
    2     2     7     1
    3     3     2     7

Note in the above examples, passing scores_wide to pivot_longer() gives us scores_long exactly, and passing scores_long to pivot_wider() gives us scores_wide exactly.

8.3.1 Example: finish cleaning fertility

Let’s turn back to the fertility_joined example. After joining, we were left with the following data frame:

fertility_joined
# A tibble: 216 × 68
   code  country region income_group `1960` `1961` `1962` `1963` `1964` `1965` `1966`
   <chr> <chr>   <chr>  <chr>         <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 ABW   Aruba   Latin… High           4.57   4.42   4.26   4.11   3.94   3.80   3.62
 2 AFG   Afghan… South… Low            7.28   7.28   7.29   7.30   7.30   7.30   7.32
 3 AGO   Angola  Sub-S… Lower middle   6.71   6.79   6.87   6.95   7.04   7.12   7.19
 4 ALB   Albania Europ… Upper middle   6.38   6.27   6.11   5.93   5.71   5.47   5.32
 5 AND   Andorra Europ… High           2.54   2.54   2.55   2.60   2.69   2.72   2.76
 6 ARE   United… Middl… High           6.50   6.49   6.49   6.50   6.49   6.50   6.50
 7 ARG   Argent… Latin… Upper middle   3.14   3.12   3.13   3.11   3.08   3.00   2.97
 8 ARM   Armenia Europ… Upper middle   4.79   4.67   4.52   4.34   4.15   3.99   3.83
 9 ASM   Americ… East … High           6.58   6.58   6.60   6.69   6.78   6.73   6.62
10 ATG   Antigu… Latin… High           4.60   4.56   4.55   4.54   4.48   4.42   4.32
# ℹ 206 more rows
# ℹ 57 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …

Now it should be clear all we need to do is apply pivot_longer():

# let's pivot_longer(), then drop any rows with missing values
# after our careful processing, it should be safe to just do drop_na()
# we can also again convert income_group to ordered categories if needed,
# and finally arrange by country and year for extra neatness
fertility <- fertility_joined %>%
  pivot_longer("1960":last_col(), names_to = "year", values_to = "rate") %>%
  drop_na() %>%
  mutate(income_group = factor(income_group, ordered = TRUE, levels = c(
    "Low", "Lower middle", "Upper middle", "High"))) %>%
  arrange(country, year)
# print our final, clean & tidy data frame!
fertility
# A tibble: 13,792 × 6
   code  country     region     income_group year   rate
   <chr> <chr>       <chr>      <ord>        <chr> <dbl>
 1 AFG   Afghanistan South Asia Low          1960   7.28
 2 AFG   Afghanistan South Asia Low          1961   7.28
 3 AFG   Afghanistan South Asia Low          1962   7.29
 4 AFG   Afghanistan South Asia Low          1963   7.30
 5 AFG   Afghanistan South Asia Low          1964   7.30
 6 AFG   Afghanistan South Asia Low          1965   7.30
 7 AFG   Afghanistan South Asia Low          1966   7.32
 8 AFG   Afghanistan South Asia Low          1967   7.34
 9 AFG   Afghanistan South Asia Low          1968   7.36
10 AFG   Afghanistan South Asia Low          1969   7.39
# ℹ 13,782 more rows

Our fertility data set is now 100% fully cleaned and ready for exploration, visualization, and modeling!

Before we move on, let’s demonstrate one more usage for pivot_wider(). This function is also great for making human-friendly summary tables, especially when looking at how 3 variables interrelate. For example, suppose we want to look at, for each region, the percentage of countries in each income group. If we just compute the percentages and print, we get something like this:

# count how many countries in each region + income group,
# regroup by region to compute percentages,
# remove the n count column, and print
region_income_pct <- fertility %>%
  count(region, income_group) %>%
  group_by(region) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  select(-n) %>%
  print()
# A tibble: 22 × 3
# Groups:   region [7]
   region                    income_group   pct
   <chr>                     <ord>        <dbl>
 1 East Asia & Pacific       Low            2.7
 2 East Asia & Pacific       Lower middle  32.4
 3 East Asia & Pacific       Upper middle  24.3
 4 East Asia & Pacific       High          40.5
 5 Europe & Central Asia     Lower middle   5.2
 6 Europe & Central Asia     Upper middle  25.9
 7 Europe & Central Asia     High          68.9
 8 Latin America & Caribbean Lower middle   9.8
 9 Latin America & Caribbean Upper middle  46.3
10 Latin America & Caribbean High          43.9
# ℹ 12 more rows

This would be great for plotting if that’s what we wanted to do, e.g:

# plot income group percentages by region
# str_wrap used to line-break long region names
region_income_pct %>% 
  ggplot(aes(x = str_wrap(region, 12), y = pct, fill = income_group)) +
  geom_col() + labs(x = NULL, y = "Percent", fill = "Income level",
                    title = "% countries in each income level by region")

However, we can also see the numeric values laid out as a table by using pivot_wider() to put each income_group in its own column:

# pivot region/income percentages to wider table format
region_income_pct %>%
  pivot_wider(names_from = income_group, values_from = pct)
# A tibble: 7 × 5
# Groups:   region [7]
  region                       Low `Lower middle` `Upper middle`  High
  <chr>                      <dbl>          <dbl>          <dbl> <dbl>
1 East Asia & Pacific          2.7           32.4           24.3  40.5
2 Europe & Central Asia       NA              5.2           25.9  68.9
3 Latin America & Caribbean   NA              9.8           46.3  43.9
4 Middle East & North Africa   9.7           31.8           19.5  39  
5 North America               NA             NA             NA   100  
6 South Asia                  12.5           75             12.5  NA  
7 Sub-Saharan Africa          45.8           39.6           12.5   2.1

The NAs are due to a lack of countries in that region/income combination. In this case, we can fill in missing values by setting the values_fill argument:

# fill in missing values with 0 in region/income table
region_income_pct %>%
  pivot_wider(names_from = income_group, values_from = pct, values_fill = 0)
# A tibble: 7 × 5
# Groups:   region [7]
  region                       Low `Lower middle` `Upper middle`  High
  <chr>                      <dbl>          <dbl>          <dbl> <dbl>
1 East Asia & Pacific          2.7           32.4           24.3  40.5
2 Europe & Central Asia        0              5.2           25.9  68.9
3 Latin America & Caribbean    0              9.8           46.3  43.9
4 Middle East & North Africa   9.7           31.8           19.5  39  
5 North America                0              0              0   100  
6 South Asia                  12.5           75             12.5   0  
7 Sub-Saharan Africa          45.8           39.6           12.5   2.1