8 Advanced Operations
In this chapter, we will cover a few more advanced, yet incredibly useful data tidying operations like grouping, joining, binding, and pivoting. Along the way, we will also make extensive use of dplyr functions learned in the previous chapter.
8.1 Grouping
Often, you will need to apply dplyr’s various operations like mutate()
, summarize()
, or slicing function not across the entire dataset but in groups. This is an important technique across data science, whether it’s data cleaning to exploration to visualization to modeling.
By default, data frames are not grouped when created or imported. You can create a grouping structure with the group_by()
function. The basic syntax is df %>% group_by(col1, col2, ...)
where col1
, col2
, … are variables whose values are used to determine groups. You can group by just 1 variable, 2 variables, or as many variables as needed. Rows with the same values in the chosen columns will be grouped together.
After grouping, operations that normally run across all rows now run across each group. Here’s a few simple examples using the familiar penguins dataset to start:
# import tidyverse, import magrittr (for extra %T>% pipe),
# tweak some readr/ggplot options (optional), and load dataset
library(tidyverse)
library(magrittr)
options(readr.show_col_types = FALSE)
source("https://bwu62.github.io/stat240-revamp/ggplot_theme_options.R")
penguins <- read_csv("https://bwu62.github.io/stat240-revamp/data/penguins.csv")
# group by species and get mean/median/sd body mass + sample size of each group
penguins %>%
group_by(species) %>%
summarize(
mean_mass = mean(body_mass_g),
median_mass = median(body_mass_g),
sd_mass = sd(body_mass_g),
n = n()
)
# A tibble: 3 × 5
species mean_mass median_mass sd_mass n
<chr> <dbl> <dbl> <dbl> <int>
1 Adelie 3706. 3700 459. 146
2 Chinstrap 3733. 3700 384. 68
3 Gentoo 5092. 5050 501. 119
# we can also group by multiple, e.g. group by spcies + sex
penguins %>%
group_by(species, sex) %>%
summarize(
mean_mass = mean(body_mass_g),
median_mass = median(body_mass_g),
sd_mass = sd(body_mass_g),
n = n()
)
# A tibble: 6 × 6
# Groups: species [3]
species sex mean_mass median_mass sd_mass n
<chr> <chr> <dbl> <dbl> <dbl> <int>
1 Adelie female 3369. 3400 269. 73
2 Adelie male 4043. 4000 347. 73
3 Chinstrap female 3527. 3550 285. 34
4 Chinstrap male 3939. 3950 362. 34
5 Gentoo female 4680. 4700 282. 58
6 Gentoo male 5485. 5500 313. 61
# we can of course also mutate within groups
# e.g. convert bill length in mm to number of SDs
# away from species group mean
# to better show the result, I'm forcing it to print all rows in order
# but hiding the output in a collapsible box for style
penguins %>%
select(species, sex, bill_length_mm) %>%
group_by(species) %>%
mutate(
n = n(),
bill_length_std = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm)
) %>%
arrange(species, bill_length_std) %>%
print(n = Inf)
# A tibble: 333 × 5
# Groups: species [3]
species sex bill_length_mm n bill_length_std
<chr> <chr> <dbl> <int> <dbl>
1 Adelie female 32.1 146 -2.53
2 Adelie female 33.1 146 -2.15
3 Adelie female 33.5 146 -2.00
4 Adelie female 34 146 -1.81
5 Adelie female 34.4 146 -1.66
6 Adelie female 34.5 146 -1.62
7 Adelie male 34.6 146 -1.59
8 Adelie female 34.6 146 -1.59
9 Adelie female 35 146 -1.44
10 Adelie female 35 146 -1.44
11 Adelie male 35.1 146 -1.40
12 Adelie female 35.2 146 -1.36
13 Adelie female 35.3 146 -1.32
14 Adelie female 35.5 146 -1.25
15 Adelie female 35.5 146 -1.25
16 Adelie female 35.6 146 -1.21
17 Adelie female 35.7 146 -1.17
18 Adelie female 35.7 146 -1.17
19 Adelie female 35.7 146 -1.17
20 Adelie female 35.9 146 -1.10
21 Adelie female 35.9 146 -1.10
22 Adelie female 36 146 -1.06
23 Adelie female 36 146 -1.06
24 Adelie female 36 146 -1.06
25 Adelie female 36 146 -1.06
26 Adelie female 36.2 146 -0.985
27 Adelie female 36.2 146 -0.985
28 Adelie female 36.2 146 -0.985
29 Adelie male 36.3 146 -0.948
30 Adelie female 36.4 146 -0.910
31 Adelie female 36.4 146 -0.910
32 Adelie female 36.5 146 -0.873
33 Adelie female 36.5 146 -0.873
34 Adelie female 36.6 146 -0.835
35 Adelie female 36.6 146 -0.835
36 Adelie female 36.7 146 -0.798
37 Adelie female 36.7 146 -0.798
38 Adelie female 36.8 146 -0.760
39 Adelie female 36.9 146 -0.723
40 Adelie female 37 146 -0.685
41 Adelie female 37 146 -0.685
42 Adelie male 37.2 146 -0.610
43 Adelie male 37.2 146 -0.610
44 Adelie female 37.3 146 -0.572
45 Adelie male 37.3 146 -0.572
46 Adelie female 37.3 146 -0.572
47 Adelie male 37.5 146 -0.497
48 Adelie female 37.6 146 -0.460
49 Adelie male 37.6 146 -0.460
50 Adelie female 37.6 146 -0.460
51 Adelie male 37.7 146 -0.422
52 Adelie female 37.7 146 -0.422
53 Adelie male 37.7 146 -0.422
54 Adelie female 37.8 146 -0.385
55 Adelie male 37.8 146 -0.385
56 Adelie male 37.8 146 -0.385
57 Adelie female 37.9 146 -0.347
58 Adelie female 37.9 146 -0.347
59 Adelie female 38.1 146 -0.272
60 Adelie female 38.1 146 -0.272
61 Adelie female 38.1 146 -0.272
62 Adelie female 38.1 146 -0.272
63 Adelie male 38.2 146 -0.234
64 Adelie male 38.2 146 -0.234
65 Adelie male 38.3 146 -0.197
66 Adelie female 38.5 146 -0.122
67 Adelie male 38.6 146 -0.0841
68 Adelie female 38.6 146 -0.0841
69 Adelie female 38.6 146 -0.0841
70 Adelie female 38.7 146 -0.0466
71 Adelie male 38.8 146 -0.00900
72 Adelie male 38.8 146 -0.00900
73 Adelie female 38.8 146 -0.00900
74 Adelie female 38.9 146 0.0286
75 Adelie female 38.9 146 0.0286
76 Adelie female 39 146 0.0661
77 Adelie female 39 146 0.0661
78 Adelie male 39 146 0.0661
79 Adelie male 39.1 146 0.104
80 Adelie male 39.2 146 0.141
81 Adelie male 39.2 146 0.141
82 Adelie male 39.2 146 0.141
83 Adelie male 39.3 146 0.179
84 Adelie female 39.5 146 0.254
85 Adelie female 39.5 146 0.254
86 Adelie female 39.5 146 0.254
87 Adelie male 39.6 146 0.291
88 Adelie female 39.6 146 0.291
89 Adelie female 39.6 146 0.291
90 Adelie male 39.6 146 0.291
91 Adelie female 39.6 146 0.291
92 Adelie male 39.7 146 0.329
93 Adelie male 39.7 146 0.329
94 Adelie female 39.7 146 0.329
95 Adelie male 39.7 146 0.329
96 Adelie male 39.8 146 0.367
97 Adelie male 40.1 146 0.479
98 Adelie female 40.2 146 0.517
99 Adelie male 40.2 146 0.517
100 Adelie female 40.2 146 0.517
101 Adelie female 40.3 146 0.554
102 Adelie male 40.3 146 0.554
103 Adelie female 40.5 146 0.629
104 Adelie male 40.5 146 0.629
105 Adelie male 40.6 146 0.667
106 Adelie male 40.6 146 0.667
107 Adelie male 40.6 146 0.667
108 Adelie male 40.6 146 0.667
109 Adelie male 40.7 146 0.705
110 Adelie male 40.8 146 0.742
111 Adelie male 40.8 146 0.742
112 Adelie male 40.9 146 0.780
113 Adelie female 40.9 146 0.780
114 Adelie male 41 146 0.817
115 Adelie female 41.1 146 0.855
116 Adelie male 41.1 146 0.855
117 Adelie male 41.1 146 0.855
118 Adelie male 41.1 146 0.855
119 Adelie male 41.1 146 0.855
120 Adelie male 41.1 146 0.855
121 Adelie male 41.1 146 0.855
122 Adelie male 41.3 146 0.930
123 Adelie male 41.3 146 0.930
124 Adelie male 41.4 146 0.967
125 Adelie male 41.4 146 0.967
126 Adelie male 41.5 146 1.01
127 Adelie male 41.5 146 1.01
128 Adelie male 41.6 146 1.04
129 Adelie male 41.8 146 1.12
130 Adelie male 42 146 1.19
131 Adelie male 42.1 146 1.23
132 Adelie female 42.2 146 1.27
133 Adelie male 42.2 146 1.27
134 Adelie male 42.3 146 1.31
135 Adelie male 42.5 146 1.38
136 Adelie male 42.7 146 1.46
137 Adelie male 42.8 146 1.49
138 Adelie male 42.9 146 1.53
139 Adelie male 43.1 146 1.61
140 Adelie male 43.2 146 1.64
141 Adelie male 43.2 146 1.64
142 Adelie male 44.1 146 1.98
143 Adelie male 44.1 146 1.98
144 Adelie male 45.6 146 2.54
145 Adelie male 45.8 146 2.62
146 Adelie male 46 146 2.70
147 Chinstrap female 40.9 68 -2.38
148 Chinstrap female 42.4 68 -1.93
149 Chinstrap female 42.5 68 -1.90
150 Chinstrap female 42.5 68 -1.90
151 Chinstrap female 43.2 68 -1.69
152 Chinstrap female 43.5 68 -1.60
153 Chinstrap female 45.2 68 -1.09
154 Chinstrap female 45.2 68 -1.09
155 Chinstrap female 45.4 68 -1.03
156 Chinstrap female 45.5 68 -0.998
157 Chinstrap female 45.6 68 -0.968
158 Chinstrap female 45.7 68 -0.938
159 Chinstrap female 45.7 68 -0.938
160 Chinstrap female 45.9 68 -0.879
161 Chinstrap female 46 68 -0.849
162 Chinstrap female 46.1 68 -0.819
163 Chinstrap female 46.2 68 -0.789
164 Chinstrap female 46.4 68 -0.729
165 Chinstrap female 46.4 68 -0.729
166 Chinstrap female 46.5 68 -0.699
167 Chinstrap female 46.6 68 -0.669
168 Chinstrap female 46.7 68 -0.639
169 Chinstrap female 46.8 68 -0.609
170 Chinstrap female 46.9 68 -0.579
171 Chinstrap female 47 68 -0.549
172 Chinstrap female 47.5 68 -0.399
173 Chinstrap female 47.6 68 -0.369
174 Chinstrap female 48.1 68 -0.220
175 Chinstrap male 48.5 68 -0.100
176 Chinstrap male 49 68 0.0498
177 Chinstrap male 49 68 0.0498
178 Chinstrap male 49.2 68 0.110
179 Chinstrap male 49.3 68 0.140
180 Chinstrap male 49.5 68 0.199
181 Chinstrap male 49.6 68 0.229
182 Chinstrap male 49.7 68 0.259
183 Chinstrap female 49.8 68 0.289
184 Chinstrap male 50 68 0.349
185 Chinstrap female 50.1 68 0.379
186 Chinstrap male 50.2 68 0.409
187 Chinstrap female 50.2 68 0.409
188 Chinstrap male 50.3 68 0.439
189 Chinstrap male 50.5 68 0.499
190 Chinstrap female 50.5 68 0.499
191 Chinstrap male 50.6 68 0.529
192 Chinstrap male 50.7 68 0.559
193 Chinstrap male 50.8 68 0.589
194 Chinstrap male 50.8 68 0.589
195 Chinstrap male 50.9 68 0.619
196 Chinstrap female 50.9 68 0.619
197 Chinstrap male 51 68 0.649
198 Chinstrap male 51.3 68 0.739
199 Chinstrap male 51.3 68 0.739
200 Chinstrap male 51.3 68 0.739
201 Chinstrap male 51.4 68 0.768
202 Chinstrap male 51.5 68 0.798
203 Chinstrap male 51.7 68 0.858
204 Chinstrap male 51.9 68 0.918
205 Chinstrap male 52 68 0.948
206 Chinstrap male 52 68 0.948
207 Chinstrap male 52 68 0.948
208 Chinstrap male 52.2 68 1.01
209 Chinstrap male 52.7 68 1.16
210 Chinstrap male 52.8 68 1.19
211 Chinstrap male 53.5 68 1.40
212 Chinstrap male 54.2 68 1.61
213 Chinstrap male 55.8 68 2.09
214 Chinstrap female 58 68 2.74
215 Gentoo female 40.9 119 -2.15
216 Gentoo female 41.7 119 -1.89
217 Gentoo female 42 119 -1.79
218 Gentoo female 42.6 119 -1.60
219 Gentoo female 42.7 119 -1.57
220 Gentoo female 42.8 119 -1.54
221 Gentoo female 42.9 119 -1.50
222 Gentoo female 43.2 119 -1.41
223 Gentoo female 43.3 119 -1.37
224 Gentoo female 43.3 119 -1.37
225 Gentoo female 43.4 119 -1.34
226 Gentoo female 43.5 119 -1.31
227 Gentoo female 43.5 119 -1.31
228 Gentoo female 43.6 119 -1.28
229 Gentoo female 43.8 119 -1.21
230 Gentoo female 44 119 -1.15
231 Gentoo male 44.4 119 -1.02
232 Gentoo female 44.5 119 -0.988
233 Gentoo female 44.9 119 -0.859
234 Gentoo female 44.9 119 -0.859
235 Gentoo male 45 119 -0.827
236 Gentoo female 45.1 119 -0.795
237 Gentoo female 45.1 119 -0.795
238 Gentoo female 45.1 119 -0.795
239 Gentoo male 45.2 119 -0.762
240 Gentoo female 45.2 119 -0.762
241 Gentoo male 45.2 119 -0.762
242 Gentoo female 45.2 119 -0.762
243 Gentoo female 45.3 119 -0.730
244 Gentoo female 45.3 119 -0.730
245 Gentoo female 45.4 119 -0.698
246 Gentoo female 45.5 119 -0.666
247 Gentoo female 45.5 119 -0.666
248 Gentoo male 45.5 119 -0.666
249 Gentoo female 45.5 119 -0.666
250 Gentoo female 45.7 119 -0.601
251 Gentoo female 45.8 119 -0.569
252 Gentoo female 45.8 119 -0.569
253 Gentoo female 46.1 119 -0.473
254 Gentoo male 46.1 119 -0.473
255 Gentoo female 46.2 119 -0.440
256 Gentoo male 46.2 119 -0.440
257 Gentoo female 46.2 119 -0.440
258 Gentoo male 46.3 119 -0.408
259 Gentoo male 46.4 119 -0.376
260 Gentoo female 46.4 119 -0.376
261 Gentoo female 46.5 119 -0.344
262 Gentoo female 46.5 119 -0.344
263 Gentoo female 46.5 119 -0.344
264 Gentoo female 46.5 119 -0.344
265 Gentoo female 46.6 119 -0.312
266 Gentoo male 46.7 119 -0.279
267 Gentoo male 46.8 119 -0.247
268 Gentoo male 46.8 119 -0.247
269 Gentoo female 46.8 119 -0.247
270 Gentoo female 46.9 119 -0.215
271 Gentoo female 47.2 119 -0.118
272 Gentoo female 47.2 119 -0.118
273 Gentoo male 47.3 119 -0.0863
274 Gentoo female 47.4 119 -0.0541
275 Gentoo female 47.5 119 -0.0219
276 Gentoo female 47.5 119 -0.0219
277 Gentoo female 47.5 119 -0.0219
278 Gentoo male 47.6 119 0.0103
279 Gentoo female 47.7 119 0.0425
280 Gentoo male 47.8 119 0.0747
281 Gentoo male 48.1 119 0.171
282 Gentoo female 48.2 119 0.203
283 Gentoo male 48.2 119 0.203
284 Gentoo male 48.4 119 0.268
285 Gentoo male 48.4 119 0.268
286 Gentoo female 48.4 119 0.268
287 Gentoo male 48.5 119 0.300
288 Gentoo female 48.5 119 0.300
289 Gentoo male 48.6 119 0.332
290 Gentoo female 48.7 119 0.364
291 Gentoo male 48.7 119 0.364
292 Gentoo male 48.7 119 0.364
293 Gentoo male 48.8 119 0.397
294 Gentoo male 49 119 0.461
295 Gentoo female 49.1 119 0.493
296 Gentoo female 49.1 119 0.493
297 Gentoo male 49.1 119 0.493
298 Gentoo male 49.2 119 0.525
299 Gentoo male 49.3 119 0.558
300 Gentoo male 49.4 119 0.590
301 Gentoo male 49.5 119 0.622
302 Gentoo male 49.5 119 0.622
303 Gentoo male 49.6 119 0.654
304 Gentoo male 49.6 119 0.654
305 Gentoo male 49.8 119 0.719
306 Gentoo male 49.8 119 0.719
307 Gentoo male 49.9 119 0.751
308 Gentoo male 50 119 0.783
309 Gentoo male 50 119 0.783
310 Gentoo male 50 119 0.783
311 Gentoo male 50 119 0.783
312 Gentoo male 50.1 119 0.815
313 Gentoo male 50.2 119 0.847
314 Gentoo male 50.4 119 0.912
315 Gentoo male 50.4 119 0.912
316 Gentoo male 50.5 119 0.944
317 Gentoo male 50.5 119 0.944
318 Gentoo female 50.5 119 0.944
319 Gentoo male 50.7 119 1.01
320 Gentoo male 50.8 119 1.04
321 Gentoo male 50.8 119 1.04
322 Gentoo male 51.1 119 1.14
323 Gentoo male 51.1 119 1.14
324 Gentoo male 51.3 119 1.20
325 Gentoo male 51.5 119 1.27
326 Gentoo male 52.1 119 1.46
327 Gentoo male 52.2 119 1.49
328 Gentoo male 52.5 119 1.59
329 Gentoo male 53.4 119 1.88
330 Gentoo male 54.3 119 2.17
331 Gentoo male 55.1 119 2.42
332 Gentoo male 55.9 119 2.68
333 Gentoo male 59.6 119 3.87
# get the largest 3 penguins by bill depth from each species
penguins %>%
group_by(species) %>%
slice_max(bill_depth_mm, n = 3)
# A tibble: 9 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgers… 46 21.5 194 4200 male
2 Adelie Torgers… 38.6 21.2 191 3800 male
3 Adelie Dream 42.3 21.2 191 4150 male
4 Chinstrap Dream 54.2 20.8 201 4300 male
5 Chinstrap Dream 52 20.7 210 4800 male
6 Chinstrap Dream 51.7 20.3 194 3775 male
7 Gentoo Biscoe 44.4 17.3 219 5250 male
8 Gentoo Biscoe 50.8 17.3 228 5600 male
9 Gentoo Biscoe 52.2 17.1 228 5400 male
# ℹ 1 more variable: year <dbl>
8.1.1 Regrouping
Sometimes one group_by()
may not be enough to get what you need; you may need to re-group_by()
by something else to finish the job. For example, suppose you want to see what percent of each species came from different islands. This requires two uses of group_by()
:
# first group by species + island and summarize to get size of each group,
# then regroup by species and MUTATE (not summarize) totals for each species,
# then divide these to get proportion of each species from each island
penguins %>%
group_by(species, island) %>%
summarize(n = n()) %>%
group_by(species) %>%
mutate(
species_total = sum(n),
pct_of_species_from_island = n / species_total * 100
)
# A tibble: 5 × 5
# Groups: species [3]
species island n species_total pct_of_species_from_island
<chr> <chr> <int> <int> <dbl>
1 Adelie Biscoe 44 146 30.1
2 Adelie Dream 55 146 37.7
3 Adelie Torgersen 47 146 32.2
4 Chinstrap Dream 68 68 100
5 Gentoo Biscoe 119 119 100
This combination of df %>% group_by(...) %>% summarize(n = n())
is so common, we have this shortcut for it: df %>% count(...)
. We can demonstrate this in another example involving regrouping. Suppose we want to know what percent of each species were male/female:
# again, first group by species + sex and get size using the shortcut count(),
# then regroup by species and MUTATE totals for each species,
# then divide to get proportions of each species that were male/female
penguins %>%
count(species, sex) %>%
group_by(species) %>%
mutate(
species_total = sum(n),
pct_of_species_each_sex = n / species_total * 100
)
# A tibble: 6 × 5
# Groups: species [3]
species sex n species_total pct_of_species_each_sex
<chr> <chr> <int> <int> <dbl>
1 Adelie female 73 146 50
2 Adelie male 73 146 50
3 Chinstrap female 34 68 50
4 Chinstrap male 34 68 50
5 Gentoo female 58 119 48.7
6 Gentoo male 61 119 51.3
Groups are also useful for prepping data frames for plotting. For example, here’s a chunk that produces a bar plot showing how mean body mass changes by species and sex:
# get mean body mass by species + sex and plot
# the %T>% is a special pipe called a Tee pipe
# it's a shortcut for piping something into 2 different operations,
# useful for example when you want to print a data frame, then also plot it
# see https://magrittr.tidyverse.org/reference/tee.html for more
penguins %>%
group_by(species, sex) %>%
summarize(mean_mass = mean(body_mass_g)) %T>% print %>%
ggplot(aes(x = species, y = mean_mass, fill = sex)) +
geom_col(position = "dodge2") +
labs(x = "Species", y = "Mean body mass (g)",
title = "Mean body mass of Palmer penguins by species + sex")
# A tibble: 6 × 3
# Groups: species [3]
species sex mean_mass
<chr> <chr> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
8.1.2 Ungrouping
Many operations output grouped data frames. For example, look closely at the output of the previous chunks and you’ll see # Groups: species [3]
in most of them. This means any further operations you run will continue to execute in a grouped way.
You can remove the grouping structure with ungroup()
. This allows you to revert to running operations on the entire data frame. Example:
# get count of each species + sex combination,
# but this time get its percentage out of ALL observations
penguins %>%
count(species, sex) %>%
ungroup() %>%
mutate(pct_of_all = n / sum(n))
# A tibble: 6 × 4
species sex n pct_of_all
<chr> <chr> <int> <dbl>
1 Adelie female 73 0.219
2 Adelie male 73 0.219
3 Chinstrap female 34 0.102
4 Chinstrap male 34 0.102
5 Gentoo female 58 0.174
6 Gentoo male 61 0.183
8.1.3 More practice! (fertility
data)
Let’s give the penguins dataset a rest and practice dplyr and grouping a bit more with a different dataset. The following chunk imports fertility.csv
, the cleaned global fertility data set from World Bank, giving the average number of births per woman for each year and country from 1960 to present. For the past few decades, global fertility has been sharply declining for most countries. Many countries are now below the replacement rate of 2.1, leading to widespread concerns of a population collapse in the latter part of the 21st century.
fertility <- read_csv("https://bwu62.github.io/stat240-revamp/data/fertility.csv")
fertility
# A tibble: 13,792 × 6
code country region income_group year rate
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 AFG Afghanistan South Asia Low income 1960 7.28
2 AFG Afghanistan South Asia Low income 1961 7.28
3 AFG Afghanistan South Asia Low income 1962 7.29
4 AFG Afghanistan South Asia Low income 1963 7.30
5 AFG Afghanistan South Asia Low income 1964 7.30
6 AFG Afghanistan South Asia Low income 1965 7.30
7 AFG Afghanistan South Asia Low income 1966 7.32
8 AFG Afghanistan South Asia Low income 1967 7.34
9 AFG Afghanistan South Asia Low income 1968 7.36
10 AFG Afghanistan South Asia Low income 1969 7.39
# ℹ 13,782 more rows
Each country has over 60 years of data in this dataset. We can see an overview which countries are represented in the dataset and what their listed region and income group are by temporarily dropping years, removing duplicates, and printing the full output below in a collapsible box:
# A tibble: 215 × 3
country region income_group
<chr> <chr> <chr>
1 Afghanistan South Asia Low income
2 Albania Europe & Central Asia Upper middle income
3 Algeria Middle East & North Africa Upper middle income
4 American Samoa East Asia & Pacific High income
5 Andorra Europe & Central Asia High income
6 Angola Sub-Saharan Africa Lower middle income
7 Antigua & Barbuda Latin America & Caribbean High income
8 Argentina Latin America & Caribbean Upper middle income
9 Armenia Europe & Central Asia Upper middle income
10 Aruba Latin America & Caribbean High income
11 Australia East Asia & Pacific High income
12 Austria Europe & Central Asia High income
13 Azerbaijan Europe & Central Asia Upper middle income
14 Bahamas Latin America & Caribbean High income
15 Bahrain Middle East & North Africa High income
16 Bangladesh South Asia Lower middle income
17 Barbados Latin America & Caribbean High income
18 Belarus Europe & Central Asia Upper middle income
19 Belgium Europe & Central Asia High income
20 Belize Latin America & Caribbean Upper middle income
21 Benin Sub-Saharan Africa Lower middle income
22 Bermuda North America High income
23 Bhutan South Asia Lower middle income
24 Bolivia Latin America & Caribbean Lower middle income
25 Bosnia & Herzegovina Europe & Central Asia Upper middle income
26 Botswana Sub-Saharan Africa Upper middle income
27 Brazil Latin America & Caribbean Upper middle income
28 Brunei East Asia & Pacific High income
29 Bulgaria Europe & Central Asia High income
30 Burkina Faso Sub-Saharan Africa Low income
31 Burundi Sub-Saharan Africa Low income
32 Cambodia East Asia & Pacific Lower middle income
33 Cameroon Sub-Saharan Africa Lower middle income
34 Canada North America High income
35 Cape Verde Sub-Saharan Africa Lower middle income
36 Cayman Islands Latin America & Caribbean High income
37 Central African Republic Sub-Saharan Africa Low income
38 Chad Sub-Saharan Africa Low income
39 Channel Islands Europe & Central Asia High income
40 Chile Latin America & Caribbean High income
41 China East Asia & Pacific Upper middle income
42 Colombia Latin America & Caribbean Upper middle income
43 Comoros Sub-Saharan Africa Lower middle income
44 Congo, Dem. Rep. Sub-Saharan Africa Low income
45 Congo, Rep. Sub-Saharan Africa Lower middle income
46 Costa Rica Latin America & Caribbean Upper middle income
47 Croatia Europe & Central Asia High income
48 Cuba Latin America & Caribbean Upper middle income
49 Curaçao Latin America & Caribbean High income
50 Cyprus Europe & Central Asia High income
51 Czechia Europe & Central Asia High income
52 Denmark Europe & Central Asia High income
53 Djibouti Middle East & North Africa Lower middle income
54 Dominica Latin America & Caribbean Upper middle income
55 Dominican Republic Latin America & Caribbean Upper middle income
56 East Timor East Asia & Pacific Lower middle income
57 Ecuador Latin America & Caribbean Upper middle income
58 Egypt Middle East & North Africa Lower middle income
59 El Salvador Latin America & Caribbean Upper middle income
60 Equatorial Guinea Sub-Saharan Africa Upper middle income
61 Eritrea Sub-Saharan Africa Low income
62 Estonia Europe & Central Asia High income
63 Eswatini Sub-Saharan Africa Lower middle income
64 Ethiopia Sub-Saharan Africa Low income
65 Faroe Islands Europe & Central Asia High income
66 Fiji East Asia & Pacific Upper middle income
67 Finland Europe & Central Asia High income
68 France Europe & Central Asia High income
69 French Polynesia East Asia & Pacific High income
70 Gabon Sub-Saharan Africa Upper middle income
71 Gambia Sub-Saharan Africa Low income
72 Georgia Europe & Central Asia Upper middle income
73 Germany Europe & Central Asia High income
74 Ghana Sub-Saharan Africa Lower middle income
75 Gibraltar Europe & Central Asia High income
76 Greece Europe & Central Asia High income
77 Greenland Europe & Central Asia High income
78 Grenada Latin America & Caribbean Upper middle income
79 Guam East Asia & Pacific High income
80 Guatemala Latin America & Caribbean Upper middle income
81 Guinea Sub-Saharan Africa Lower middle income
82 Guinea-Bissau Sub-Saharan Africa Low income
83 Guyana Latin America & Caribbean High income
84 Haiti Latin America & Caribbean Lower middle income
85 Honduras Latin America & Caribbean Lower middle income
86 Hong Kong East Asia & Pacific High income
87 Hungary Europe & Central Asia High income
88 Iceland Europe & Central Asia High income
89 India South Asia Lower middle income
90 Indonesia East Asia & Pacific Upper middle income
91 Iran Middle East & North Africa Upper middle income
92 Iraq Middle East & North Africa Upper middle income
93 Ireland Europe & Central Asia High income
94 Isle of Man Europe & Central Asia High income
95 Israel Middle East & North Africa High income
96 Italy Europe & Central Asia High income
97 Ivory Coast Sub-Saharan Africa Lower middle income
98 Jamaica Latin America & Caribbean Upper middle income
99 Japan East Asia & Pacific High income
100 Jordan Middle East & North Africa Lower middle income
101 Kazakhstan Europe & Central Asia Upper middle income
102 Kenya Sub-Saharan Africa Lower middle income
103 Kiribati East Asia & Pacific Lower middle income
104 Kosovo Europe & Central Asia Upper middle income
105 Kuwait Middle East & North Africa High income
106 Kyrgyzstan Europe & Central Asia Lower middle income
107 Laos East Asia & Pacific Lower middle income
108 Latvia Europe & Central Asia High income
109 Lebanon Middle East & North Africa Lower middle income
110 Lesotho Sub-Saharan Africa Lower middle income
111 Liberia Sub-Saharan Africa Low income
112 Libya Middle East & North Africa Upper middle income
113 Liechtenstein Europe & Central Asia High income
114 Lithuania Europe & Central Asia High income
115 Luxembourg Europe & Central Asia High income
116 Macao East Asia & Pacific High income
117 Madagascar Sub-Saharan Africa Low income
118 Malawi Sub-Saharan Africa Low income
119 Malaysia East Asia & Pacific Upper middle income
120 Maldives South Asia Upper middle income
121 Mali Sub-Saharan Africa Low income
122 Malta Middle East & North Africa High income
123 Marshall Islands East Asia & Pacific Upper middle income
124 Mauritania Sub-Saharan Africa Lower middle income
125 Mauritius Sub-Saharan Africa Upper middle income
126 Mexico Latin America & Caribbean Upper middle income
127 Micronesia East Asia & Pacific Lower middle income
128 Moldova Europe & Central Asia Upper middle income
129 Monaco Europe & Central Asia High income
130 Mongolia East Asia & Pacific Upper middle income
131 Montenegro Europe & Central Asia Upper middle income
132 Morocco Middle East & North Africa Lower middle income
133 Mozambique Sub-Saharan Africa Low income
134 Myanmar East Asia & Pacific Lower middle income
135 Namibia Sub-Saharan Africa Upper middle income
136 Nauru East Asia & Pacific High income
137 Nepal South Asia Lower middle income
138 Netherlands Europe & Central Asia High income
139 New Caledonia East Asia & Pacific High income
140 New Zealand East Asia & Pacific High income
141 Nicaragua Latin America & Caribbean Lower middle income
142 Niger Sub-Saharan Africa Low income
143 Nigeria Sub-Saharan Africa Lower middle income
144 North Korea East Asia & Pacific Low income
145 North Macedonia Europe & Central Asia Upper middle income
146 Northern Mariana Islands East Asia & Pacific High income
147 Norway Europe & Central Asia High income
148 Oman Middle East & North Africa High income
149 Pakistan South Asia Lower middle income
150 Palau East Asia & Pacific High income
151 Panama Latin America & Caribbean High income
152 Papua New Guinea East Asia & Pacific Lower middle income
153 Paraguay Latin America & Caribbean Upper middle income
154 Peru Latin America & Caribbean Upper middle income
155 Philippines East Asia & Pacific Lower middle income
156 Poland Europe & Central Asia High income
157 Portugal Europe & Central Asia High income
158 Puerto Rico Latin America & Caribbean High income
159 Qatar Middle East & North Africa High income
160 Romania Europe & Central Asia High income
161 Russia Europe & Central Asia High income
162 Rwanda Sub-Saharan Africa Low income
163 Samoa East Asia & Pacific Lower middle income
164 San Marino Europe & Central Asia High income
165 Saudi Arabia Middle East & North Africa High income
166 Senegal Sub-Saharan Africa Lower middle income
167 Serbia Europe & Central Asia Upper middle income
168 Seychelles Sub-Saharan Africa High income
169 Sierra Leone Sub-Saharan Africa Low income
170 Singapore East Asia & Pacific High income
171 Sint Maarten Latin America & Caribbean High income
172 Slovakia Europe & Central Asia High income
173 Slovenia Europe & Central Asia High income
174 Solomon Islands East Asia & Pacific Lower middle income
175 Somalia Sub-Saharan Africa Low income
176 South Africa Sub-Saharan Africa Upper middle income
177 South Korea East Asia & Pacific High income
178 South Sudan Sub-Saharan Africa Low income
179 Spain Europe & Central Asia High income
180 Sri Lanka South Asia Lower middle income
181 St. Kitts & Nevis Latin America & Caribbean High income
182 St. Lucia Latin America & Caribbean Upper middle income
183 St. Martin Latin America & Caribbean High income
184 St. Vincent & Grenadines Latin America & Caribbean Upper middle income
185 Sudan Sub-Saharan Africa Low income
186 Suriname Latin America & Caribbean Upper middle income
187 Sweden Europe & Central Asia High income
188 Switzerland Europe & Central Asia High income
189 Syria Middle East & North Africa Low income
190 São Tomé & Principe Sub-Saharan Africa Lower middle income
191 Tajikistan Europe & Central Asia Lower middle income
192 Tanzania Sub-Saharan Africa Lower middle income
193 Thailand East Asia & Pacific Upper middle income
194 Togo Sub-Saharan Africa Low income
195 Tonga East Asia & Pacific Upper middle income
196 Trinidad & Tobago Latin America & Caribbean High income
197 Tunisia Middle East & North Africa Lower middle income
198 Turkey Europe & Central Asia Upper middle income
199 Turkmenistan Europe & Central Asia Upper middle income
200 Turks & Caicos Islands Latin America & Caribbean High income
201 Tuvalu East Asia & Pacific Upper middle income
202 Uganda Sub-Saharan Africa Low income
203 Ukraine Europe & Central Asia Upper middle income
204 United Arab Emirates Middle East & North Africa High income
205 United Kingdom Europe & Central Asia High income
206 United States North America High income
207 Uruguay Latin America & Caribbean High income
208 Uzbekistan Europe & Central Asia Lower middle income
209 Vanuatu East Asia & Pacific Lower middle income
210 Vietnam East Asia & Pacific Lower middle income
211 Virgin Islands Latin America & Caribbean High income
212 West Bank & Gaza Middle East & North Africa Lower middle income
213 Yemen Middle East & North Africa Low income
214 Zambia Sub-Saharan Africa Lower middle income
215 Zimbabwe Sub-Saharan Africa Lower middle income
One additional small processing step we should do before continuing is convert income_group
to an ordered factor (see section 3.9.5), which will be important later.
fertility <- fertility %>% mutate(
income_group = factor(income_group, ordered = TRUE, levels = c(
"Low", "Lower middle", "Upper middle", "High"))
)
We can begin by running a few summaries to explore the dataset. To start, here’s a chunk showing the number of countries and median income group for countries in each region:
# strangely, base R median doesn't work on ordered categories,
# but we can use Median from DescTools instead
fertility %>%
select(country, region, income_group) %>%
distinct() %>%
group_by(region) %>%
summarize(n = n(), median = DescTools::Median(income_group)) %>%
arrange(desc(median))
# A tibble: 7 × 3
region n median
<chr> <int> <lgl>
1 East Asia & Pacific 37 NA
2 Europe & Central Asia 58 NA
3 Latin America & Caribbean 40 NA
4 Middle East & North Africa 21 NA
5 North America 3 NA
6 South Asia 8 NA
7 Sub-Saharan Africa 48 NA
Next, here’s a chunk showing the median fertility rate in each region for the most recent year of 2023, as well as the countries with the highest and lowest 2023 rates (and what the rates are) in each region:
# first filter to get the right year, then
# sort by region, rate (so min, max are the first, last in each group)
# then summarize to get median, and min/max country/rate
fertility %>%
filter(year == max(year)) %>%
arrange(region, rate) %>%
group_by(region) %>%
summarize(
n = n(),
median = median(rate),
min_country = first(country),
min = first(rate),
max_country = last(country),
max = last(rate)
) %>%
arrange(median)
# A tibble: 7 × 7
region n median min_country min max_country max
<chr> <int> <dbl> <chr> <dbl> <chr> <dbl>
1 North America 3 1.40 Canada 1.26 United States 1.62
2 Europe & Central Asia 58 1.49 Ukraine 0.977 Uzbekistan 3.5
3 Latin America & Caribbean 41 1.62 Puerto Rico 0.92 St. Martin 2.72
4 South Asia 8 1.98 Bhutan 1.46 Afghanistan 4.84
5 East Asia & Pacific 37 2.13 Macao 0.586 Samoa 3.83
6 Middle East & North Africa 21 2.36 Malta 1.06 Yemen 4.59
7 Sub-Saharan Africa 48 4.00 Mauritius 1.39 Somalia 6.13
We can also show the latest rate for each country, as well as the change from 2000, just before the start of the 21st century:
# first filter to get the right years, then
# sort by country, year (so 2000, 2022 are first and last in each group)
# then summarize to get 2000 and 2022 rates, mutate to get change,
# then ungroup, distinct, and arrange to display a neat output
# again, collapsing output due to lengthy print out
# %T>% is used again to both save and print the results
fertility_change <- fertility %>%
filter(year %in% c(2000, max(year))) %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(
rate2000 = first(rate),
rate2022 = last(rate),
change = rate2022 - rate2000
) %>%
select(country, region, rate2000, change, rate2022) %>%
ungroup() %>%
distinct() %>%
arrange(rate2022) %T>%
print(n = Inf)
# A tibble: 215 × 5
country region rate2000 change rate2022
<chr> <chr> <dbl> <dbl> <dbl>
1 Macao East Asia & Pacific 0.933 -0.347 0.586
2 South Korea East Asia & Pacific 1.48 -0.759 0.721
3 Hong Kong East Asia & Pacific 1.03 -0.281 0.751
4 Puerto Rico Latin America & Caribbean 2.05 -1.13 0.92
5 Singapore East Asia & Pacific 1.6 -0.63 0.97
6 Ukraine Europe & Central Asia 1.11 -0.131 0.977
7 China East Asia & Pacific 1.63 -0.629 0.999
8 Malta Middle East & North Africa 1.68 -0.62 1.06
9 Andorra Europe & Central Asia 1.27 -0.191 1.08
10 Spain Europe & Central Asia 1.22 -0.100 1.12
11 San Marino Europe & Central Asia 1.40 -0.255 1.15
12 Poland Europe & Central Asia 1.37 -0.212 1.16
13 Chile Latin America & Caribbean 2.05 -0.879 1.17
14 Lithuania Europe & Central Asia 1.39 -0.21 1.18
15 Curaçao Latin America & Caribbean 2.34 -1.14 1.2
16 Italy Europe & Central Asia 1.26 -0.0600 1.2
17 Japan East Asia & Pacific 1.36 -0.160 1.2
18 United Arab Emirates Middle East & North Africa 2.73 -1.53 1.2
19 Belarus Europe & Central Asia 1.32 -0.109 1.21
20 Thailand East Asia & Pacific 1.73 -0.522 1.21
21 Luxembourg Europe & Central Asia 1.76 -0.51 1.25
22 Canada North America 1.51 -0.25 1.26
23 Finland Europe & Central Asia 1.73 -0.47 1.26
24 Estonia Europe & Central Asia 1.36 -0.0500 1.31
25 Austria Europe & Central Asia 1.36 -0.0400 1.32
26 Greece Europe & Central Asia 1.25 0.0700 1.32
27 Costa Rica Latin America & Caribbean 2.41 -1.08 1.33
28 Switzerland Europe & Central Asia 1.5 -0.17 1.33
29 Albania Europe & Central Asia 2.22 -0.869 1.35
30 Jamaica Latin America & Caribbean 2.35 -0.987 1.36
31 Latvia Europe & Central Asia 1.25 0.110 1.36
32 Channel Islands Europe & Central Asia 1.49 -0.120 1.37
33 Bahamas Latin America & Caribbean 2.06 -0.687 1.37
34 St. Lucia Latin America & Caribbean 2.21 -0.828 1.38
35 Cyprus Europe & Central Asia 1.64 -0.249 1.39
36 Germany Europe & Central Asia 1.38 0.0100 1.39
37 Mauritius Sub-Saharan Africa 1.99 -0.6 1.39
38 Norway Europe & Central Asia 1.85 -0.45 1.4
39 Bermuda North America 1.74 -0.337 1.40
40 Uruguay Latin America & Caribbean 2.20 -0.787 1.41
41 Russia Europe & Central Asia 1.20 0.215 1.41
42 Netherlands Europe & Central Asia 1.72 -0.29 1.43
43 Cuba Latin America & Caribbean 1.61 -0.169 1.44
44 Portugal Europe & Central Asia 1.55 -0.110 1.44
45 Sint Maarten Latin America & Caribbean 1.98 -0.535 1.45
46 Czechia Europe & Central Asia 1.15 0.3 1.45
47 Liechtenstein Europe & Central Asia 1.57 -0.120 1.45
48 Sweden Europe & Central Asia 1.54 -0.0900 1.45
49 Croatia Europe & Central Asia 1.39 0.0700 1.46
50 Bhutan South Asia 3.38 -1.92 1.46
51 Turks & Caicos Islands Latin America & Caribbean 2.41 -0.943 1.46
52 Belgium Europe & Central Asia 1.67 -0.2 1.47
53 Dominica Latin America & Caribbean 2.35 -0.862 1.48
54 Bosnia & Herzegovina Europe & Central Asia 1.28 0.204 1.49
55 Grenada Latin America & Caribbean 2.44 -0.954 1.49
56 Slovakia Europe & Central Asia 1.3 0.19 1.49
57 Argentina Latin America & Caribbean 2.59 -1.09 1.5
58 Australia East Asia & Pacific 1.76 -0.256 1.5
59 Denmark Europe & Central Asia 1.77 -0.27 1.5
60 Ireland Europe & Central Asia 1.89 -0.39 1.5
61 North Macedonia Europe & Central Asia 1.86 -0.363 1.5
62 French Polynesia East Asia & Pacific 2.60 -1.10 1.50
63 St. Kitts & Nevis Latin America & Caribbean 2.13 -0.621 1.50
64 Hungary Europe & Central Asia 1.32 0.19 1.51
65 Slovenia Europe & Central Asia 1.26 0.25 1.51
66 Turkey Europe & Central Asia 2.49 -0.977 1.51
67 Cape Verde Sub-Saharan Africa 3.54 -2.02 1.52
68 Kuwait Middle East & North Africa 2.89 -1.36 1.52
69 Cayman Islands Latin America & Caribbean 1.71 -0.176 1.53
70 Trinidad & Tobago Latin America & Caribbean 1.73 -0.2 1.53
71 Kosovo Europe & Central Asia 2.70 -1.15 1.54
72 Isle of Man Europe & Central Asia 1.87 -0.318 1.55
73 Azerbaijan Europe & Central Asia 2 -0.45 1.55
74 Malaysia East Asia & Pacific 2.94 -1.39 1.55
75 New Zealand East Asia & Pacific 1.98 -0.42 1.56
76 United Kingdom Europe & Central Asia 1.64 -0.0800 1.56
77 Maldives South Asia 2.69 -1.12 1.58
78 Antigua & Barbuda Latin America & Caribbean 2.20 -0.618 1.58
79 Iceland Europe & Central Asia 2.08 -0.49 1.59
80 Aruba Latin America & Caribbean 1.84 -0.243 1.60
81 Serbia Europe & Central Asia 1.48 0.130 1.61
82 United States North America 2.06 -0.440 1.62
83 Brazil Latin America & Caribbean 2.25 -0.628 1.62
84 Colombia Latin America & Caribbean 2.57 -0.929 1.64
85 France Europe & Central Asia 1.89 -0.23 1.66
86 Iran Middle East & North Africa 2.00 -0.304 1.70
87 Barbados Latin America & Caribbean 1.93 -0.225 1.71
88 Romania Europe & Central Asia 1.31 0.4 1.71
89 Qatar Middle East & North Africa 3.07 -1.34 1.73
90 Moldova Europe & Central Asia 1.50 0.235 1.73
91 Montenegro Europe & Central Asia 1.90 -0.163 1.74
92 Brunei East Asia & Pacific 2.40 -0.651 1.75
93 Greenland Europe & Central Asia 2.33 -0.56 1.77
94 El Salvador Latin America & Caribbean 3.08 -1.30 1.78
95 St. Vincent & Grenadines Latin America & Caribbean 2.31 -0.531 1.78
96 North Korea East Asia & Pacific 1.94 -0.161 1.78
97 Bulgaria Europe & Central Asia 1.26 0.55 1.81
98 Georgia Europe & Central Asia 1.61 0.201 1.81
99 Ecuador Latin America & Caribbean 3.10 -1.28 1.82
100 Bahrain Middle East & North Africa 2.80 -0.974 1.82
101 Tunisia Middle East & North Africa 2.02 -0.193 1.83
102 Faroe Islands Europe & Central Asia 2.59 -0.728 1.86
103 Gibraltar Europe & Central Asia 2.18 -0.297 1.89
104 Armenia Europe & Central Asia 1.3 0.6 1.9
105 Palau East Asia & Pacific 1.82 0.0880 1.91
106 Mexico Latin America & Caribbean 2.71 -0.804 1.91
107 Vietnam East Asia & Pacific 2.03 -0.115 1.91
108 Philippines East Asia & Pacific 3.75 -1.84 1.92
109 Sri Lanka South Asia 2.21 -0.239 1.97
110 India South Asia 3.35 -1.38 1.98
111 New Caledonia East Asia & Pacific 2.51 -0.529 1.98
112 Peru Latin America & Caribbean 2.84 -0.865 1.98
113 Virgin Islands Latin America & Caribbean 1.84 0.141 1.98
114 Nepal South Asia 3.98 -2 1.98
115 Belize Latin America & Caribbean 3.63 -1.63 2.01
116 Seychelles Sub-Saharan Africa 2.08 -0.0600 2.02
117 Monaco Europe & Central Asia 2.08 0.0280 2.11
118 Myanmar East Asia & Pacific 2.81 -0.698 2.12
119 Panama Latin America & Caribbean 2.74 -0.625 2.12
120 Indonesia East Asia & Pacific 2.50 -0.376 2.13
121 Bangladesh South Asia 3.28 -1.12 2.16
122 South Africa Sub-Saharan Africa 2.41 -0.197 2.22
123 Nicaragua Latin America & Caribbean 3.12 -0.902 2.22
124 Morocco Middle East & North Africa 2.77 -0.544 2.23
125 Lebanon Middle East & North Africa 2.6 -0.361 2.24
126 Dominican Republic Latin America & Caribbean 2.87 -0.626 2.24
127 Suriname Latin America & Caribbean 3.05 -0.797 2.25
128 Saudi Arabia Middle East & North Africa 4.21 -1.93 2.28
129 Fiji East Asia & Pacific 2.99 -0.711 2.28
130 American Samoa East Asia & Pacific 3.89 -1.60 2.29
131 Guatemala Latin America & Caribbean 4.58 -2.27 2.31
132 Northern Mariana Islands East Asia & Pacific 1.64 0.708 2.35
133 Libya Middle East & North Africa 2.91 -0.551 2.36
134 Guyana Latin America & Caribbean 3.06 -0.645 2.41
135 Laos East Asia & Pacific 4.43 -2.01 2.42
136 Paraguay Latin America & Caribbean 3.51 -1.09 2.42
137 Honduras Latin America & Caribbean 4.29 -1.79 2.5
138 Oman Middle East & North Africa 3.85 -1.33 2.52
139 Bolivia Latin America & Caribbean 3.99 -1.44 2.55
140 Cambodia East Asia & Pacific 3.79 -1.21 2.58
141 Djibouti Middle East & North Africa 4.60 -1.99 2.61
142 Jordan Middle East & North Africa 3.86 -1.23 2.64
143 Haiti Latin America & Caribbean 4.39 -1.73 2.66
144 Turkmenistan Europe & Central Asia 2.9 -0.21 2.69
145 Lesotho Sub-Saharan Africa 3.60 -0.911 2.69
146 Kyrgyzstan Europe & Central Asia 2.4 0.300 2.7
147 Mongolia East Asia & Pacific 2.2 0.5 2.7
148 East Timor East Asia & Pacific 5.93 -3.23 2.71
149 Syria Middle East & North Africa 4.07 -1.36 2.71
150 St. Martin Latin America & Caribbean 2.88 -0.158 2.72
151 Botswana Sub-Saharan Africa 3.29 -0.562 2.73
152 Micronesia East Asia & Pacific 4.22 -1.47 2.75
153 Egypt Middle East & North Africa 3.50 -0.755 2.75
154 Eswatini Sub-Saharan Africa 4.03 -1.28 2.75
155 Algeria Middle East & North Africa 2.59 0.176 2.77
156 Guam East Asia & Pacific 2.92 -0.134 2.78
157 Israel Middle East & North Africa 2.95 -0.100 2.85
158 Marshall Islands East Asia & Pacific 4.70 -1.78 2.92
159 Kazakhstan Europe & Central Asia 1.90 1.11 3.01
160 Tajikistan Europe & Central Asia 3.49 -0.418 3.07
161 Papua New Guinea East Asia & Pacific 4.53 -1.43 3.10
162 Tonga East Asia & Pacific 4.16 -1.03 3.13
163 Kiribati East Asia & Pacific 4.07 -0.924 3.15
164 Kenya Sub-Saharan Africa 5.14 -1.93 3.21
165 Tuvalu East Asia & Pacific 3.84 -0.633 3.21
166 Namibia Sub-Saharan Africa 3.99 -0.779 3.21
167 Iraq Middle East & North Africa 4.92 -1.67 3.25
168 West Bank & Gaza Middle East & North Africa 5.46 -2.15 3.31
169 Nauru East Asia & Pacific 3.54 -0.212 3.33
170 Ghana Sub-Saharan Africa 4.83 -1.43 3.40
171 Uzbekistan Europe & Central Asia 2.70 0.804 3.5
172 Solomon Islands East Asia & Pacific 4.76 -1.2 3.56
173 Vanuatu East Asia & Pacific 4.54 -0.941 3.60
174 Pakistan South Asia 5.35 -1.75 3.60
175 São Tomé & Principe Sub-Saharan Africa 5.16 -1.51 3.64
176 Gabon Sub-Saharan Africa 4.47 -0.822 3.65
177 Malawi Sub-Saharan Africa 6.00 -2.35 3.65
178 Rwanda Sub-Saharan Africa 5.97 -2.27 3.70
179 Eritrea Sub-Saharan Africa 5.40 -1.68 3.71
180 Zimbabwe Sub-Saharan Africa 4.01 -0.285 3.72
181 Sierra Leone Sub-Saharan Africa 6.36 -2.56 3.79
182 Senegal Sub-Saharan Africa 5.50 -1.68 3.82
183 Samoa East Asia & Pacific 4.52 -0.694 3.83
184 Guinea-Bissau Sub-Saharan Africa 5.79 -1.95 3.84
185 South Sudan Sub-Saharan Africa 6.80 -2.94 3.86
186 Comoros Sub-Saharan Africa 5.23 -1.35 3.88
187 Liberia Sub-Saharan Africa 5.88 -1.92 3.95
188 Madagascar Sub-Saharan Africa 5.66 -1.69 3.97
189 Ethiopia Sub-Saharan Africa 6.65 -2.66 3.99
190 Gambia Sub-Saharan Africa 5.72 -1.71 4.01
191 Equatorial Guinea Sub-Saharan Africa 5.83 -1.75 4.08
192 Zambia Sub-Saharan Africa 5.92 -1.82 4.10
193 Congo, Rep. Sub-Saharan Africa 4.72 -0.567 4.16
194 Burkina Faso Sub-Saharan Africa 6.52 -2.33 4.19
195 Togo Sub-Saharan Africa 5.16 -0.973 4.19
196 Guinea Sub-Saharan Africa 5.92 -1.70 4.22
197 Ivory Coast Sub-Saharan Africa 5.77 -1.49 4.28
198 Uganda Sub-Saharan Africa 6.79 -2.50 4.28
199 Cameroon Sub-Saharan Africa 5.51 -1.19 4.32
200 Sudan Sub-Saharan Africa 5.60 -1.27 4.32
201 Nigeria Sub-Saharan Africa 6.12 -1.64 4.48
202 Benin Sub-Saharan Africa 5.94 -1.38 4.56
203 Yemen Middle East & North Africa 6.32 -1.73 4.59
204 Tanzania Sub-Saharan Africa 5.67 -1.06 4.61
205 Mauritania Sub-Saharan Africa 5.46 -0.759 4.70
206 Mozambique Sub-Saharan Africa 5.83 -1.07 4.76
207 Afghanistan South Asia 7.57 -2.73 4.84
208 Burundi Sub-Saharan Africa 6.87 -1.99 4.88
209 Angola Sub-Saharan Africa 6.64 -1.52 5.12
210 Mali Sub-Saharan Africa 6.89 -1.27 5.61
211 Central African Republic Sub-Saharan Africa 5.85 0.161 6.01
212 Congo, Dem. Rep. Sub-Saharan Africa 6.70 -0.651 6.05
213 Niger Sub-Saharan Africa 7.83 -1.77 6.06
214 Chad Sub-Saharan Africa 7.25 -1.13 6.12
215 Somalia Sub-Saharan Africa 7.64 -1.51 6.13
Which countries had the biggest drop in fertility?
# A tibble: 10 × 5
country region rate2000 change rate2022
<chr> <chr> <dbl> <dbl> <dbl>
1 East Timor East Asia & Pacific 5.93 -3.23 2.71
2 South Sudan Sub-Saharan Africa 6.80 -2.94 3.86
3 Afghanistan South Asia 7.57 -2.73 4.84
4 Ethiopia Sub-Saharan Africa 6.65 -2.66 3.99
5 Sierra Leone Sub-Saharan Africa 6.36 -2.56 3.79
6 Uganda Sub-Saharan Africa 6.79 -2.50 4.28
7 Malawi Sub-Saharan Africa 6.00 -2.35 3.65
8 Burkina Faso Sub-Saharan Africa 6.52 -2.33 4.19
9 Guatemala Latin America & Caribbean 4.58 -2.27 2.31
10 Rwanda Sub-Saharan Africa 5.97 -2.27 3.70
Did any countries have actually increased?
# A tibble: 10 × 5
country region rate2000 change rate2022
<chr> <chr> <dbl> <dbl> <dbl>
1 Kazakhstan Europe & Central Asia 1.90 1.11 3.01
2 Uzbekistan Europe & Central Asia 2.70 0.804 3.5
3 Northern Mariana Islands East Asia & Pacific 1.64 0.708 2.35
4 Armenia Europe & Central Asia 1.3 0.6 1.9
5 Bulgaria Europe & Central Asia 1.26 0.55 1.81
6 Mongolia East Asia & Pacific 2.2 0.5 2.7
7 Romania Europe & Central Asia 1.31 0.4 1.71
8 Kyrgyzstan Europe & Central Asia 2.4 0.300 2.7
9 Czechia Europe & Central Asia 1.15 0.3 1.45
10 Slovenia Europe & Central Asia 1.26 0.25 1.51
How many countries are now above vs below replacement rate? We can do this by grouping by if the rate is at least 2.1—did I mention you can use transformations inside group_by()
as well?
# group by a new column called at_replacement indicating rate>=2.1
fertility %>%
group_by(at_replacement = rate >= 2.1) %>%
summarize(n = n()) %>%
mutate(pct = 100 * n / sum(n))
# A tibble: 2 × 3
at_replacement n pct
<lgl> <int> <dbl>
1 FALSE 3800 27.6
2 TRUE 9992 72.4
# we can repeat this grouping by region, this time using count()
# and just showing the percentage of countries at replacement
fertility %>%
count(region, at_replacement = rate >= 2.1) %>%
mutate(pct_at_rep = 100 * n / sum(n)) %>%
filter(at_replacement) %>%
select(-n, -at_replacement) %>%
arrange(-pct_at_rep)
# A tibble: 7 × 2
region pct_at_rep
<chr> <dbl>
1 Sub-Saharan Africa 21.9
2 Latin America & Caribbean 14.3
3 East Asia & Pacific 13.3
4 Europe & Central Asia 10.7
5 Middle East & North Africa 8.48
6 South Asia 3.42
7 North America 0.290
Let’s make a few plots of the data as well. Here are 2 plots showing median fertility rate over time grouping by either region or income level:
# first get mean rate in each region + year, then pipe into line plot
# fct_reorder2() is used to reorder legend to be same order as end of lines
# see https://forcats.tidyverse.org/reference/fct_reorder.html for more details
fertility %>%
group_by(region, year) %>%
summarize(median_rate = median(rate)) %>%
ggplot(aes(x = year, y = median_rate,
linetype = fct_reorder2(region, year, median_rate),
color = fct_reorder2(region, year, median_rate))) +
geom_hline(yintercept = 2.1, linetype = "dotted") + geom_line(linewidth = 1) +
labs(x = "Year", y = "Median fertility rate", linetype = "Region", color = "Region",
title = "Median fertility by region from 1960-2022",
subtitle = "(black dotted line at replacement rate of 2.1)") +
scale_x_continuous(expand = c(0, 0), breaks = seq(1960, 2022, 10),
minor_breaks = seq(1960, 2022, 2)) +
scale_y_continuous(expand = c(0, 0), limits = c(1.2, 7.2),
breaks = 2:7, minor_breaks = seq(1.2, 7.2, .2))
# first get mean rate in each income group + year, then pipe into line plot
fertility %>%
group_by(income_group, year) %>%
summarize(median_rate = median(rate)) %>%
ggplot(aes(x = year, y = median_rate,
linetype = income_group, color = income_group)) +
geom_hline(yintercept = 2.1, linetype = "dotted") + geom_line(linewidth = 1) +
labs(x = "Year", y = "Median fertility rate",
linetype = "Income group", color = "Income group",
title = "Median fertility by income group from 1960-2022",
subtitle = "(black dotted line at replacement rate of 2.1)") +
scale_x_continuous(expand = c(0, 0), breaks = seq(1960, 2022, 10),
minor_breaks = seq(1960, 2022, 2)) +
scale_y_continuous(expand = c(0, 0), limits = c(1.4, 7.2),
breaks = 2:7, minor_breaks = seq(1.6, 7.2, .2))
8.1.3.1 Bonus: choropleth
Here’s a bonus plotly choropleth just for fun (you do NOT need to learn this). You can change the year and pan/zoom to see each specific country. Red color indicates countries below the replacement rate 2.1, white indicates at the replacement rate, and blue indicates above the replacement rate.
library(plotly)
plot_ly(fertility, type = "choropleth", locations = ~code, z = ~rate,
text = ~country, frame = ~year, zmin = 0.7, zmax = 8.9, colorscale = list(
c(0, "rgb(255,0,0)"), c(0.1707, "rgb(255,255,255)"), c(1, "rgb(0,0,255)"))) %>%
layout(margin = list(l = 10, r = 10, b = 10, t = 35)) %>% animation_opts(frame = 100)
8.2 Merging
Let’s move on to another topic: merging data frames. Often, the information you need may be spread across several data frames from different samples or different sources, in which case you may need to merge data frames together.
There are generally 2 common ways data may need to be merged: binding rows, and joining columns.
- Binding rows is done when two data frames share the same columns, but have different rows, and you want to combine all the rows together.
- Joining columns is done when two data frames share the same rows, but have different columns, and you want to combine all the columns together.
These two are NOT the same, so it’s important to remember which is which.
8.2.1 Binding
Binding is the simpler of the two, so let’s start here. Returning briefly to the penguins data set, note there’s a year column which indicates when each sample was collected. This means there were 3 different studies conducted. Suppose each of the years was originally in its own data frame (I suspect this is likely to be true).
penguins2007 <- penguins %>% filter(year == 2007)
penguins2008 <- penguins %>% filter(year == 2008)
penguins2009 <- penguins %>% filter(year == 2009)
print(penguins2007, n = 5)
# A tibble: 103 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 female
3 Adelie Torgersen 40.3 18 195 3250 female
4 Adelie Torgersen 36.7 19.3 193 3450 female
5 Adelie Torgersen 39.3 20.6 190 3650 male
# ℹ 98 more rows
# ℹ 1 more variable: year <dbl>
print(penguins2008, n = 5)
# A tibble: 113 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Biscoe 39.6 17.7 186 3500 female
2 Adelie Biscoe 40.1 18.9 188 4300 male
3 Adelie Biscoe 35 17.9 190 3450 female
4 Adelie Biscoe 42 19.5 200 4050 male
5 Adelie Biscoe 34.5 18.1 187 2900 female
# ℹ 108 more rows
# ℹ 1 more variable: year <dbl>
print(penguins2009, n = 5)
# A tibble: 117 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Biscoe 35 17.9 192 3725 female
2 Adelie Biscoe 41 20 203 4725 male
3 Adelie Biscoe 37.7 16 183 3075 female
4 Adelie Biscoe 37.8 20 190 4250 male
5 Adelie Biscoe 37.9 18.6 193 2925 female
# ℹ 112 more rows
# ℹ 1 more variable: year <dbl>
We can see our 3 data frames penguins2007
, penguins2008
, penguins2009
share the exact same columns (same number of columns, same column names, same column types) but have different rows. Each data frame can be thought of as containing a fraction of the overall samples or subjects. In this case, we must row-bind the 3 data frames together.
The function bind_rows(df1, df2, ...)
lets us do this. Just pass each data frame that needs binding in:
penguins_bound <- bind_rows(penguins2007, penguins2008, penguins2009)
penguins_bound
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 fema…
3 Adelie Torgersen 40.3 18 195 3250 fema…
4 Adelie Torgersen 36.7 19.3 193 3450 fema…
5 Adelie Torgersen 39.3 20.6 190 3650 male
6 Adelie Torgersen 38.9 17.8 181 3625 fema…
7 Adelie Torgersen 39.2 19.6 195 4675 male
8 Adelie Torgersen 41.1 17.6 182 3200 fema…
9 Adelie Torgersen 38.6 21.2 191 3800 male
10 Adelie Torgersen 34.6 21.1 198 4400 male
# ℹ 323 more rows
# ℹ 1 more variable: year <dbl>
It may be hard to see, since only the first 10 rows are printed here, but if you look at the row numbers, you can see we now have 103+113+117=333 rows after binding.
For best results, ensure input data frames have the exact same columns, with the same names and types!
8.2.2 Joining
Joining is slightly more complicated. This is for when all the rows are together, but the columns you need are spread out over several data frames. Often, this occurs when you obtain data from different sources and want to combine them to look for patterns, but it can also happen sometimes with multiple data frames gathered from the same source.
Joining is performed by first matching rows using a set of key variables before adding in the additional columns. The key variables should uniquely identify rows. Here’s a simple demo to start. Suppose you have the following data frames:
x <- tibble(
A = c("a", "b", "c"),
B = c(1, 2, 3)
)
y <- tibble(
A = c("a", "b", "d"),
C = ymd("2024.1.1") + 0:2
)
x
# A tibble: 3 × 2
A B
<chr> <dbl>
1 a 1
2 b 2
3 c 3
y
# A tibble: 3 × 2
A C
<chr> <date>
1 a 2024-01-01
2 b 2024-01-02
3 d 2024-01-03
Suppose you want to join x
and y
using column A as the “key”, i.e. “A” is the column that uniquely matches rows across data frames. There are 4 different joins you can use: inner_join()
, full_join()
, left_join()
, right_join()
, depending on which rows you want in the result:
-
inner_join(x, y)
matches rows using key variable(s), then returns only rows in bothx
andy
, -
full_join(x, y)
matches rows using key variable(s), then returns all rows in eitherx
ory
, -
left_join(x, y)
matches rows using key variable(s), then returns only rows inx
, -
right_join(x, y)
matches rows using key variable(s), then returns only rows iny
.
Below are demos of each join. Note each join has the exact same columns; the only difference is which rows are included in the output. Also note NAs
are automatically used to fill in any missing values that arise from the merge.
# only return rows in both x and y
inner_join(x, y)
# A tibble: 2 × 3
A B C
<chr> <dbl> <date>
1 a 1 2024-01-01
2 b 2 2024-01-02
# return rows in either x or y
full_join(x, y)
# A tibble: 4 × 3
A B C
<chr> <dbl> <date>
1 a 1 2024-01-01
2 b 2 2024-01-02
3 c 3 NA
4 d NA 2024-01-03
# return only rows in x
left_join(x, y)
# A tibble: 3 × 3
A B C
<chr> <dbl> <date>
1 a 1 2024-01-01
2 b 2 2024-01-02
3 c 3 NA
# return only rows in y
right_join(x, y)
# A tibble: 3 × 3
A B C
<chr> <dbl> <date>
1 a 1 2024-01-01
2 b 2 2024-01-02
3 d NA 2024-01-03
The above 4 are called the “mutating joins” since they mutate (i.e. change) the data frame by adding additional columns. There are 2 other less commonly used joins called the “filterting joins” called semi_join()
and anti_join()
that do not actually add columns; they are only used to filter rows that match or don’t match another data frame. Examples:
# semi_join(x, y) will FILTER rows in x that have matches in y
# note there are NO new columns added to x
semi_join(x, y)
# A tibble: 2 × 2
A B
<chr> <dbl>
1 a 1
2 b 2
# conversely, anti_join() will FILTER rows in x with NO match in y
# again, note NO new columns were added to x
anti_join(x, y)
# A tibble: 1 × 2
A B
<chr> <dbl>
1 c 3
These examples only have 1 key column, but the same works if you have 2 or more key columns; rows are matched if they have ALL the same values in the key columns.
When joining, R will automatically look for columns with the same name to use as key columns. Therefore, to run smoothly, it’s highly recommended to ensure the following before joining:
- Key columns should have the same name and type.
- All other non-key columns should have different names.
- When picking a join, carefully consider which rows you will actually need later.
- Generally, key values should be unique for each row. If you need to have duplicate key values, watch the joins closely to ensure the join works the way you want.
Let’s see a real example of joins in action.
8.2.3 Example: cleaning fertility
The World Bank fertility data set we used earlier obviously didn’t start out perfectly clean and tidy. When you download the CSV file from the data page it comes as a zip of the following two (renamed) files: fertility_meta.csv
and fertility_raw.csv
. One of them contains metadata for each country, while the other contains the actual annual fertility rates.
fertility_meta <- read_csv(
"https://bwu62.github.io/stat240-revamp/data/fertility_meta.csv")
fertility_raw <- read_csv(
"https://bwu62.github.io/stat240-revamp/data/fertility_raw.csv")
fertility_meta
# A tibble: 265 × 5
`Country Code` Region IncomeGroup SpecialNotes TableName
<chr> <chr> <chr> <chr> <chr>
1 ABW Latin America & Caribbean High income <NA> Aruba
2 AFE <NA> <NA> "26 countri… Africa E…
3 AFG South Asia Low income "The report… Afghanis…
4 AFW <NA> <NA> "22 countri… Africa W…
5 AGO Sub-Saharan Africa Lower middle inc… "The World … Angola
6 ALB Europe & Central Asia Upper middle inc… <NA> Albania
7 AND Europe & Central Asia High income <NA> Andorra
8 ARB <NA> <NA> "Arab World… Arab Wor…
9 ARE Middle East & North Africa High income <NA> United A…
10 ARG Latin America & Caribbean Upper middle inc… "The World … Argentina
# ℹ 255 more rows
fertility_raw
# A tibble: 266 × 68
`Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Aruba ABW Fertility rate,… SP.DYN.TFRT.IN 4.57 4.42
2 Africa Eastern and… AFE Fertility rate,… SP.DYN.TFRT.IN 6.65 6.67
3 Afghanistan AFG Fertility rate,… SP.DYN.TFRT.IN 7.28 7.28
4 Africa Western and… AFW Fertility rate,… SP.DYN.TFRT.IN 6.47 6.48
5 Angola AGO Fertility rate,… SP.DYN.TFRT.IN 6.71 6.79
6 Albania ALB Fertility rate,… SP.DYN.TFRT.IN 6.38 6.27
7 Andorra AND Fertility rate,… SP.DYN.TFRT.IN 2.54 2.54
8 Arab World ARB Fertility rate,… SP.DYN.TFRT.IN 6.92 6.97
9 United Arab Emirat… ARE Fertility rate,… SP.DYN.TFRT.IN 6.50 6.49
10 Argentina ARG Fertility rate,… SP.DYN.TFRT.IN 3.14 3.12
# ℹ 256 more rows
# ℹ 62 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
# `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
# `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
# `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
# `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, …
First, there are several columns like "SpecialNotes"
, "Indicator Name"
, and "Indicator Code"
that we don’t need; we can remove those with select()
. Country names are duplicated in both data frames, so we can pick one to use. Let’s pick "TableName"
since the names are slightly nicer (check this yourself). The rest of the columns should probably be renamed with rename()
.
We can also change the "Low income"
, "Lower middle income"
, … values into just "Low"
, "Lower middle"
, … to slightly simplify the column. Shorter labels will also be easier to work with in filter()
or in plots.
Note as well the existence of several NAs in region and income. Again, verify this yourself, but these are all further subregion groupings of countries (e.g. AFE and AFW are for Eastern/Southern and Western/Central African countries), so we can simply drop these rows.
# apply the processing steps above
fertility_meta2 <- fertility_meta %>%
rename(code = "Country Code", country = "TableName", region = "Region", income_group = "IncomeGroup") %>%
select(code, country, region, income_group) %>%
mutate(income_group = sub(" income", "", income_group)) %>%
filter(!is.na(income_group))
fertility_raw2 <- fertility_raw %>%
select(2, "1960":last_col()) %>%
rename(code = "Country Code")
print(fertility_meta2, n = 5)
# A tibble: 216 × 4
code country region income_group
<chr> <chr> <chr> <chr>
1 ABW Aruba Latin America & Caribbean High
2 AFG Afghanistan South Asia Low
3 AGO Angola Sub-Saharan Africa Lower middle
4 ALB Albania Europe & Central Asia Upper middle
5 AND Andorra Europe & Central Asia High
# ℹ 211 more rows
print(fertility_raw2, n = 5)
# A tibble: 266 × 65
code `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969` `1970`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW 4.57 4.42 4.26 4.11 3.94 3.80 3.62 3.45 3.28 3.11 2.97
2 AFE 6.65 6.67 6.69 6.71 6.72 6.74 6.77 6.78 6.78 6.78 6.79
3 AFG 7.28 7.28 7.29 7.30 7.30 7.30 7.32 7.34 7.36 7.39 7.4
4 AFW 6.47 6.48 6.49 6.50 6.52 6.53 6.56 6.58 6.61 6.63 6.66
5 AGO 6.71 6.79 6.87 6.95 7.04 7.12 7.19 7.27 7.33 7.39 7.43
# ℹ 261 more rows
# ℹ 53 more variables: `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
# `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
# `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
# `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
# `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
# `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, …
Now comes the key step! We’re going to use the "code"
columns (which are the 3-letter ISO 3166-1 country codes) as the key columns and join the two data frames together. Let’s use inner_join()
to keep only countries that appear in both, since we want to look at region and income group together with fertility rates.
Note these data frames already satisfy the recommended conditions, i.e. the key columns have the same name and type and uniquely identify each country, and all other columns have different names, so there should be no issues.
fertility_joined <- inner_join(fertility_meta2, fertility_raw2)
fertility_joined
# A tibble: 216 × 68
code country region income_group `1960` `1961` `1962` `1963` `1964` `1965` `1966`
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW Aruba Latin… High 4.57 4.42 4.26 4.11 3.94 3.80 3.62
2 AFG Afghan… South… Low 7.28 7.28 7.29 7.30 7.30 7.30 7.32
3 AGO Angola Sub-S… Lower middle 6.71 6.79 6.87 6.95 7.04 7.12 7.19
4 ALB Albania Europ… Upper middle 6.38 6.27 6.11 5.93 5.71 5.47 5.32
5 AND Andorra Europ… High 2.54 2.54 2.55 2.60 2.69 2.72 2.76
6 ARE United… Middl… High 6.50 6.49 6.49 6.50 6.49 6.50 6.50
7 ARG Argent… Latin… Upper middle 3.14 3.12 3.13 3.11 3.08 3.00 2.97
8 ARM Armenia Europ… Upper middle 4.79 4.67 4.52 4.34 4.15 3.99 3.83
9 ASM Americ… East … High 6.58 6.58 6.60 6.69 6.78 6.73 6.62
10 ATG Antigu… Latin… High 4.60 4.56 4.55 4.54 4.48 4.42 4.32
# ℹ 206 more rows
# ℹ 57 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
# `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
# `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
# `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
# `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
This already looks pretty good! Our data frames joined nicely along the key "code"
column, and now our region, income, and fertility data is in 1 single data frame instead of spread across 2 data frames.
However, you may have noticed we still have a problem: our annual fertility rates are spread out across 64 columns, with each year in its own column. This is not compatible with our tidyverse functions. How do we fix it?
8.3 Pivoting
This is where pivoting becomes relevant. Pivoting (also sometimes called reshaping) refers to the process of transforming data between different representations of it. Let’s start with a small example to better illustrate this point.
Suppose two people Alice and Bob are playing a game. They agree to play 3 rounds. Each round, each person tries to score points up to a maximum of 10. The person that wins the most rounds wins overall. Suppose these are the final scores:
# create demo scores data frame
# sample() is used here to randomly generate points
scores_long <- tibble(
round = rep(1:3, each = 2),
player = rep(c("Alice", "Bob"), 3),
points = c(9, 4, 7, 1, 2, 7)
)
scores_long
# A tibble: 6 × 3
round player points
<int> <chr> <dbl>
1 1 Alice 9
2 1 Bob 4
3 2 Alice 7
4 2 Bob 1
5 3 Alice 2
6 3 Bob 7
However, the same data can also be represented in this alternative format:
# A tibble: 3 × 3
round Alice Bob
<int> <dbl> <dbl>
1 1 9 4
2 2 7 1
3 3 2 7
Despite the difference in shape and dimensions, the data contained in these data frames is identical. These are often called the long and wide representations of data, so named because one tends to be longer (i.e. more rows) and the other wider (i.e. more columns). Generally, long formats are easier to use with R/tidyverse, whereas wide formats are easier for humans to read, but there are of course exceptions to everything.
Usually, for data to be considered “tidy” it needs to be in a longer format. Tidy data is data that satisfies the following:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.33
Data in this format will generally be easier to work with using tidyverse, and broadly speaking R, since this format is highly versatile and most functions have been designed to work best on this format.
Hopefully, it should be easy to see that scores_long
above satisfies these properties. Each variable is indeed a column (and vice versa), each row an observation (vv.), etc. It should also be evident scores_wide
is NOT tidy, since each row actually contains 2 scores, and each column is a person, not a variable (the distinction here is subtle but meaningful).
Tidyr’s pivot_longer()
and pivot_wider()
functions let you easily convert between these two representations. When used correctly, these two should be inverses of each other, i.e. you can take a data frame and pivot it longer, then pivot it wider (or vice versa) and get back the original data frame. There are many ways of using these functions (see help page) but here’s a brief guide to their most common and important usage:
-
df %>% pivot_longer(cols, names_to = "..", values_to = "..")
turns a “wider” representation into a “longer” representation.-
cols
is a set of columns where the actual observations are “spread out over”. You can set this using numbers, names, ranges, or selector functions (just like forselect()
). In thescores_wide
data frame, this would be the two columnsAlice
andBob
since this is where the points for each round are stored. -
names_to
sets the name for the new column containing the names of thecols
. In thescores_wide
data frame, it will be filled with repetitions of"Alice"
and"Bob"
corresponding to each round’s points. -
values_to
sets the name for the new column containing the values insidecols
. In thescores_wide
data frame, it will be filled with the actual points scored in each round.
# example: scores_wide %>% pivot_longer(Alice:Bob, names_to = "player", values_to = "points")
# A tibble: 6 × 3 round player points <int> <chr> <dbl> 1 1 Alice 9 2 1 Bob 4 3 2 Alice 7 4 2 Bob 1 5 3 Alice 2 6 3 Bob 7
-
-
df %>% pivot_wider(names_from = "..", values_from = "..")
turns a “longer” represetation into a “wider” representation.-
names_from
sets the column of names that will each become their own column in the wider data frame. In thescores_long
data frame, this would be theplayer
column. -
values_from
sets the column of actual observations that will fill each of the newly created wider set of columns. In thescores_long
data frame, this would be thepoints
column.
# example: scores_long %>% pivot_wider(names_from = "player", values_from = "points")
# A tibble: 3 × 3 round Alice Bob <int> <dbl> <dbl> 1 1 9 4 2 2 7 1 3 3 2 7
-
Note in the above examples, passing scores_wide
to pivot_longer()
gives us scores_long
exactly, and passing scores_long
to pivot_wider()
gives us scores_wide
exactly.
8.3.1 Example: finish cleaning fertility
Let’s turn back to the fertility_joined
example. After joining, we were left with the following data frame:
fertility_joined
# A tibble: 216 × 68
code country region income_group `1960` `1961` `1962` `1963` `1964` `1965` `1966`
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW Aruba Latin… High 4.57 4.42 4.26 4.11 3.94 3.80 3.62
2 AFG Afghan… South… Low 7.28 7.28 7.29 7.30 7.30 7.30 7.32
3 AGO Angola Sub-S… Lower middle 6.71 6.79 6.87 6.95 7.04 7.12 7.19
4 ALB Albania Europ… Upper middle 6.38 6.27 6.11 5.93 5.71 5.47 5.32
5 AND Andorra Europ… High 2.54 2.54 2.55 2.60 2.69 2.72 2.76
6 ARE United… Middl… High 6.50 6.49 6.49 6.50 6.49 6.50 6.50
7 ARG Argent… Latin… Upper middle 3.14 3.12 3.13 3.11 3.08 3.00 2.97
8 ARM Armenia Europ… Upper middle 4.79 4.67 4.52 4.34 4.15 3.99 3.83
9 ASM Americ… East … High 6.58 6.58 6.60 6.69 6.78 6.73 6.62
10 ATG Antigu… Latin… High 4.60 4.56 4.55 4.54 4.48 4.42 4.32
# ℹ 206 more rows
# ℹ 57 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
# `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
# `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
# `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
# `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
Now it should be clear all we need to do is apply pivot_longer()
:
# let's pivot_longer(), then drop any rows with missing values
# after our careful processing, it should be safe to just do drop_na()
# we can also again convert income_group to ordered categories if needed,
# and finally arrange by country and year for extra neatness
fertility <- fertility_joined %>%
pivot_longer("1960":last_col(), names_to = "year", values_to = "rate") %>%
drop_na() %>%
mutate(income_group = factor(income_group, ordered = TRUE, levels = c(
"Low", "Lower middle", "Upper middle", "High"))) %>%
arrange(country, year)
# print our final, clean & tidy data frame!
fertility
# A tibble: 13,792 × 6
code country region income_group year rate
<chr> <chr> <chr> <ord> <chr> <dbl>
1 AFG Afghanistan South Asia Low 1960 7.28
2 AFG Afghanistan South Asia Low 1961 7.28
3 AFG Afghanistan South Asia Low 1962 7.29
4 AFG Afghanistan South Asia Low 1963 7.30
5 AFG Afghanistan South Asia Low 1964 7.30
6 AFG Afghanistan South Asia Low 1965 7.30
7 AFG Afghanistan South Asia Low 1966 7.32
8 AFG Afghanistan South Asia Low 1967 7.34
9 AFG Afghanistan South Asia Low 1968 7.36
10 AFG Afghanistan South Asia Low 1969 7.39
# ℹ 13,782 more rows
Our fertility
data set is now 100% fully cleaned and ready for exploration, visualization, and modeling!
Before we move on, let’s demonstrate one more usage for pivot_wider()
. This function is also great for making human-friendly summary tables, especially when looking at how 3 variables interrelate. For example, suppose we want to look at, for each region, the percentage of countries in each income group. If we just compute the percentages and print, we get something like this:
# count how many countries in each region + income group,
# regroup by region to compute percentages,
# remove the n count column, and print
region_income_pct <- fertility %>%
count(region, income_group) %>%
group_by(region) %>%
mutate(pct = round(100 * n / sum(n), 1)) %>%
select(-n) %>%
print()
# A tibble: 22 × 3
# Groups: region [7]
region income_group pct
<chr> <ord> <dbl>
1 East Asia & Pacific Low 2.7
2 East Asia & Pacific Lower middle 32.4
3 East Asia & Pacific Upper middle 24.3
4 East Asia & Pacific High 40.5
5 Europe & Central Asia Lower middle 5.2
6 Europe & Central Asia Upper middle 25.9
7 Europe & Central Asia High 68.9
8 Latin America & Caribbean Lower middle 9.8
9 Latin America & Caribbean Upper middle 46.3
10 Latin America & Caribbean High 43.9
# ℹ 12 more rows
This would be great for plotting if that’s what we wanted to do, e.g:
# plot income group percentages by region
# str_wrap used to line-break long region names
region_income_pct %>%
ggplot(aes(x = str_wrap(region, 12), y = pct, fill = income_group)) +
geom_col() + labs(x = NULL, y = "Percent", fill = "Income level",
title = "% countries in each income level by region")
However, we can also see the numeric values laid out as a table by using pivot_wider()
to put each income_group
in its own column:
# pivot region/income percentages to wider table format
region_income_pct %>%
pivot_wider(names_from = income_group, values_from = pct)
# A tibble: 7 × 5
# Groups: region [7]
region Low `Lower middle` `Upper middle` High
<chr> <dbl> <dbl> <dbl> <dbl>
1 East Asia & Pacific 2.7 32.4 24.3 40.5
2 Europe & Central Asia NA 5.2 25.9 68.9
3 Latin America & Caribbean NA 9.8 46.3 43.9
4 Middle East & North Africa 9.7 31.8 19.5 39
5 North America NA NA NA 100
6 South Asia 12.5 75 12.5 NA
7 Sub-Saharan Africa 45.8 39.6 12.5 2.1
The NAs are due to a lack of countries in that region/income combination. In this case, we can fill in missing values by setting the values_fill
argument:
# fill in missing values with 0 in region/income table
region_income_pct %>%
pivot_wider(names_from = income_group, values_from = pct, values_fill = 0)
# A tibble: 7 × 5
# Groups: region [7]
region Low `Lower middle` `Upper middle` High
<chr> <dbl> <dbl> <dbl> <dbl>
1 East Asia & Pacific 2.7 32.4 24.3 40.5
2 Europe & Central Asia 0 5.2 25.9 68.9
3 Latin America & Caribbean 0 9.8 46.3 43.9
4 Middle East & North Africa 9.7 31.8 19.5 39
5 North America 0 0 0 100
6 South Asia 12.5 75 12.5 0
7 Sub-Saharan Africa 45.8 39.6 12.5 2.1