8 Advanced Operations
In this chapter, we will cover a few more advanced, yet incredibly useful data tidying operations like grouping, joining, binding, and pivoting. Along the way, we will also make extensive use of dplyr functions learned in the previous chapter.
8.1 Grouping
Often, you will need to apply dplyr’s various operations like mutate()
, summarize()
, or slicing function not across the entire dataset but in groups. This is an important technique across data science, whether it’s data cleaning to exploration to visualization to modeling.
By default, data frames are not grouped when created or imported. You can create a grouping structure with the group_by()
function. The basic syntax is df %>% group_by(col1, col2, ...)
where col1
, col2
, … are variables whose values are used to determine groups. You can group by just 1 variable, 2 variables, or as many variables as needed. Rows with the same values in the chosen columns will be grouped together.
After grouping, operations that normally run across all rows now run across each group. Here’s a few simple examples using the familiar penguins dataset to start:
# import tidyverse, import magrittr (for extra %T>% pipe),
# tweak some readr/ggplot options (optional), and load dataset
library(tidyverse)
library(magrittr)
options(readr.show_col_types = FALSE)
source("https://bwu62.github.io/stat240-revamp/ggplot_theme_options.R")
penguins <- read_csv("https://bwu62.github.io/stat240-revamp/data/penguins.csv")
# group by species and get mean/median/sd body mass + sample size of each group
penguins %>%
group_by(species) %>%
summarize(
mean_mass = mean(body_mass_g),
median_mass = median(body_mass_g),
sd_mass = sd(body_mass_g),
n = n()
)
# A tibble: 3 × 5
species mean_mass median_mass sd_mass n
<chr> <dbl> <dbl> <dbl> <int>
1 Adelie 3706. 3700 459. 146
2 Chinstrap 3733. 3700 384. 68
3 Gentoo 5092. 5050 501. 119
# we can also group by multiple, e.g. group by spcies + sex
penguins %>%
group_by(species, sex) %>%
summarize(
mean_mass = mean(body_mass_g),
median_mass = median(body_mass_g),
sd_mass = sd(body_mass_g),
n = n()
)
# A tibble: 6 × 6
# Groups: species [3]
species sex mean_mass median_mass sd_mass n
<chr> <chr> <dbl> <dbl> <dbl> <int>
1 Adelie female 3369. 3400 269. 73
2 Adelie male 4043. 4000 347. 73
3 Chinstrap female 3527. 3550 285. 34
4 Chinstrap male 3939. 3950 362. 34
5 Gentoo female 4680. 4700 282. 58
6 Gentoo male 5485. 5500 313. 61
# we can of course also mutate within groups
# e.g. convert bill length in mm to number of SDs
# away from species group mean
# to better show the result, I'm forcing it to print all rows in order
# but hiding the output in a collapsible box for style
penguins %>%
select(species, sex, bill_length_mm) %>%
group_by(species) %>%
mutate(
n = n(),
bill_length_std = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm)
) %>%
arrange(species, bill_length_std) %>%
print(n = Inf)
# A tibble: 333 × 5
# Groups: species [3]
species sex bill_length_mm n bill_length_std
<chr> <chr> <dbl> <int> <dbl>
1 Adelie female 32.1 146 -2.53
2 Adelie female 33.1 146 -2.15
3 Adelie female 33.5 146 -2.00
4 Adelie female 34 146 -1.81
5 Adelie female 34.4 146 -1.66
6 Adelie female 34.5 146 -1.62
7 Adelie male 34.6 146 -1.59
8 Adelie female 34.6 146 -1.59
9 Adelie female 35 146 -1.44
10 Adelie female 35 146 -1.44
11 Adelie male 35.1 146 -1.40
12 Adelie female 35.2 146 -1.36
13 Adelie female 35.3 146 -1.32
14 Adelie female 35.5 146 -1.25
15 Adelie female 35.5 146 -1.25
16 Adelie female 35.6 146 -1.21
17 Adelie female 35.7 146 -1.17
18 Adelie female 35.7 146 -1.17
19 Adelie female 35.7 146 -1.17
20 Adelie female 35.9 146 -1.10
21 Adelie female 35.9 146 -1.10
22 Adelie female 36 146 -1.06
23 Adelie female 36 146 -1.06
24 Adelie female 36 146 -1.06
25 Adelie female 36 146 -1.06
26 Adelie female 36.2 146 -0.985
27 Adelie female 36.2 146 -0.985
28 Adelie female 36.2 146 -0.985
29 Adelie male 36.3 146 -0.948
30 Adelie female 36.4 146 -0.910
31 Adelie female 36.4 146 -0.910
32 Adelie female 36.5 146 -0.873
33 Adelie female 36.5 146 -0.873
34 Adelie female 36.6 146 -0.835
35 Adelie female 36.6 146 -0.835
36 Adelie female 36.7 146 -0.798
37 Adelie female 36.7 146 -0.798
38 Adelie female 36.8 146 -0.760
39 Adelie female 36.9 146 -0.723
40 Adelie female 37 146 -0.685
41 Adelie female 37 146 -0.685
42 Adelie male 37.2 146 -0.610
43 Adelie male 37.2 146 -0.610
44 Adelie female 37.3 146 -0.572
45 Adelie male 37.3 146 -0.572
46 Adelie female 37.3 146 -0.572
47 Adelie male 37.5 146 -0.497
48 Adelie female 37.6 146 -0.460
49 Adelie male 37.6 146 -0.460
50 Adelie female 37.6 146 -0.460
51 Adelie male 37.7 146 -0.422
52 Adelie female 37.7 146 -0.422
53 Adelie male 37.7 146 -0.422
54 Adelie female 37.8 146 -0.385
55 Adelie male 37.8 146 -0.385
56 Adelie male 37.8 146 -0.385
57 Adelie female 37.9 146 -0.347
58 Adelie female 37.9 146 -0.347
59 Adelie female 38.1 146 -0.272
60 Adelie female 38.1 146 -0.272
61 Adelie female 38.1 146 -0.272
62 Adelie female 38.1 146 -0.272
63 Adelie male 38.2 146 -0.234
64 Adelie male 38.2 146 -0.234
65 Adelie male 38.3 146 -0.197
66 Adelie female 38.5 146 -0.122
67 Adelie male 38.6 146 -0.0841
68 Adelie female 38.6 146 -0.0841
69 Adelie female 38.6 146 -0.0841
70 Adelie female 38.7 146 -0.0466
71 Adelie male 38.8 146 -0.00900
72 Adelie male 38.8 146 -0.00900
73 Adelie female 38.8 146 -0.00900
74 Adelie female 38.9 146 0.0286
75 Adelie female 38.9 146 0.0286
76 Adelie female 39 146 0.0661
77 Adelie female 39 146 0.0661
78 Adelie male 39 146 0.0661
79 Adelie male 39.1 146 0.104
80 Adelie male 39.2 146 0.141
81 Adelie male 39.2 146 0.141
82 Adelie male 39.2 146 0.141
83 Adelie male 39.3 146 0.179
84 Adelie female 39.5 146 0.254
85 Adelie female 39.5 146 0.254
86 Adelie female 39.5 146 0.254
87 Adelie male 39.6 146 0.291
88 Adelie female 39.6 146 0.291
89 Adelie female 39.6 146 0.291
90 Adelie male 39.6 146 0.291
91 Adelie female 39.6 146 0.291
92 Adelie male 39.7 146 0.329
93 Adelie male 39.7 146 0.329
94 Adelie female 39.7 146 0.329
95 Adelie male 39.7 146 0.329
96 Adelie male 39.8 146 0.367
97 Adelie male 40.1 146 0.479
98 Adelie female 40.2 146 0.517
99 Adelie male 40.2 146 0.517
100 Adelie female 40.2 146 0.517
101 Adelie female 40.3 146 0.554
102 Adelie male 40.3 146 0.554
103 Adelie female 40.5 146 0.629
104 Adelie male 40.5 146 0.629
105 Adelie male 40.6 146 0.667
106 Adelie male 40.6 146 0.667
107 Adelie male 40.6 146 0.667
108 Adelie male 40.6 146 0.667
109 Adelie male 40.7 146 0.705
110 Adelie male 40.8 146 0.742
111 Adelie male 40.8 146 0.742
112 Adelie male 40.9 146 0.780
113 Adelie female 40.9 146 0.780
114 Adelie male 41 146 0.817
115 Adelie female 41.1 146 0.855
116 Adelie male 41.1 146 0.855
117 Adelie male 41.1 146 0.855
118 Adelie male 41.1 146 0.855
119 Adelie male 41.1 146 0.855
120 Adelie male 41.1 146 0.855
121 Adelie male 41.1 146 0.855
122 Adelie male 41.3 146 0.930
123 Adelie male 41.3 146 0.930
124 Adelie male 41.4 146 0.967
125 Adelie male 41.4 146 0.967
126 Adelie male 41.5 146 1.01
127 Adelie male 41.5 146 1.01
128 Adelie male 41.6 146 1.04
129 Adelie male 41.8 146 1.12
130 Adelie male 42 146 1.19
131 Adelie male 42.1 146 1.23
132 Adelie female 42.2 146 1.27
133 Adelie male 42.2 146 1.27
134 Adelie male 42.3 146 1.31
135 Adelie male 42.5 146 1.38
136 Adelie male 42.7 146 1.46
137 Adelie male 42.8 146 1.49
138 Adelie male 42.9 146 1.53
139 Adelie male 43.1 146 1.61
140 Adelie male 43.2 146 1.64
141 Adelie male 43.2 146 1.64
142 Adelie male 44.1 146 1.98
143 Adelie male 44.1 146 1.98
144 Adelie male 45.6 146 2.54
145 Adelie male 45.8 146 2.62
146 Adelie male 46 146 2.70
147 Chinstrap female 40.9 68 -2.38
148 Chinstrap female 42.4 68 -1.93
149 Chinstrap female 42.5 68 -1.90
150 Chinstrap female 42.5 68 -1.90
151 Chinstrap female 43.2 68 -1.69
152 Chinstrap female 43.5 68 -1.60
153 Chinstrap female 45.2 68 -1.09
154 Chinstrap female 45.2 68 -1.09
155 Chinstrap female 45.4 68 -1.03
156 Chinstrap female 45.5 68 -0.998
157 Chinstrap female 45.6 68 -0.968
158 Chinstrap female 45.7 68 -0.938
159 Chinstrap female 45.7 68 -0.938
160 Chinstrap female 45.9 68 -0.879
161 Chinstrap female 46 68 -0.849
162 Chinstrap female 46.1 68 -0.819
163 Chinstrap female 46.2 68 -0.789
164 Chinstrap female 46.4 68 -0.729
165 Chinstrap female 46.4 68 -0.729
166 Chinstrap female 46.5 68 -0.699
167 Chinstrap female 46.6 68 -0.669
168 Chinstrap female 46.7 68 -0.639
169 Chinstrap female 46.8 68 -0.609
170 Chinstrap female 46.9 68 -0.579
171 Chinstrap female 47 68 -0.549
172 Chinstrap female 47.5 68 -0.399
173 Chinstrap female 47.6 68 -0.369
174 Chinstrap female 48.1 68 -0.220
175 Chinstrap male 48.5 68 -0.100
176 Chinstrap male 49 68 0.0498
177 Chinstrap male 49 68 0.0498
178 Chinstrap male 49.2 68 0.110
179 Chinstrap male 49.3 68 0.140
180 Chinstrap male 49.5 68 0.199
181 Chinstrap male 49.6 68 0.229
182 Chinstrap male 49.7 68 0.259
183 Chinstrap female 49.8 68 0.289
184 Chinstrap male 50 68 0.349
185 Chinstrap female 50.1 68 0.379
186 Chinstrap male 50.2 68 0.409
187 Chinstrap female 50.2 68 0.409
188 Chinstrap male 50.3 68 0.439
189 Chinstrap male 50.5 68 0.499
190 Chinstrap female 50.5 68 0.499
191 Chinstrap male 50.6 68 0.529
192 Chinstrap male 50.7 68 0.559
193 Chinstrap male 50.8 68 0.589
194 Chinstrap male 50.8 68 0.589
195 Chinstrap male 50.9 68 0.619
196 Chinstrap female 50.9 68 0.619
197 Chinstrap male 51 68 0.649
198 Chinstrap male 51.3 68 0.739
199 Chinstrap male 51.3 68 0.739
200 Chinstrap male 51.3 68 0.739
201 Chinstrap male 51.4 68 0.768
202 Chinstrap male 51.5 68 0.798
203 Chinstrap male 51.7 68 0.858
204 Chinstrap male 51.9 68 0.918
205 Chinstrap male 52 68 0.948
206 Chinstrap male 52 68 0.948
207 Chinstrap male 52 68 0.948
208 Chinstrap male 52.2 68 1.01
209 Chinstrap male 52.7 68 1.16
210 Chinstrap male 52.8 68 1.19
211 Chinstrap male 53.5 68 1.40
212 Chinstrap male 54.2 68 1.61
213 Chinstrap male 55.8 68 2.09
214 Chinstrap female 58 68 2.74
215 Gentoo female 40.9 119 -2.15
216 Gentoo female 41.7 119 -1.89
217 Gentoo female 42 119 -1.79
218 Gentoo female 42.6 119 -1.60
219 Gentoo female 42.7 119 -1.57
220 Gentoo female 42.8 119 -1.54
221 Gentoo female 42.9 119 -1.50
222 Gentoo female 43.2 119 -1.41
223 Gentoo female 43.3 119 -1.37
224 Gentoo female 43.3 119 -1.37
225 Gentoo female 43.4 119 -1.34
226 Gentoo female 43.5 119 -1.31
227 Gentoo female 43.5 119 -1.31
228 Gentoo female 43.6 119 -1.28
229 Gentoo female 43.8 119 -1.21
230 Gentoo female 44 119 -1.15
231 Gentoo male 44.4 119 -1.02
232 Gentoo female 44.5 119 -0.988
233 Gentoo female 44.9 119 -0.859
234 Gentoo female 44.9 119 -0.859
235 Gentoo male 45 119 -0.827
236 Gentoo female 45.1 119 -0.795
237 Gentoo female 45.1 119 -0.795
238 Gentoo female 45.1 119 -0.795
239 Gentoo male 45.2 119 -0.762
240 Gentoo female 45.2 119 -0.762
241 Gentoo male 45.2 119 -0.762
242 Gentoo female 45.2 119 -0.762
243 Gentoo female 45.3 119 -0.730
244 Gentoo female 45.3 119 -0.730
245 Gentoo female 45.4 119 -0.698
246 Gentoo female 45.5 119 -0.666
247 Gentoo female 45.5 119 -0.666
248 Gentoo male 45.5 119 -0.666
249 Gentoo female 45.5 119 -0.666
250 Gentoo female 45.7 119 -0.601
251 Gentoo female 45.8 119 -0.569
252 Gentoo female 45.8 119 -0.569
253 Gentoo female 46.1 119 -0.473
254 Gentoo male 46.1 119 -0.473
255 Gentoo female 46.2 119 -0.440
256 Gentoo male 46.2 119 -0.440
257 Gentoo female 46.2 119 -0.440
258 Gentoo male 46.3 119 -0.408
259 Gentoo male 46.4 119 -0.376
260 Gentoo female 46.4 119 -0.376
261 Gentoo female 46.5 119 -0.344
262 Gentoo female 46.5 119 -0.344
263 Gentoo female 46.5 119 -0.344
264 Gentoo female 46.5 119 -0.344
265 Gentoo female 46.6 119 -0.312
266 Gentoo male 46.7 119 -0.279
267 Gentoo male 46.8 119 -0.247
268 Gentoo male 46.8 119 -0.247
269 Gentoo female 46.8 119 -0.247
270 Gentoo female 46.9 119 -0.215
271 Gentoo female 47.2 119 -0.118
272 Gentoo female 47.2 119 -0.118
273 Gentoo male 47.3 119 -0.0863
274 Gentoo female 47.4 119 -0.0541
275 Gentoo female 47.5 119 -0.0219
276 Gentoo female 47.5 119 -0.0219
277 Gentoo female 47.5 119 -0.0219
278 Gentoo male 47.6 119 0.0103
279 Gentoo female 47.7 119 0.0425
280 Gentoo male 47.8 119 0.0747
281 Gentoo male 48.1 119 0.171
282 Gentoo female 48.2 119 0.203
283 Gentoo male 48.2 119 0.203
284 Gentoo male 48.4 119 0.268
285 Gentoo male 48.4 119 0.268
286 Gentoo female 48.4 119 0.268
287 Gentoo male 48.5 119 0.300
288 Gentoo female 48.5 119 0.300
289 Gentoo male 48.6 119 0.332
290 Gentoo female 48.7 119 0.364
291 Gentoo male 48.7 119 0.364
292 Gentoo male 48.7 119 0.364
293 Gentoo male 48.8 119 0.397
294 Gentoo male 49 119 0.461
295 Gentoo female 49.1 119 0.493
296 Gentoo female 49.1 119 0.493
297 Gentoo male 49.1 119 0.493
298 Gentoo male 49.2 119 0.525
299 Gentoo male 49.3 119 0.558
300 Gentoo male 49.4 119 0.590
301 Gentoo male 49.5 119 0.622
302 Gentoo male 49.5 119 0.622
303 Gentoo male 49.6 119 0.654
304 Gentoo male 49.6 119 0.654
305 Gentoo male 49.8 119 0.719
306 Gentoo male 49.8 119 0.719
307 Gentoo male 49.9 119 0.751
308 Gentoo male 50 119 0.783
309 Gentoo male 50 119 0.783
310 Gentoo male 50 119 0.783
311 Gentoo male 50 119 0.783
312 Gentoo male 50.1 119 0.815
313 Gentoo male 50.2 119 0.847
314 Gentoo male 50.4 119 0.912
315 Gentoo male 50.4 119 0.912
316 Gentoo male 50.5 119 0.944
317 Gentoo male 50.5 119 0.944
318 Gentoo female 50.5 119 0.944
319 Gentoo male 50.7 119 1.01
320 Gentoo male 50.8 119 1.04
321 Gentoo male 50.8 119 1.04
322 Gentoo male 51.1 119 1.14
323 Gentoo male 51.1 119 1.14
324 Gentoo male 51.3 119 1.20
325 Gentoo male 51.5 119 1.27
326 Gentoo male 52.1 119 1.46
327 Gentoo male 52.2 119 1.49
328 Gentoo male 52.5 119 1.59
329 Gentoo male 53.4 119 1.88
330 Gentoo male 54.3 119 2.17
331 Gentoo male 55.1 119 2.42
332 Gentoo male 55.9 119 2.68
333 Gentoo male 59.6 119 3.87
# get the largest 3 penguins by bill depth from each species
penguins %>%
group_by(species) %>%
slice_max(bill_depth_mm, n = 3)
# A tibble: 9 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgers… 46 21.5 194 4200 male
2 Adelie Torgers… 38.6 21.2 191 3800 male
3 Adelie Dream 42.3 21.2 191 4150 male
4 Chinstrap Dream 54.2 20.8 201 4300 male
5 Chinstrap Dream 52 20.7 210 4800 male
6 Chinstrap Dream 51.7 20.3 194 3775 male
7 Gentoo Biscoe 44.4 17.3 219 5250 male
8 Gentoo Biscoe 50.8 17.3 228 5600 male
9 Gentoo Biscoe 52.2 17.1 228 5400 male
# ℹ 1 more variable: year <dbl>
8.1.1 Regrouping
Sometimes one group_by()
may not be enough to get what you need; you may need to re-group_by()
by something else to finish the job. For example, suppose you want to see what percent of each species came from different islands. This requires two uses of group_by()
:
# first group by species + island and summarize to get size of each group,
# then regroup by species and MUTATE (not summarize) totals for each species,
# then divide these to get proportion of each species from each island
penguins %>%
group_by(species, island) %>%
summarize(n = n()) %>%
group_by(species) %>%
mutate(
species_total = sum(n),
pct_of_species_from_island = n / species_total * 100
)
# A tibble: 5 × 5
# Groups: species [3]
species island n species_total pct_of_species_from_island
<chr> <chr> <int> <int> <dbl>
1 Adelie Biscoe 44 146 30.1
2 Adelie Dream 55 146 37.7
3 Adelie Torgersen 47 146 32.2
4 Chinstrap Dream 68 68 100
5 Gentoo Biscoe 119 119 100
This combination of df %>% group_by(...) %>% summarize(n = n())
is so common, we have this shortcut for it: df %>% count(...)
. We can demonstrate this in another example involving regrouping. Suppose we want to know what percent of each species were male/female:
# again, first group by species + sex and get size using the shortcut count(),
# then regroup by species and MUTATE totals for each species,
# then divide to get proportions of each species that were male/female
penguins %>%
count(species, sex) %>%
group_by(species) %>%
mutate(
species_total = sum(n),
pct_of_species_each_sex = n / species_total * 100
)
# A tibble: 6 × 5
# Groups: species [3]
species sex n species_total pct_of_species_each_sex
<chr> <chr> <int> <int> <dbl>
1 Adelie female 73 146 50
2 Adelie male 73 146 50
3 Chinstrap female 34 68 50
4 Chinstrap male 34 68 50
5 Gentoo female 58 119 48.7
6 Gentoo male 61 119 51.3
Groups are also useful for prepping data frames for plotting. For example, here’s a chunk that produces a bar plot showing how mean body mass changes by species and sex:
# get mean body mass by species + sex and plot
# the %T>% is a special pipe called a Tee pipe
# it's a shortcut for piping something into 2 different operations,
# useful for example when you want to print a data frame, then also plot it
# see https://magrittr.tidyverse.org/reference/tee.html for more
penguins %>%
group_by(species, sex) %>%
summarize(mean_mass = mean(body_mass_g)) %T>% print %>%
ggplot(aes(x = species, y = mean_mass, fill = sex)) +
geom_col(position = "dodge2") +
labs(x = "Species", y = "Mean body mass (g)",
title = "Mean body mass of Palmer penguins by species + sex")
# A tibble: 6 × 3
# Groups: species [3]
species sex mean_mass
<chr> <chr> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
8.1.2 Ungrouping
Many operations output grouped data frames. For example, look closely at the output of the previous chunks and you’ll see # Groups: species [3]
in most of them. This means any further operations you run will continue to execute in a grouped way.
You can remove the grouping structure with ungroup()
. This allows you to revert to running operations on the entire data frame. Example:
# get count of each species + sex combination,
# but this time get its percentage out of ALL observations
penguins %>%
count(species, sex) %>%
ungroup() %>%
mutate(pct_of_all = n / sum(n))
# A tibble: 6 × 4
species sex n pct_of_all
<chr> <chr> <int> <dbl>
1 Adelie female 73 0.219
2 Adelie male 73 0.219
3 Chinstrap female 34 0.102
4 Chinstrap male 34 0.102
5 Gentoo female 58 0.174
6 Gentoo male 61 0.183
8.1.3 More practice! (fertility
data)
Let’s give the penguins dataset a rest and practice dplyr and grouping a bit more with a different dataset. The following chunk imports fertility.csv
, the cleaned global fertility data set from World Bank, giving the average number of births per woman for each year and country from 1960 to present. For the past few decades, global fertility has been sharply declining for most countries. Many countries are now below the replacement rate of 2.1, leading to widespread concerns of a population collapse in the latter part of the 21st century.
fertility <- read_csv("https://bwu62.github.io/stat240-revamp/data/fertility.csv")
fertility
# A tibble: 13,050 × 6
code country region income_group year rate
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 AFG Afghanistan South Asia Low 1960 7.28
2 AFG Afghanistan South Asia Low 1961 7.28
3 AFG Afghanistan South Asia Low 1962 7.29
4 AFG Afghanistan South Asia Low 1963 7.30
5 AFG Afghanistan South Asia Low 1964 7.30
6 AFG Afghanistan South Asia Low 1965 7.30
7 AFG Afghanistan South Asia Low 1966 7.32
8 AFG Afghanistan South Asia Low 1967 7.34
9 AFG Afghanistan South Asia Low 1968 7.36
10 AFG Afghanistan South Asia Low 1969 7.39
# ℹ 13,040 more rows
Each country has over 60 years of data in this dataset. We can see an overview which countries are represented in the dataset and what their listed region and income group are by temporarily dropping years, removing duplicates, and printing the full output below in a collapsible box:
# A tibble: 211 × 3
country region income_group
<chr> <chr> <chr>
1 Afghanistan South Asia Low
2 Albania Europe & Central Asia Upper middle
3 Algeria Middle East & North Africa Upper middle
4 Andorra Europe & Central Asia High
5 Angola Sub-Saharan Africa Lower middle
6 Antigua & Barbuda Latin America & Caribbean High
7 Argentina Latin America & Caribbean Upper middle
8 Armenia Europe & Central Asia Upper middle
9 Aruba Latin America & Caribbean High
10 Australia East Asia & Pacific High
11 Austria Europe & Central Asia High
12 Azerbaijan Europe & Central Asia Upper middle
13 Bahamas Latin America & Caribbean High
14 Bahrain Middle East & North Africa High
15 Bangladesh South Asia Lower middle
16 Barbados Latin America & Caribbean High
17 Belarus Europe & Central Asia Upper middle
18 Belgium Europe & Central Asia High
19 Belize Latin America & Caribbean Upper middle
20 Benin Sub-Saharan Africa Lower middle
21 Bermuda North America High
22 Bhutan South Asia Lower middle
23 Bolivia Latin America & Caribbean Lower middle
24 Bosnia & Herzegovina Europe & Central Asia Upper middle
25 Botswana Sub-Saharan Africa Upper middle
26 Brazil Latin America & Caribbean Upper middle
27 Brunei East Asia & Pacific High
28 Bulgaria Europe & Central Asia High
29 Burkina Faso Sub-Saharan Africa Low
30 Burundi Sub-Saharan Africa Low
31 Cambodia East Asia & Pacific Lower middle
32 Cameroon Sub-Saharan Africa Lower middle
33 Canada North America High
34 Cape Verde Sub-Saharan Africa Lower middle
35 Central African Republic Sub-Saharan Africa Low
36 Chad Sub-Saharan Africa Low
37 Channel Islands Europe & Central Asia High
38 Chile Latin America & Caribbean High
39 China East Asia & Pacific Upper middle
40 Colombia Latin America & Caribbean Upper middle
41 Comoros Sub-Saharan Africa Lower middle
42 Congo, Dem. Rep. Sub-Saharan Africa Low
43 Congo, Rep. Sub-Saharan Africa Lower middle
44 Costa Rica Latin America & Caribbean Upper middle
45 Croatia Europe & Central Asia High
46 Cuba Latin America & Caribbean Upper middle
47 Curaçao Latin America & Caribbean High
48 Cyprus Europe & Central Asia High
49 Czechia Europe & Central Asia High
50 Denmark Europe & Central Asia High
51 Djibouti Middle East & North Africa Lower middle
52 Dominica Latin America & Caribbean Upper middle
53 Dominican Republic Latin America & Caribbean Upper middle
54 East Timor East Asia & Pacific Lower middle
55 Ecuador Latin America & Caribbean Upper middle
56 Egypt Middle East & North Africa Lower middle
57 El Salvador Latin America & Caribbean Upper middle
58 Equatorial Guinea Sub-Saharan Africa Upper middle
59 Eritrea Sub-Saharan Africa Low
60 Estonia Europe & Central Asia High
61 Eswatini Sub-Saharan Africa Lower middle
62 Ethiopia Sub-Saharan Africa Low
63 Faroe Islands Europe & Central Asia High
64 Fiji East Asia & Pacific Upper middle
65 Finland Europe & Central Asia High
66 France Europe & Central Asia High
67 French Polynesia East Asia & Pacific High
68 Gabon Sub-Saharan Africa Upper middle
69 Gambia Sub-Saharan Africa Low
70 Georgia Europe & Central Asia Upper middle
71 Germany Europe & Central Asia High
72 Ghana Sub-Saharan Africa Lower middle
73 Gibraltar Europe & Central Asia High
74 Greece Europe & Central Asia High
75 Greenland Europe & Central Asia High
76 Grenada Latin America & Caribbean Upper middle
77 Guam East Asia & Pacific High
78 Guatemala Latin America & Caribbean Upper middle
79 Guinea Sub-Saharan Africa Lower middle
80 Guinea-Bissau Sub-Saharan Africa Low
81 Guyana Latin America & Caribbean High
82 Haiti Latin America & Caribbean Lower middle
83 Honduras Latin America & Caribbean Lower middle
84 Hong Kong East Asia & Pacific High
85 Hungary Europe & Central Asia High
86 Iceland Europe & Central Asia High
87 India South Asia Lower middle
88 Indonesia East Asia & Pacific Upper middle
89 Iran Middle East & North Africa Upper middle
90 Iraq Middle East & North Africa Upper middle
91 Ireland Europe & Central Asia High
92 Isle of Man Europe & Central Asia High
93 Israel Middle East & North Africa High
94 Italy Europe & Central Asia High
95 Ivory Coast Sub-Saharan Africa Lower middle
96 Jamaica Latin America & Caribbean Upper middle
97 Japan East Asia & Pacific High
98 Jordan Middle East & North Africa Lower middle
99 Kazakhstan Europe & Central Asia Upper middle
100 Kenya Sub-Saharan Africa Lower middle
101 Kiribati East Asia & Pacific Lower middle
102 Kosovo Europe & Central Asia Upper middle
103 Kuwait Middle East & North Africa High
104 Kyrgyzstan Europe & Central Asia Lower middle
105 Laos East Asia & Pacific Lower middle
106 Latvia Europe & Central Asia High
107 Lebanon Middle East & North Africa Lower middle
108 Lesotho Sub-Saharan Africa Lower middle
109 Liberia Sub-Saharan Africa Low
110 Libya Middle East & North Africa Upper middle
111 Liechtenstein Europe & Central Asia High
112 Lithuania Europe & Central Asia High
113 Luxembourg Europe & Central Asia High
114 Macao East Asia & Pacific High
115 Madagascar Sub-Saharan Africa Low
116 Malawi Sub-Saharan Africa Low
117 Malaysia East Asia & Pacific Upper middle
118 Maldives South Asia Upper middle
119 Mali Sub-Saharan Africa Low
120 Malta Middle East & North Africa High
121 Marshall Islands East Asia & Pacific Upper middle
122 Mauritania Sub-Saharan Africa Lower middle
123 Mauritius Sub-Saharan Africa Upper middle
124 Mexico Latin America & Caribbean Upper middle
125 Micronesia East Asia & Pacific Lower middle
126 Moldova Europe & Central Asia Upper middle
127 Mongolia East Asia & Pacific Upper middle
128 Montenegro Europe & Central Asia Upper middle
129 Morocco Middle East & North Africa Lower middle
130 Mozambique Sub-Saharan Africa Low
131 Myanmar East Asia & Pacific Lower middle
132 Namibia Sub-Saharan Africa Upper middle
133 Nauru East Asia & Pacific High
134 Nepal South Asia Lower middle
135 Netherlands Europe & Central Asia High
136 New Caledonia East Asia & Pacific High
137 New Zealand East Asia & Pacific High
138 Nicaragua Latin America & Caribbean Lower middle
139 Niger Sub-Saharan Africa Low
140 Nigeria Sub-Saharan Africa Lower middle
141 North Korea East Asia & Pacific Low
142 North Macedonia Europe & Central Asia Upper middle
143 Norway Europe & Central Asia High
144 Oman Middle East & North Africa High
145 Pakistan South Asia Lower middle
146 Palau East Asia & Pacific High
147 Panama Latin America & Caribbean High
148 Papua New Guinea East Asia & Pacific Lower middle
149 Paraguay Latin America & Caribbean Upper middle
150 Peru Latin America & Caribbean Upper middle
151 Philippines East Asia & Pacific Lower middle
152 Poland Europe & Central Asia High
153 Portugal Europe & Central Asia High
154 Puerto Rico Latin America & Caribbean High
155 Qatar Middle East & North Africa High
156 Romania Europe & Central Asia High
157 Russia Europe & Central Asia High
158 Rwanda Sub-Saharan Africa Low
159 Samoa East Asia & Pacific Lower middle
160 San Marino Europe & Central Asia High
161 Saudi Arabia Middle East & North Africa High
162 Senegal Sub-Saharan Africa Lower middle
163 Serbia Europe & Central Asia Upper middle
164 Seychelles Sub-Saharan Africa High
165 Sierra Leone Sub-Saharan Africa Low
166 Singapore East Asia & Pacific High
167 Sint Maarten Latin America & Caribbean High
168 Slovakia Europe & Central Asia High
169 Slovenia Europe & Central Asia High
170 Solomon Islands East Asia & Pacific Lower middle
171 Somalia Sub-Saharan Africa Low
172 South Africa Sub-Saharan Africa Upper middle
173 South Korea East Asia & Pacific High
174 South Sudan Sub-Saharan Africa Low
175 Spain Europe & Central Asia High
176 Sri Lanka South Asia Lower middle
177 St. Kitts & Nevis Latin America & Caribbean High
178 St. Lucia Latin America & Caribbean Upper middle
179 St. Martin Latin America & Caribbean High
180 St. Vincent & Grenadines Latin America & Caribbean Upper middle
181 Sudan Sub-Saharan Africa Low
182 Suriname Latin America & Caribbean Upper middle
183 Sweden Europe & Central Asia High
184 Switzerland Europe & Central Asia High
185 Syria Middle East & North Africa Low
186 São Tomé & Principe Sub-Saharan Africa Lower middle
187 Tajikistan Europe & Central Asia Lower middle
188 Tanzania Sub-Saharan Africa Lower middle
189 Thailand East Asia & Pacific Upper middle
190 Togo Sub-Saharan Africa Low
191 Tonga East Asia & Pacific Upper middle
192 Trinidad & Tobago Latin America & Caribbean High
193 Tunisia Middle East & North Africa Lower middle
194 Turkey Europe & Central Asia Upper middle
195 Turkmenistan Europe & Central Asia Upper middle
196 Turks & Caicos Islands Latin America & Caribbean High
197 Tuvalu East Asia & Pacific Upper middle
198 Uganda Sub-Saharan Africa Low
199 Ukraine Europe & Central Asia Upper middle
200 United Arab Emirates Middle East & North Africa High
201 United Kingdom Europe & Central Asia High
202 United States North America High
203 Uruguay Latin America & Caribbean High
204 Uzbekistan Europe & Central Asia Lower middle
205 Vanuatu East Asia & Pacific Lower middle
206 Vietnam East Asia & Pacific Lower middle
207 Virgin Islands Latin America & Caribbean High
208 West Bank & Gaza Middle East & North Africa Lower middle
209 Yemen Middle East & North Africa Low
210 Zambia Sub-Saharan Africa Lower middle
211 Zimbabwe Sub-Saharan Africa Lower middle
One additional small processing step we should do before continuing is convert income_group
to an ordered factor (see section 3.9.5), which will be important later.
fertility <- fertility %>% mutate(
income_group = factor(income_group, ordered = TRUE, levels = c(
"Low", "Lower middle", "Upper middle", "High"))
)
We can begin by running a few summaries to explore the dataset. To start, here’s a chunk showing the number of countries and median income group for countries in each region:
# strangely, base R median doesn't work on ordered categories,
# but we can use Median from DescTools instead
fertility %>%
select(country, region, income_group) %>%
distinct() %>%
group_by(region) %>%
summarize(n = n(), median = DescTools::Median(income_group)) %>%
arrange(desc(median))
# A tibble: 7 × 3
region n median
<chr> <int> <ord>
1 Europe & Central Asia 57 High
2 North America 3 High
3 East Asia & Pacific 35 Upper middle
4 Latin America & Caribbean 39 Upper middle
5 Middle East & North Africa 21 Upper middle
6 South Asia 8 Lower middle
7 Sub-Saharan Africa 48 Lower middle
Next, here’s a chunk showing the median fertility rate in each region for the most recent year of 2022, as well as the countries with the highest and lowest 2022 rates (and what the rates are) in each region:
# first filter to get the right year, then
# sort by region, rate (so min, max are the first, last in each group)
# then summarize to get median, and min/max country/rate
fertility %>%
filter(year == max(year)) %>%
arrange(region, rate) %>%
group_by(region) %>%
summarize(
n = n(),
median = median(rate),
min_country = first(country),
min = first(rate),
max_country = last(country),
max = last(rate)
) %>%
arrange(median)
# A tibble: 7 × 7
region n median min_country min max_country max
<chr> <int> <dbl> <chr> <dbl> <chr> <dbl>
1 North America 3 1.33 Bermuda 1.3 United States 1.66
2 Europe & Central Asia 55 1.53 Spain 1.16 Uzbekistan 3.31
3 Latin America & Caribbean 40 1.74 Puerto Rico 0.9 Haiti 2.77
4 South Asia 8 1.99 Bhutan 1.40 Afghanistan 4.52
5 East Asia & Pacific 34 2.24 Hong Kong 0.701 Solomon Islands 3.92
6 Middle East & North Africa 21 2.40 Malta 1.15 Yemen 3.72
7 Sub-Saharan Africa 48 4.18 Mauritius 1.32 Niger 6.75
We can also show the latest rate for each country, as well as the change from 2000, just before the start of the 21st century:
# first filter to get the right years, then
# sort by country, year (so 2000, 2022 are first and last in each group)
# then summarize to get 2000 and 2022 rates, mutate to get change,
# then ungroup, distinct, and arrange to display a neat output
# again, collapsing output due to lengthy print out
# %T>% is used again to both save and print the results
fertility_change <- fertility %>%
filter(year %in% c(2000, max(year))) %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(
rate2000 = first(rate),
rate2022 = last(rate),
change = rate2022 - rate2000
) %>%
select(country, region, rate2000, change, rate2022) %>%
ungroup() %>%
distinct() %>%
arrange(rate2022) %T>%
print(n = Inf)
# A tibble: 209 × 5
country region rate2000 change rate2022
<chr> <chr> <dbl> <dbl> <dbl>
1 Hong Kong East Asia & Pacific 1.03 -0.331 0.701
2 South Korea East Asia & Pacific 1.48 -0.702 0.778
3 Puerto Rico Latin America & Caribbean 2.05 -1.15 0.9
4 Singapore East Asia & Pacific 1.6 -0.56 1.04
5 Macao East Asia & Pacific 0.912 0.197 1.11
6 Malta Middle East & North Africa 1.68 -0.53 1.15
7 Spain Europe & Central Asia 1.22 -0.0600 1.16
8 China East Asia & Pacific 1.63 -0.453 1.18
9 Aruba Latin America & Caribbean 1.90 -0.725 1.18
10 Italy Europe & Central Asia 1.26 -0.0200 1.24
11 Japan East Asia & Pacific 1.36 -0.100 1.26
12 Poland Europe & Central Asia 1.37 -0.109 1.26
13 Ukraine Europe & Central Asia 1.12 0.149 1.26
14 Lithuania Europe & Central Asia 1.39 -0.120 1.27
15 Bermuda North America 1.74 -0.44 1.3
16 Curaçao Latin America & Caribbean 2.19 -0.89 1.3
17 Luxembourg Europe & Central Asia 1.76 -0.45 1.31
18 Cyprus Europe & Central Asia 1.64 -0.326 1.31
19 Thailand East Asia & Pacific 1.61 -0.295 1.32
20 Finland Europe & Central Asia 1.73 -0.41 1.32
21 Mauritius Sub-Saharan Africa 1.99 -0.67 1.32
22 Canada North America 1.51 -0.18 1.33
23 Jamaica Latin America & Caribbean 2.21 -0.87 1.34
24 Bosnia & Herzegovina Europe & Central Asia 1.28 0.0630 1.35
25 Albania Europe & Central Asia 2.23 -0.855 1.38
26 Bahamas Latin America & Caribbean 2.10 -0.717 1.38
27 St. Lucia Latin America & Caribbean 2.20 -0.815 1.39
28 Switzerland Europe & Central Asia 1.5 -0.110 1.39
29 Bhutan South Asia 3.41 -2.02 1.40
30 Austria Europe & Central Asia 1.36 0.0500 1.41
31 Estonia Europe & Central Asia 1.36 0.0500 1.41
32 Norway Europe & Central Asia 1.85 -0.44 1.41
33 Russia Europe & Central Asia 1.20 0.221 1.42
34 Greece Europe & Central Asia 1.25 0.18 1.43
35 Portugal Europe & Central Asia 1.55 -0.120 1.43
36 United Arab Emirates Middle East & North Africa 2.73 -1.29 1.44
37 Cuba Latin America & Caribbean 1.58 -0.132 1.45
38 Germany Europe & Central Asia 1.38 0.0750 1.46
39 Channel Islands Europe & Central Asia 1.49 -0.0180 1.47
40 Latvia Europe & Central Asia 1.25 0.22 1.47
41 Liechtenstein Europe & Central Asia 1.57 -0.100 1.47
42 Uruguay Latin America & Caribbean 2.17 -0.685 1.48
43 Netherlands Europe & Central Asia 1.72 -0.233 1.49
44 Belarus Europe & Central Asia 1.32 0.178 1.50
45 Kosovo Europe & Central Asia 2.66 -1.15 1.51
46 St. Kitts & Nevis Latin America & Caribbean 2.20 -0.684 1.51
47 Hungary Europe & Central Asia 1.32 0.2 1.52
48 Sweden Europe & Central Asia 1.54 -0.0200 1.52
49 Costa Rica Latin America & Caribbean 2.41 -0.89 1.52
50 Belgium Europe & Central Asia 1.67 -0.140 1.53
51 Croatia Europe & Central Asia 1.39 0.140 1.53
52 Chile Latin America & Caribbean 1.99 -0.448 1.54
53 Denmark Europe & Central Asia 1.77 -0.22 1.55
54 Slovenia Europe & Central Asia 1.26 0.29 1.55
55 Slovakia Europe & Central Asia 1.3 0.27 1.57
56 United Kingdom Europe & Central Asia 1.64 -0.0700 1.57
57 Isle of Man Europe & Central Asia 1.69 -0.123 1.57
58 Sint Maarten Latin America & Caribbean 1.85 -0.278 1.57
59 Armenia Europe & Central Asia 1.60 -0.0280 1.58
60 Antigua & Barbuda Latin America & Caribbean 2.20 -0.616 1.58
61 Dominica Latin America & Caribbean 2.35 -0.763 1.59
62 Iceland Europe & Central Asia 2.08 -0.491 1.59
63 North Macedonia Europe & Central Asia 1.86 -0.261 1.6
64 Trinidad & Tobago Latin America & Caribbean 1.77 -0.155 1.61
65 Czechia Europe & Central Asia 1.15 0.468 1.62
66 Brazil Latin America & Caribbean 2.26 -0.629 1.63
67 Australia East Asia & Pacific 1.76 -0.126 1.63
68 Serbia Europe & Central Asia 1.48 0.150 1.63
69 Barbados Latin America & Caribbean 1.78 -0.141 1.63
70 Turks & Caicos Islands Latin America & Caribbean 2.51 -0.853 1.66
71 New Zealand East Asia & Pacific 1.98 -0.32 1.66
72 United States North America 2.06 -0.391 1.66
73 Azerbaijan Europe & Central Asia 2 -0.33 1.67
74 Maldives South Asia 2.71 -1.03 1.68
75 Iran Middle East & North Africa 2.02 -0.335 1.68
76 French Polynesia East Asia & Pacific 2.60 -0.908 1.69
77 Colombia Latin America & Caribbean 2.57 -0.88 1.69
78 Ireland Europe & Central Asia 1.89 -0.19 1.7
79 Brunei East Asia & Pacific 2.35 -0.582 1.76
80 St. Vincent & Grenadines Latin America & Caribbean 2.34 -0.562 1.78
81 Bulgaria Europe & Central Asia 1.26 0.52 1.78
82 Qatar Middle East & North Africa 3.23 -1.45 1.78
83 El Salvador Latin America & Caribbean 3.14 -1.36 1.78
84 Malaysia East Asia & Pacific 2.91 -1.13 1.79
85 North Korea East Asia & Pacific 1.97 -0.176 1.79
86 France Europe & Central Asia 1.89 -0.0960 1.79
87 Bahrain Middle East & North Africa 2.78 -0.981 1.80
88 Moldova Europe & Central Asia 1.50 0.301 1.8
89 Montenegro Europe & Central Asia 2.06 -0.265 1.8
90 Mexico Latin America & Caribbean 2.72 -0.913 1.80
91 Romania Europe & Central Asia 1.31 0.5 1.81
92 Palau East Asia & Pacific 1.83 0 1.83
93 Gibraltar Europe & Central Asia 1.92 -0.0820 1.84
94 Greenland Europe & Central Asia 2.33 -0.49 1.84
95 Argentina Latin America & Caribbean 2.59 -0.715 1.88
96 Cape Verde Sub-Saharan Africa 3.54 -1.66 1.88
97 Turkey Europe & Central Asia 2.50 -0.623 1.88
98 Vietnam East Asia & Pacific 2.07 -0.131 1.94
99 Bangladesh South Asia 3.22 -1.27 1.95
100 Sri Lanka South Asia 2.19 -0.217 1.97
101 Grenada Latin America & Caribbean 2.58 -0.596 1.99
102 Belize Latin America & Caribbean 3.63 -1.64 1.99
103 Virgin Islands Latin America & Caribbean 1.87 0.134 2
104 Ecuador Latin America & Caribbean 3.10 -1.10 2.00
105 Nepal South Asia 3.94 -1.94 2.01
106 India South Asia 3.35 -1.34 2.01
107 New Caledonia East Asia & Pacific 2.59 -0.57 2.02
108 Faroe Islands Europe & Central Asia 2.58 -0.532 2.05
109 Georgia Europe & Central Asia 1.60 0.462 2.06
110 Tunisia Middle East & North Africa 2.05 0.0150 2.06
111 Lebanon Middle East & North Africa 2.50 -0.419 2.08
112 Kuwait Middle East & North Africa 2.74 -0.645 2.09
113 Myanmar East Asia & Pacific 2.78 -0.658 2.13
114 Indonesia East Asia & Pacific 2.54 -0.383 2.15
115 Peru Latin America & Caribbean 2.84 -0.681 2.16
116 Dominican Republic Latin America & Caribbean 2.86 -0.612 2.25
117 Nicaragua Latin America & Caribbean 3.11 -0.827 2.28
118 Panama Latin America & Caribbean 2.74 -0.447 2.30
119 Morocco Middle East & North Africa 2.80 -0.497 2.30
120 Seychelles Sub-Saharan Africa 2.08 0.240 2.32
121 Cambodia East Asia & Pacific 3.77 -1.45 2.32
122 Suriname Latin America & Caribbean 2.90 -0.573 2.32
123 Honduras Latin America & Caribbean 4.24 -1.90 2.34
124 South Africa Sub-Saharan Africa 2.41 -0.0720 2.34
125 Guatemala Latin America & Caribbean 4.58 -2.23 2.35
126 Guyana Latin America & Caribbean 3.02 -0.648 2.37
127 St. Martin Latin America & Caribbean 2.71 -0.329 2.38
128 Saudi Arabia Middle East & North Africa 4.12 -1.72 2.39
129 Libya Middle East & North Africa 2.85 -0.449 2.40
130 Paraguay Latin America & Caribbean 3.55 -1.11 2.44
131 Laos East Asia & Pacific 4.4 -1.95 2.45
132 Fiji East Asia & Pacific 3.03 -0.564 2.46
133 Guam East Asia & Pacific 3.01 -0.457 2.55
134 Oman Middle East & North Africa 3.89 -1.33 2.57
135 Bolivia Latin America & Caribbean 3.99 -1.41 2.58
136 Turkmenistan Europe & Central Asia 2.90 -0.281 2.62
137 Micronesia East Asia & Pacific 4.28 -1.61 2.67
138 Marshall Islands East Asia & Pacific 4.59 -1.92 2.67
139 Syria Middle East & North Africa 4.00 -1.30 2.70
140 Philippines East Asia & Pacific 3.71 -0.989 2.72
141 Botswana Sub-Saharan Africa 3.31 -0.558 2.75
142 Djibouti Middle East & North Africa 4.58 -1.82 2.76
143 Haiti Latin America & Caribbean 4.39 -1.62 2.77
144 Mongolia East Asia & Pacific 2.26 0.519 2.77
145 Eswatini Sub-Saharan Africa 4.00 -1.21 2.78
146 Jordan Middle East & North Africa 3.92 -1.13 2.79
147 Kyrgyzstan Europe & Central Asia 2.4 0.4 2.8
148 Algeria Middle East & North Africa 2.57 0.263 2.83
149 Egypt Middle East & North Africa 3.44 -0.564 2.88
150 Israel Middle East & North Africa 2.95 -0.0600 2.89
151 Lesotho Sub-Saharan Africa 3.66 -0.678 2.98
152 East Timor East Asia & Pacific 5.98 -2.93 3.05
153 Kazakhstan Europe & Central Asia 1.8 1.25 3.05
154 Tajikistan Europe & Central Asia 4.01 -0.865 3.14
155 Tuvalu East Asia & Pacific 3.81 -0.663 3.14
156 Papua New Guinea East Asia & Pacific 4.53 -1.36 3.16
157 Tonga East Asia & Pacific 4.11 -0.915 3.19
158 Namibia Sub-Saharan Africa 3.98 -0.727 3.25
159 Kiribati East Asia & Pacific 4.07 -0.794 3.27
160 Kenya Sub-Saharan Africa 5.14 -1.84 3.30
161 Uzbekistan Europe & Central Asia 2.58 0.728 3.31
162 Pakistan South Asia 5.26 -1.85 3.41
163 West Bank & Gaza Middle East & North Africa 5.44 -2.01 3.44
164 Zimbabwe Sub-Saharan Africa 3.97 -0.537 3.44
165 Iraq Middle East & North Africa 4.95 -1.50 3.44
166 Gabon Sub-Saharan Africa 4.47 -1.01 3.46
167 Nauru East Asia & Pacific 3.64 -0.179 3.46
168 Ghana Sub-Saharan Africa 4.85 -1.35 3.51
169 Vanuatu East Asia & Pacific 4.48 -0.783 3.70
170 Yemen Middle East & North Africa 6.32 -2.60 3.72
171 Rwanda Sub-Saharan Africa 5.92 -2.18 3.75
172 São Tomé & Principe Sub-Saharan Africa 5.18 -1.43 3.75
173 Eritrea Sub-Saharan Africa 5.40 -1.61 3.79
174 Madagascar Sub-Saharan Africa 5.40 -1.61 3.79
175 Malawi Sub-Saharan Africa 6.04 -2.19 3.85
176 Samoa East Asia & Pacific 4.51 -0.639 3.88
177 Sierra Leone Sub-Saharan Africa 6.36 -2.48 3.88
178 Comoros Sub-Saharan Africa 5.30 -1.38 3.91
179 Guinea-Bissau Sub-Saharan Africa 5.72 -1.80 3.92
180 Solomon Islands East Asia & Pacific 4.76 -0.831 3.92
181 Liberia Sub-Saharan Africa 5.88 -1.86 4.02
182 Ethiopia Sub-Saharan Africa 6.56 -2.50 4.06
183 Congo, Rep. Sub-Saharan Africa 4.76 -0.669 4.10
184 Equatorial Guinea Sub-Saharan Africa 5.83 -1.66 4.17
185 Togo Sub-Saharan Africa 5.27 -1.07 4.20
186 Zambia Sub-Saharan Africa 5.93 -1.68 4.24
187 Guinea Sub-Saharan Africa 5.94 -1.63 4.30
188 Senegal Sub-Saharan Africa 5.50 -1.18 4.31
189 South Sudan Sub-Saharan Africa 7.51 -3.18 4.34
190 Ivory Coast Sub-Saharan Africa 5.81 -1.47 4.34
191 Mauritania Sub-Saharan Africa 5.46 -1.12 4.34
192 Cameroon Sub-Saharan Africa 5.53 -1.15 4.38
193 Sudan Sub-Saharan Africa 5.38 -0.996 4.38
194 Uganda Sub-Saharan Africa 6.83 -2.36 4.47
195 Afghanistan South Asia 7.53 -3.01 4.52
196 Mozambique Sub-Saharan Africa 5.81 -1.25 4.56
197 Gambia Sub-Saharan Africa 5.80 -1.21 4.59
198 Tanzania Sub-Saharan Africa 5.69 -1.03 4.66
199 Burkina Faso Sub-Saharan Africa 6.52 -1.85 4.66
200 Benin Sub-Saharan Africa 5.89 -0.995 4.89
201 Burundi Sub-Saharan Africa 6.87 -1.89 4.98
202 Nigeria Sub-Saharan Africa 6.12 -0.981 5.14
203 Angola Sub-Saharan Africa 6.64 -1.43 5.21
204 Mali Sub-Saharan Africa 6.87 -1.01 5.87
205 Central African Republic Sub-Saharan Africa 5.92 0.00300 5.92
206 Congo, Dem. Rep. Sub-Saharan Africa 6.72 -0.612 6.11
207 Somalia Sub-Saharan Africa 7.61 -1.41 6.20
208 Chad Sub-Saharan Africa 7.25 -1.03 6.22
209 Niger Sub-Saharan Africa 7.73 -0.983 6.75
Which countries had the biggest drop in fertility?
# A tibble: 10 × 5
country region rate2000 change rate2022
<chr> <chr> <dbl> <dbl> <dbl>
1 South Sudan Sub-Saharan Africa 7.51 -3.18 4.34
2 Afghanistan South Asia 7.53 -3.01 4.52
3 East Timor East Asia & Pacific 5.98 -2.93 3.05
4 Yemen Middle East & North Africa 6.32 -2.60 3.72
5 Ethiopia Sub-Saharan Africa 6.56 -2.50 4.06
6 Sierra Leone Sub-Saharan Africa 6.36 -2.48 3.88
7 Uganda Sub-Saharan Africa 6.83 -2.36 4.47
8 Guatemala Latin America & Caribbean 4.58 -2.23 2.35
9 Malawi Sub-Saharan Africa 6.04 -2.19 3.85
10 Rwanda Sub-Saharan Africa 5.92 -2.18 3.75
Did any countries have actually increased?
# A tibble: 10 × 5
country region rate2000 change rate2022
<chr> <chr> <dbl> <dbl> <dbl>
1 Kazakhstan Europe & Central Asia 1.8 1.25 3.05
2 Uzbekistan Europe & Central Asia 2.58 0.728 3.31
3 Bulgaria Europe & Central Asia 1.26 0.52 1.78
4 Mongolia East Asia & Pacific 2.26 0.519 2.77
5 Romania Europe & Central Asia 1.31 0.5 1.81
6 Czechia Europe & Central Asia 1.15 0.468 1.62
7 Georgia Europe & Central Asia 1.60 0.462 2.06
8 Kyrgyzstan Europe & Central Asia 2.4 0.4 2.8
9 Moldova Europe & Central Asia 1.50 0.301 1.8
10 Slovenia Europe & Central Asia 1.26 0.29 1.55
How many countries are now above vs below replacement rate? We can do this by grouping by if the rate is at least 2.1—did I mention you can use transformations inside group_by()
as well?
# group by a new column called at_replacement indicating rate>=2.1
fertility %>%
group_by(at_replacement = rate >= 2.1) %>%
summarize(n = n()) %>%
mutate(pct = 100 * n / sum(n))
# A tibble: 2 × 3
at_replacement n pct
<lgl> <int> <dbl>
1 FALSE 3450 26.4
2 TRUE 9600 73.6
# we can repeat this grouping by region, this time using count()
# and just showing the percentage of countries at replacement
fertility %>%
count(region, at_replacement = rate >= 2.1) %>%
mutate(pct_at_rep = 100 * n / sum(n)) %>%
filter(at_replacement) %>%
select(-n, -at_replacement) %>%
arrange(-pct_at_rep)
# A tibble: 7 × 2
region pct_at_rep
<chr> <dbl>
1 Sub-Saharan Africa 22.6
2 Latin America & Caribbean 15.1
3 East Asia & Pacific 12.7
4 Europe & Central Asia 10.4
5 Middle East & North Africa 8.90
6 South Asia 3.55
7 North America 0.307
Let’s make a few plots of the data as well. Here are 2 plots showing median fertility rate over time grouping by either region or income level:
# first get mean rate in each region + year, then pipe into line plot
# fct_reorder2() is used to reorder legend to be same order as end of lines
# see https://forcats.tidyverse.org/reference/fct_reorder.html for more details
fertility %>%
group_by(region, year) %>%
summarize(median_rate = median(rate)) %>%
ggplot(aes(x = year, y = median_rate,
linetype = fct_reorder2(region, year, median_rate),
color = fct_reorder2(region, year, median_rate))) +
geom_hline(yintercept = 2.1, linetype = "dotted") + geom_line(linewidth = 1) +
labs(x = "Year", y = "Median fertility rate", linetype = "Region", color = "Region",
title = "Median fertility by region from 1960-2022",
subtitle = "(black dotted line at replacement rate of 2.1)") +
scale_x_continuous(expand = c(0, 0), breaks = seq(1960, 2022, 10),
minor_breaks = seq(1960, 2022, 2)) +
scale_y_continuous(expand = c(0, 0), limits = c(1.2, 7.2),
breaks = 2:7, minor_breaks = seq(1.2, 7.2, .2))
# first get mean rate in each income group + year, then pipe into line plot
fertility %>%
group_by(income_group, year) %>%
summarize(median_rate = median(rate)) %>%
ggplot(aes(x = year, y = median_rate,
linetype = income_group, color = income_group)) +
geom_hline(yintercept = 2.1, linetype = "dotted") + geom_line(linewidth = 1) +
labs(x = "Year", y = "Median fertility rate",
linetype = "Income group", color = "Income group",
title = "Median fertility by income group from 1960-2022",
subtitle = "(black dotted line at replacement rate of 2.1)") +
scale_x_continuous(expand = c(0, 0), breaks = seq(1960, 2022, 10),
minor_breaks = seq(1960, 2022, 2)) +
scale_y_continuous(expand = c(0, 0), limits = c(1.4, 7.2),
breaks = 2:7, minor_breaks = seq(1.6, 7.2, .2))
8.1.3.1 Bonus: choropleth
Here’s a bonus plotly choropleth just for fun (you do NOT need to learn this). You can change the year and pan/zoom to see each specific country. Red color indicates countries below the replacement rate 2.1, white indicates at the replacement rate, and blue indicates above the replacement rate.
library(plotly)
plot_ly(fertility, type = "choropleth", locations = ~code, z = ~rate,
text = ~country, frame = ~year, zmin = 0.7, zmax = 8.9, colorscale = list(
c(0, "rgb(255,0,0)"), c(0.1707, "rgb(255,255,255)"), c(1, "rgb(0,0,255)"))) %>%
layout(margin = list(l = 10, r = 10, b = 10, t = 35)) %>% animation_opts(frame = 100)
8.2 Merging
Let’s move on to another topic: merging data frames. Often, the information you need may be spread across several data frames from different samples or different sources, in which case you may need to merge data frames together.
There are generally 2 common ways data may need to be merged: binding rows, and joining columns.
- Binding rows is done when two data frames share the same columns, but have different rows, and you want to combine all the rows together.
- Joining columns is done when two data frames share the same rows, but have different columns, and you want to combine all the columns together.
These two are NOT the same, so it’s important to remember which is which.
8.2.1 Binding
Binding is the simpler of the two, so let’s start here. Returning briefly to the penguins data set, note there’s a year column which indicates when each sample was collected. This means there were 3 different studies conducted. Suppose each of the years was originally in its own data frame (I suspect this is likely to be true).
penguins2007 <- penguins %>% filter(year == 2007)
penguins2008 <- penguins %>% filter(year == 2008)
penguins2009 <- penguins %>% filter(year == 2009)
print(penguins2007, n = 5)
# A tibble: 103 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 female
3 Adelie Torgersen 40.3 18 195 3250 female
4 Adelie Torgersen 36.7 19.3 193 3450 female
5 Adelie Torgersen 39.3 20.6 190 3650 male
# ℹ 98 more rows
# ℹ 1 more variable: year <dbl>
print(penguins2008, n = 5)
# A tibble: 113 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Biscoe 39.6 17.7 186 3500 female
2 Adelie Biscoe 40.1 18.9 188 4300 male
3 Adelie Biscoe 35 17.9 190 3450 female
4 Adelie Biscoe 42 19.5 200 4050 male
5 Adelie Biscoe 34.5 18.1 187 2900 female
# ℹ 108 more rows
# ℹ 1 more variable: year <dbl>
print(penguins2009, n = 5)
# A tibble: 117 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Biscoe 35 17.9 192 3725 female
2 Adelie Biscoe 41 20 203 4725 male
3 Adelie Biscoe 37.7 16 183 3075 female
4 Adelie Biscoe 37.8 20 190 4250 male
5 Adelie Biscoe 37.9 18.6 193 2925 female
# ℹ 112 more rows
# ℹ 1 more variable: year <dbl>
We can see our 3 data frames penguins2007
, penguins2008
, penguins2009
share the exact same columns (same number of columns, same column names, same column types) but have different rows. Each data frame can be thought of as containing a fraction of the overall samples or subjects. In this case, we must row-bind the 3 data frames together.
The function bind_rows(df1, df2, ...)
lets us do this. Just pass each data frame that needs binding in:
penguins_bound <- bind_rows(penguins2007, penguins2008, penguins2009)
penguins_bound
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgersen 39.1 18.7 181 3750 male
2 Adelie Torgersen 39.5 17.4 186 3800 fema…
3 Adelie Torgersen 40.3 18 195 3250 fema…
4 Adelie Torgersen 36.7 19.3 193 3450 fema…
5 Adelie Torgersen 39.3 20.6 190 3650 male
6 Adelie Torgersen 38.9 17.8 181 3625 fema…
7 Adelie Torgersen 39.2 19.6 195 4675 male
8 Adelie Torgersen 41.1 17.6 182 3200 fema…
9 Adelie Torgersen 38.6 21.2 191 3800 male
10 Adelie Torgersen 34.6 21.1 198 4400 male
# ℹ 323 more rows
# ℹ 1 more variable: year <dbl>
It may be hard to see, since only the first 10 rows are printed here, but if you look at the row numbers, you can see we now have 103+113+117=333 rows after binding.
For best results, ensure input data frames have the exact same columns, with the same names and types!
8.2.2 Joining
Joining is slightly more complicated. This is for when all the rows are together, but the columns you need are spread out over several data frames. Often, this occurs when you obtain data from different sources and want to combine them to look for patterns, but it can also happen sometimes with multiple data frames gathered from the same source.
Joining is performed by first matching rows using a set of key variables before adding in the additional columns. The key variables should uniquely identify rows. Here’s a simple demo to start. Suppose you have the following data frames:
x <- tibble(
A = c("a", "b", "c"),
B = c(1, 2, 3)
)
y <- tibble(
A = c("a", "b", "d"),
C = ymd("2024.1.1") + 0:2
)
x
# A tibble: 3 × 2
A B
<chr> <dbl>
1 a 1
2 b 2
3 c 3
y
# A tibble: 3 × 2
A C
<chr> <date>
1 a 2024-01-01
2 b 2024-01-02
3 d 2024-01-03
Suppose you want to join x
and y
using column A as the “key”, i.e. “A” is the column that uniquely matches rows across data frames. There are 4 different joins you can use: inner_join()
, full_join()
, left_join()
, right_join()
, depending on which rows you want in the result:
-
inner_join(x, y)
matches rows using key variable(s), then returns only rows in bothx
andy
, -
full_join(x, y)
matches rows using key variable(s), then returns all rows in eitherx
ory
, -
left_join(x, y)
matches rows using key variable(s), then returns only rows inx
, -
right_join(x, y)
matches rows using key variable(s), then returns only rows iny
.
Below are demos of each join. Note each join has the exact same columns; the only difference is which rows are included in the output. Also note NAs
are automatically used to fill in any missing values that arise from the merge.
# only return rows in both x and y
inner_join(x, y)
# A tibble: 2 × 3
A B C
<chr> <dbl> <date>
1 a 1 2024-01-01
2 b 2 2024-01-02
# return rows in either x or y
full_join(x, y)
# A tibble: 4 × 3
A B C
<chr> <dbl> <date>
1 a 1 2024-01-01
2 b 2 2024-01-02
3 c 3 NA
4 d NA 2024-01-03
# return only rows in x
left_join(x, y)
# A tibble: 3 × 3
A B C
<chr> <dbl> <date>
1 a 1 2024-01-01
2 b 2 2024-01-02
3 c 3 NA
# return only rows in y
right_join(x, y)
# A tibble: 3 × 3
A B C
<chr> <dbl> <date>
1 a 1 2024-01-01
2 b 2 2024-01-02
3 d NA 2024-01-03
The above 4 are called the “mutating joins” since they mutate (i.e. change) the data frame by adding additional columns. There are 2 other less commonly used joins called the “filterting joins” called semi_join()
and anti_join()
that do not actually add columns; they are only used to filter rows that match or don’t match another data frame. Examples:
# semi_join(x, y) will FILTER rows in x that have matches in y
# note there are NO new columns added to x
semi_join(x, y)
# A tibble: 2 × 2
A B
<chr> <dbl>
1 a 1
2 b 2
# conversely, anti_join() will FILTER rows in x with NO match in y
# again, note NO new columns were added to x
anti_join(x, y)
# A tibble: 1 × 2
A B
<chr> <dbl>
1 c 3
These examples only have 1 key column, but the same works if you have 2 or more key columns; rows are matched if they have ALL the same values in the key columns.
When joining, R will automatically look for columns with the same name to use as key columns. Therefore, to run smoothly, it’s highly recommended to ensure the following before joining:
- Key columns should have the same name and type.
- All other non-key columns should have different names.
- When picking a join, carefully consider which rows you will actually need later.
- Generally, key values should be unique for each row. If you need to have duplicate key values, watch the joins closely to ensure the join works the way you want.
Let’s see a real example of joins in action.
8.2.3 Example: cleaning fertility
The World Bank fertility data set we used earlier obviously didn’t start out perfectly clean and tidy. When you download the CSV file from the data page it comes as a zip of the following two (renamed) files: fertility_meta.csv
and fertility_raw.csv
. One of them contains metadata for each country, while the other contains the actual annual fertility rates.
fertility_meta <- read_csv(
"https://bwu62.github.io/stat240-revamp/data/fertility_meta.csv")
fertility_raw <- read_csv(
"https://bwu62.github.io/stat240-revamp/data/fertility_raw.csv")
fertility_meta
# A tibble: 265 × 5
`Country Code` Region IncomeGroup SpecialNotes TableName
<chr> <chr> <chr> <chr> <chr>
1 ABW Latin America & Caribbean High income <NA> Aruba
2 AFE <NA> <NA> "26 countri… Africa E…
3 AFG South Asia Low income "The report… Afghanis…
4 AFW <NA> <NA> "22 countri… Africa W…
5 AGO Sub-Saharan Africa Lower middle inc… "The World … Angola
6 ALB Europe & Central Asia Upper middle inc… <NA> Albania
7 AND Europe & Central Asia High income <NA> Andorra
8 ARB <NA> <NA> "Arab World… Arab Wor…
9 ARE Middle East & North Africa High income <NA> United A…
10 ARG Latin America & Caribbean Upper middle inc… "The World … Argentina
# ℹ 255 more rows
fertility_raw
# A tibble: 266 × 67
`Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Aruba ABW Fertility rate,… SP.DYN.TFRT.IN 4.82 4.66
2 Africa Eastern and… AFE Fertility rate,… SP.DYN.TFRT.IN 6.72 6.74
3 Afghanistan AFG Fertility rate,… SP.DYN.TFRT.IN 7.28 7.28
4 Africa Western and… AFW Fertility rate,… SP.DYN.TFRT.IN 6.46 6.47
5 Angola AGO Fertility rate,… SP.DYN.TFRT.IN 6.71 6.79
6 Albania ALB Fertility rate,… SP.DYN.TFRT.IN 6.46 6.35
7 Andorra AND Fertility rate,… SP.DYN.TFRT.IN NA NA
8 Arab World ARB Fertility rate,… SP.DYN.TFRT.IN 6.93 6.98
9 United Arab Emirat… ARE Fertility rate,… SP.DYN.TFRT.IN 6.72 6.68
10 Argentina ARG Fertility rate,… SP.DYN.TFRT.IN 3.08 3.07
# ℹ 256 more rows
# ℹ 61 more variables: `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>,
# `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
# `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
# `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
# `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, …
First, there are several columns like "SpecialNotes"
, "Indicator Name"
, and "Indicator Code"
that we don’t need; we can remove those with select()
. Country names are duplicated in both data frames, so we can pick one to use. Let’s pick "TableName"
since the names are slightly nicer (check this yourself). The rest of the columns should probably be renamed with rename()
.
We can also change the "Low income"
, "Lower middle income"
, … values into just "Low"
, "Lower middle"
, … to slightly simplify the column. Shorter labels will also be easier to work with in filter()
or in plots.
Note as well the existence of several NAs in region and income. Again, verify this yourself, but these are all further subregion groupings of countries (e.g. AFE and AFW are for Eastern/Southern and Western/Central African countries), so we can simply drop these rows.
# apply the processing steps above
fertility_meta2 <- fertility_meta %>%
rename(code = "Country Code", country = "TableName", region = "Region", income_group = "IncomeGroup") %>%
select(code, country, region, income_group) %>%
mutate(income_group = sub(" income", "", income_group)) %>%
filter(!is.na(income_group))
fertility_raw2 <- fertility_raw %>%
select(2, "1960":last_col()) %>%
rename(code = "Country Code")
print(fertility_meta2, n = 5)
# A tibble: 216 × 4
code country region income_group
<chr> <chr> <chr> <chr>
1 ABW Aruba Latin America & Caribbean High
2 AFG Afghanistan South Asia Low
3 AGO Angola Sub-Saharan Africa Lower middle
4 ALB Albania Europe & Central Asia Upper middle
5 AND Andorra Europe & Central Asia High
# ℹ 211 more rows
print(fertility_raw2, n = 5)
# A tibble: 266 × 64
code `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969` `1970`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW 4.82 4.66 4.47 4.27 4.06 3.84 3.62 3.42 3.23 3.05 2.91
2 AFE 6.72 6.74 6.76 6.78 6.79 6.80 6.81 6.82 6.83 6.83 6.84
3 AFG 7.28 7.28 7.29 7.30 7.30 7.30 7.32 7.34 7.36 7.39 7.4
4 AFW 6.46 6.47 6.49 6.51 6.53 6.54 6.56 6.59 6.61 6.64 6.66
5 AGO 6.71 6.79 6.87 6.95 7.04 7.12 7.19 7.27 7.33 7.39 7.43
# ℹ 261 more rows
# ℹ 52 more variables: `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
# `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
# `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
# `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
# `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
# `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, …
Now comes the key step! We’re going to use the "code"
columns (which are the 3-letter ISO 3166-1 country codes) as the key columns and join the two data frames together. Let’s use inner_join()
to keep only countries that appear in both, since we want to look at region and income group together with fertility rates.
Note these data frames already satisfy the recommended conditions, i.e. the key columns have the same name and type and uniquely identify each country, and all other columns have different names, so there should be no issues.
fertility_joined <- inner_join(fertility_meta2, fertility_raw2)
fertility_joined
# A tibble: 216 × 67
code country region income_group `1960` `1961` `1962` `1963` `1964` `1965` `1966`
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW Aruba Latin… High 4.82 4.66 4.47 4.27 4.06 3.84 3.62
2 AFG Afghan… South… Low 7.28 7.28 7.29 7.30 7.30 7.30 7.32
3 AGO Angola Sub-S… Lower middle 6.71 6.79 6.87 6.95 7.04 7.12 7.19
4 ALB Albania Europ… Upper middle 6.46 6.35 6.21 6.05 5.85 5.62 5.46
5 AND Andorra Europ… High NA NA NA NA NA NA NA
6 ARE United… Middl… High 6.72 6.68 6.66 6.62 6.57 6.52 6.49
7 ARG Argent… Latin… Upper middle 3.08 3.07 3.11 3.10 3.08 3.06 3.05
8 ARM Armenia Europ… Upper middle 4.79 4.67 4.52 4.34 4.15 3.99 3.83
9 ASM Americ… East … High NA NA NA NA NA NA NA
10 ATG Antigu… Latin… High 4.60 4.56 4.55 4.54 4.48 4.42 4.32
# ℹ 206 more rows
# ℹ 56 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
# `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
# `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
# `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
# `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
This already looks pretty good! Our data frames joined nicely along the key "code"
column, and now our region, income, and fertility data is in 1 single data frame instead of spread across 2 data frames.
However, you may have noticed we still have a problem: our annual fertility rates are spread out across 63 columns, with each year in its own column. This is not compatible with our tidyverse functions. How do we fix it?
8.3 Pivoting
This is where pivoting becomes relevant. Pivoting (also sometimes called reshaping) refers to the process of transforming data between different representations of it. Let’s start with a small example to better illustrate this point.
Suppose two people Alice and Bob are playing a game. They agree to play 3 rounds. Each round, each person tries to score points up to a maximum of 10. The person that wins the most rounds wins overall. Suppose these are the final scores:
# create demo scores data frame
# sample() is used here to randomly generate points
scores_long <- tibble(
round = rep(1:3, each = 2),
player = rep(c("Alice", "Bob"), 3),
points = c(9, 4, 7, 1, 2, 7)
)
scores_long
# A tibble: 6 × 3
round player points
<int> <chr> <dbl>
1 1 Alice 9
2 1 Bob 4
3 2 Alice 7
4 2 Bob 1
5 3 Alice 2
6 3 Bob 7
However, the same data can also be represented in this alternative format:
# A tibble: 3 × 3
round Alice Bob
<int> <dbl> <dbl>
1 1 9 4
2 2 7 1
3 3 2 7
Despite the difference in shape and dimensions, the data contained in these data frames is identical. These are often called the long and wide representations of data, so named because one tends to be longer (i.e. more rows) and the other wider (i.e. more columns). Generally, long formats are easier to use with R/tidyverse, whereas wide formats are easier for humans to read, but there are of course exceptions to everything.
Usually, for data to be considered “tidy” it needs to be in a longer format. Tidy data is data that satisfies the following:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.33
Data in this format will generally be easier to work with using tidyverse, and broadly speaking R, since this format is highly versatile and most functions have been designed to work best on this format.
Hopefully, it should be easy to see that scores_long
above satisfies these properties. Each variable is indeed a column (and vice versa), each row an observation (vv.), etc. It should also be evident scores_wide
is NOT tidy, since each row actually contains 2 scores, and each column is a person, not a variable (the distinction here is subtle but meaningful).
Tidyr’s pivot_longer()
and pivot_wider()
functions let you easily convert between these two representations. When used correctly, these two should be inverses of each other, i.e. you can take a data frame and pivot it longer, then pivot it wider (or vice versa) and get back the original data frame. There are many ways of using these functions (see help page) but here’s a brief guide to their most common and important usage:
-
df %>% pivot_longer(cols, names_to = "..", values_to = "..")
turns a “wider” representation into a “longer” representation.-
cols
is a set of columns where the actual observations are “spread out over”. You can set this using numbers, names, ranges, or selector functions (just like forselect()
). In thescores_wide
data frame, this would be the two columnsAlice
andBob
since this is where the points for each round are stored. -
names_to
sets the name for the new column containing the names of thecols
. In thescores_wide
data frame, it will be filled with repetitions of"Alice"
and"Bob"
corresponding to each round’s points. -
values_to
sets the name for the new column containing the values insidecols
. In thescores_wide
data frame, it will be filled with the actual points scored in each round.
# example: scores_wide %>% pivot_longer(Alice:Bob, names_to = "player", values_to = "points")
# A tibble: 6 × 3 round player points <int> <chr> <dbl> 1 1 Alice 9 2 1 Bob 4 3 2 Alice 7 4 2 Bob 1 5 3 Alice 2 6 3 Bob 7
-
-
df %>% pivot_wider(names_from = "..", values_from = "..")
turns a “longer” represetation into a “wider” representation.-
names_from
sets the column of names that will each become their own column in the wider data frame. In thescores_long
data frame, this would be theplayer
column. -
values_from
sets the column of actual observations that will fill each of the newly created wider set of columns. In thescores_long
data frame, this would be thepoints
column.
# example: scores_long %>% pivot_wider(names_from = "player", values_from = "points")
# A tibble: 3 × 3 round Alice Bob <int> <dbl> <dbl> 1 1 9 4 2 2 7 1 3 3 2 7
-
Note in the above examples, passing scores_wide
to pivot_longer()
gives us scores_long
exactly, and passing scores_long
to pivot_wider()
gives us scores_wide
exactly.
8.3.1 Example: finish cleaning fertility
Let’s turn back to the fertility_joined
example. After joining, we were left with the following data frame:
fertility_joined
# A tibble: 216 × 67
code country region income_group `1960` `1961` `1962` `1963` `1964` `1965` `1966`
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW Aruba Latin… High 4.82 4.66 4.47 4.27 4.06 3.84 3.62
2 AFG Afghan… South… Low 7.28 7.28 7.29 7.30 7.30 7.30 7.32
3 AGO Angola Sub-S… Lower middle 6.71 6.79 6.87 6.95 7.04 7.12 7.19
4 ALB Albania Europ… Upper middle 6.46 6.35 6.21 6.05 5.85 5.62 5.46
5 AND Andorra Europ… High NA NA NA NA NA NA NA
6 ARE United… Middl… High 6.72 6.68 6.66 6.62 6.57 6.52 6.49
7 ARG Argent… Latin… Upper middle 3.08 3.07 3.11 3.10 3.08 3.06 3.05
8 ARM Armenia Europ… Upper middle 4.79 4.67 4.52 4.34 4.15 3.99 3.83
9 ASM Americ… East … High NA NA NA NA NA NA NA
10 ATG Antigu… Latin… High 4.60 4.56 4.55 4.54 4.48 4.42 4.32
# ℹ 206 more rows
# ℹ 56 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
# `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
# `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
# `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
# `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
# `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
Now it should be clear all we need to do is apply pivot_longer()
:
# let's pivot_longer(), then drop any rows with missing values
# after our careful processing, it should be safe to just do drop_na()
# we can also again convert income_group to ordered categories if needed,
# and finally arrange by country and year for extra neatness
fertility <- fertility_joined %>%
pivot_longer("1960":last_col(), names_to = "year", values_to = "rate") %>%
drop_na() %>%
mutate(income_group = factor(income_group, ordered = TRUE, levels = c(
"Low", "Lower middle", "Upper middle", "High"))) %>%
arrange(country, year)
# print our final, clean & tidy data frame!
fertility
# A tibble: 13,050 × 6
code country region income_group year rate
<chr> <chr> <chr> <ord> <chr> <dbl>
1 AFG Afghanistan South Asia Low 1960 7.28
2 AFG Afghanistan South Asia Low 1961 7.28
3 AFG Afghanistan South Asia Low 1962 7.29
4 AFG Afghanistan South Asia Low 1963 7.30
5 AFG Afghanistan South Asia Low 1964 7.30
6 AFG Afghanistan South Asia Low 1965 7.30
7 AFG Afghanistan South Asia Low 1966 7.32
8 AFG Afghanistan South Asia Low 1967 7.34
9 AFG Afghanistan South Asia Low 1968 7.36
10 AFG Afghanistan South Asia Low 1969 7.39
# ℹ 13,040 more rows
Our fertility
data set is now 100% fully cleaned and ready for exploration, visualization, and modeling!
Before we move on, let’s demonstrate one more usage for pivot_wider()
. This function is also great for making human-friendly summary tables, especially when looking at how 3 variables interrelate. For example, suppose we want to look at, for each region, the percentage of countries in each income group. If we just compute the percentages and print, we get something like this:
# count how many countries in each region + income group,
# regroup by region to compute percentages,
# remove the n count column, and print
region_income_pct <- fertility %>%
count(region, income_group) %>%
group_by(region) %>%
mutate(pct = round(100 * n / sum(n), 1)) %>%
select(-n) %>%
print()
# A tibble: 22 × 3
# Groups: region [7]
region income_group pct
<chr> <ord> <dbl>
1 East Asia & Pacific Low 2.9
2 East Asia & Pacific Lower middle 35.2
3 East Asia & Pacific Upper middle 26.4
4 East Asia & Pacific High 35.5
5 Europe & Central Asia Lower middle 5.6
6 Europe & Central Asia Upper middle 27.8
7 Europe & Central Asia High 66.7
8 Latin America & Caribbean Lower middle 10
9 Latin America & Caribbean Upper middle 47.5
10 Latin America & Caribbean High 42.5
# ℹ 12 more rows
This would be great for plotting if that’s what we wanted to do, e.g:
# plot income group percentages by region
# str_wrap used to line-break long region names
region_income_pct %>%
ggplot(aes(x = str_wrap(region, 12), y = pct, fill = income_group)) +
geom_col() + labs(x = NULL, y = "Percent", fill = "Income level",
title = "% countries in each income level by region")
However, we can also see the numeric values laid out as a table by using pivot_wider()
to put each income_group
in its own column:
# pivot region/income percentages to wider table format
region_income_pct %>%
pivot_wider(names_from = income_group, values_from = pct)
# A tibble: 7 × 5
# Groups: region [7]
region Low `Lower middle` `Upper middle` High
<chr> <dbl> <dbl> <dbl> <dbl>
1 East Asia & Pacific 2.9 35.2 26.4 35.5
2 Europe & Central Asia NA 5.6 27.8 66.7
3 Latin America & Caribbean NA 10 47.5 42.5
4 Middle East & North Africa 9.7 31.8 19.5 39
5 North America NA NA NA 100
6 South Asia 12.5 75 12.5 NA
7 Sub-Saharan Africa 46.3 40 12.6 1
The NAs are due to a lack of countries in that region/income combination. In this case, we can fill in missing values by setting the values_fill
argument:
# fill in missing values with 0 in region/income table
region_income_pct %>%
pivot_wider(names_from = income_group, values_from = pct, values_fill = 0)
# A tibble: 7 × 5
# Groups: region [7]
region Low `Lower middle` `Upper middle` High
<chr> <dbl> <dbl> <dbl> <dbl>
1 East Asia & Pacific 2.9 35.2 26.4 35.5
2 Europe & Central Asia 0 5.6 27.8 66.7
3 Latin America & Caribbean 0 10 47.5 42.5
4 Middle East & North Africa 9.7 31.8 19.5 39
5 North America 0 0 0 100
6 South Asia 12.5 75 12.5 0
7 Sub-Saharan Africa 46.3 40 12.6 1