Meenakshi Kushwaha
21st July, 2022
Common errors when reading file
"" around file namehere packageType here() in your console. What do you see?

here packagehere tells R where your file is
If your data is in the main directory where here() begins
If you are data is inside a folder
If you are data is deep inside a subfolder
Why not just use setwd()
Using here() makes your code more robust and shareable
tidyverseA collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
dplyr for data manipulationggplot2 for data visualizationsreadr for reading datastringr for string manipulationfilter()arrange()select()mutate()summarise(), used with group_by()These six functions provide the verbs for a language of data manipulation
Dataset of 142 countries, with values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
Learn more: Gapminder
# A tibble: 15 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
11 Afghanistan Asia 2002 42.1 25268405 727.
12 Afghanistan Asia 2007 43.8 31889923 975.
13 Albania Europe 1952 55.2 1282697 1601.
14 Albania Europe 1957 59.3 1476505 1942.
15 Albania Europe 1962 64.8 1728137 2313.
filter()Keep or discard observations that satisfy certain condition
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| India | Asia | 1952 | 37.373 | 3.72e+08 | 546.5657 |
| India | Asia | 1957 | 40.249 | 4.09e+08 | 590.0620 |
| India | Asia | 1962 | 43.605 | 4.54e+08 | 658.3472 |
| India | Asia | 1967 | 47.193 | 5.06e+08 | 700.7706 |
| India | Asia | 1972 | 50.651 | 5.67e+08 | 724.0325 |
| India | Asia | 1977 | 54.208 | 6.34e+08 | 813.3373 |
filter()| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| India | Asia | 1952 | 37.373 | 3.72e+08 | 546.5657 |
| India | Asia | 1957 | 40.249 | 4.09e+08 | 590.0620 |
| India | Asia | 1962 | 43.605 | 4.54e+08 | 658.3472 |
| India | Asia | 1967 | 47.193 | 5.06e+08 | 700.7706 |
| India | Asia | 1972 | 50.651 | 5.67e+08 | 724.0325 |
%>%filter(gapminder, country == "India")
is same as
gapminder %>% filter(country == "India")
Memory tip: %>% can be read as “and, then”
Keyboard shortcut Ctrl/Cmd + Shift + M
How would you filter data from all asian countries that have life expectancy (lifeExp) higher than 80?
gapminder %>% filter(continent = “Asia”, lifeExp>“80”)
gapminder %>% filter(continent = “Asia”, lifeExp>80)
gapminder %>% filter(continent == Asia, lifeExp>80)
filter()the | operator signifies “or”
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| India | Asia | 1952 | 37.373 | 372000000 | 546.5657 |
| India | Asia | 1957 | 40.249 | 409000000 | 590.0620 |
| India | Asia | 1962 | 43.605 | 454000000 | 658.3472 |
| India | Asia | 1967 | 47.193 | 506000000 | 700.7706 |
| India | Asia | 1972 | 50.651 | 567000000 | 724.0325 |
| India | Asia | 1977 | 54.208 | 634000000 | 813.3373 |
| India | Asia | 1982 | 56.596 | 708000000 | 855.7235 |
| India | Asia | 1987 | 58.553 | 788000000 | 976.5127 |
| India | Asia | 1992 | 60.223 | 872000000 | 1164.4068 |
| India | Asia | 1997 | 61.765 | 959000000 | 1458.8174 |
| India | Asia | 2002 | 62.879 | 1034172547 | 1746.7695 |
| India | Asia | 2007 | 64.698 | 1110396331 | 2452.2104 |
| Nepal | Asia | 1952 | 36.157 | 9182536 | 545.8657 |
| Nepal | Asia | 1957 | 37.686 | 9682338 | 597.9364 |
filter()Using %in% to match more than one value
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Albania | Europe | 1952 | 55.230 | 1282697 | 1601.0561 |
| Albania | Europe | 1962 | 64.820 | 1728137 | 2312.8890 |
| Albania | Europe | 1972 | 67.690 | 2263554 | 3313.4222 |
| Algeria | Africa | 1952 | 43.077 | 9279525 | 2449.0082 |
| Algeria | Africa | 1962 | 48.303 | 11000948 | 2550.8169 |
arrange()Arrange rows in asending order by default
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Albania | Europe | 1952 | 55.230 | 1282697 | 1601.0561 |
| Algeria | Africa | 1952 | 43.077 | 9279525 | 2449.0082 |
| Angola | Africa | 1952 | 30.015 | 4232095 | 3520.6103 |
| Argentina | Americas | 1952 | 62.485 | 17876956 | 5911.3151 |
| Australia | Oceania | 1952 | 69.120 | 8691212 | 10039.5956 |
arrange()Arrange rows in descening order using desc
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Albania | Europe | 1972 | 67.690 | 2263554 | 3313.4222 |
| Algeria | Africa | 1972 | 54.518 | 14760787 | 4182.6638 |
| Angola | Africa | 1972 | 37.928 | 5894858 | 5473.2880 |
| Argentina | Americas | 1972 | 67.065 | 24779799 | 9443.0385 |
| Australia | Oceania | 1972 | 71.930 | 13177000 | 16788.6295 |
Select the code to arrange population (pop) in descening order
gapminder %>% arrange(pop)
gapminder %>% arrange(year)
select()Select variables or columns of interest
| country | year | pop |
|---|---|---|
| Afghanistan | 1952 | 8425333 |
| Afghanistan | 1957 | 9240934 |
| Afghanistan | 1962 | 10267083 |
| Afghanistan | 1967 | 11537966 |
| Afghanistan | 1972 | 13079460 |
| Afghanistan | 1977 | 14880372 |
select()Drop variables using -
| country | continent | year | lifeExp | gdpPercap |
|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 786.1134 |
select()There are a number of helper functions you can use within select():
starts_with(“abc”): matches names that begin with “abc”
ends_with(“xyz”): matches names that end with “xyz”
contains(“ijk”): matches names that contain “ijk”
matches(“(.)\1”): selects variables that match a regular expression
num_range(“x”, 1:3): matches x1, x2 and x3
mutate()Adds new variable at the end of your dataset
| country | pop | pop_mil |
|---|---|---|
| Afghanistan | 8425333 | 8.4 |
| Afghanistan | 9240934 | 9.2 |
| Afghanistan | 10267083 | 10.3 |
| Afghanistan | 11537966 | 11.5 |
| Afghanistan | 13079460 | 13.1 |
| Afghanistan | 14880372 | 14.9 |
mutate()Some of the ways that you can create new variables
+, -, *, /, ^%/% (integer division) and %% (remainder)log(), log2(), log10(), etc.lead() and lag()<, <=, >, >=, !=, and ==min_rank()mutate()Example
| country | year | pop | pop_rank |
|---|---|---|---|
| Sao Tome and Principe | 1952 | 60011 | 1 |
| Sao Tome and Principe | 1957 | 61325 | 2 |
| Djibouti | 1952 | 63149 | 3 |
| Sao Tome and Principe | 1962 | 65345 | 4 |
| Sao Tome and Principe | 1967 | 70787 | 5 |
| Djibouti | 1957 | 71851 | 6 |
What does the following code do?
Remove country, year and pop
Add new variables country, year , pop, and pop_lakh to the dataset
country, year and pop from the dataset and add a new variable pop_lakhmutate() with a conditionMake a new variable with numeric code for each continent
| country | continent | year | pop | cont_code |
|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 8425333 | 3 |
| Afghanistan | Asia | 1957 | 9240934 | 3 |
| Afghanistan | Asia | 1962 | 10267083 | 3 |
| Afghanistan | Asia | 1967 | 11537966 | 3 |
| Afghanistan | Asia | 1972 | 13079460 | 3 |
| Afghanistan | Asia | 1977 | 14880372 | 3 |
case_when()case_when() with mutate()| country | continent | year | pop | cont_code |
|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 8425333 | 3 |
| Afghanistan | Asia | 1957 | 9240934 | 3 |
| Afghanistan | Asia | 1962 | 10267083 | 3 |
| Afghanistan | Asia | 1967 | 11537966 | 3 |
| Afghanistan | Asia | 1972 | 13079460 | 3 |
| Afghanistan | Asia | 1977 | 14880372 | 3 |
case_when() with mutate()summarise()mean(), median()sd(), IQR()min(), max()n()