71 changes: 67 additions & 4 deletions episodes/04-data-structures-part2.Rmd
@@ -37,10 +37,10 @@ So far, you have seen the basics of manipulating data frames with our nordic dat

::::::::::::::::::::::::::::::::::::::::: instructor

Pay attention to and explain the errors and warnings generated from the
examples in this episode.

:::::::::::::::::::::::::::::::::::::::::

```{r, echo=TRUE}
gapminder <- read.csv("data/gapminder_data.csv")
@@ -75,7 +75,7 @@ gapminder <- read.csv("https://datacarpentry.org/r-intro-geospatial/data/gapmind

- You can read directly from Excel spreadsheets without
converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package.


::::::::::::::::::::::::::::::::::::::::::::::::::

@@ -86,10 +86,12 @@ always do is check out what the data looks like with `str`:
str(gapminder)
```

We can also examine individual columns of the data frame with the `class` or
`typeof` functions:

```{r}
class(gapminder$year)
typeof(gapminder$year)
class(gapminder$country)
str(gapminder$country)
```
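
The two functions can give different answers for the same object. A minimal sketch on a made-up factor (`sizes` is a hypothetical vector used only for illustration and is not part of the gapminder data):

```{r}
# A small, hypothetical factor used only for illustration
sizes <- factor(c("small", "large", "small"))

# class() reports the high-level type of the object
sizes_class <- class(sizes)
sizes_class

# typeof() reports how R stores the object internally;
# a factor is stored as integer codes plus a "levels" attribute
sizes_type <- typeof(sizes)
sizes_type
```

Here `class` describes what the object is to the user, while `typeof` describes its underlying storage.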
@@ -281,6 +283,67 @@ tail(gapminder_norway)

To understand why R is giving us a warning when we try to add this row, let's learn a little more about factors.


## Removing columns and rows in data frames

To remove columns from a data frame, we can use the `subset` function.
This function allows us to remove columns by name.
If we want to keep all columns except `continent`, `pop` and `gdpPercap`, we can use the following `subset` command:

```{r}
life_expectancy <- subset(gapminder, select = -c(continent, pop, gdpPercap))
head(life_expectancy)
```

We can also use a logical vector to achieve the same result. Make sure the
vector's length matches the number of columns in the data frame; otherwise R will repeat (recycle) the shorter vector until it matches the number of columns:

```{r}
life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)]
head(life_expectancy)
```

Vector recycling occurs when working with vectors of different lengths. It
consists of repeating the elements of the shorter vector up to the length of
the longer one. For more information, check the book R for Data Science and its
[chapter about vectors](https://r4ds.had.co.nz/vectors.html#scalars-and-recycling-rules).
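
As a minimal sketch of recycling, using made-up vectors (`short_vec` and `long_vec` are illustrative names, not objects from this lesson):

```{r}
# Two hypothetical vectors of different lengths
short_vec <- c(10, 20)
long_vec <- c(1, 2, 3, 4)

# short_vec is reused as c(10, 20, 10, 20) before the addition
total <- long_vec + short_vec
total

# The same happens with logical indexing: c(TRUE, FALSE) is
# recycled, so every other element is kept
kept <- letters[1:6][c(TRUE, FALSE)]
kept
```

This is why a logical vector shorter than the number of columns can silently select more columns than intended.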

Alternatively, we can use column positions:

```{r}
life_expectancy <- gapminder[-c(3, 4, 6)]
head(life_expectancy)
```

Note that typically we select the rows we want to keep, rather than removing rows we do not want in the data.
However, to remove rows from a data frame, we can use their positions.
To practice on a smaller subset, we will filter the data to only those entries from Afghanistan after the year 2000.
This smaller dataset will be easier for us to inspect by eye and see the changes we are making.

```{r}
# Filter data for Afghanistan after the year 2000
# (note: despite the object name, these are 21st-century years):
afghanistan_20c <- gapminder[gapminder$country == "Afghanistan" &
                               gapminder$year > 2000, ]

# Now remove data for 2002, that is, the first row:
afghanistan_20c[-1, ]
```


In research, we often remove rows based on features of the data itself, rather than its location.
For example, you may want to remove all the missing data prior to an analysis. Let's first add some missing values (NAs) into the data and then we can use `na.omit()` to remove them.

```{r}
# Turn some values into NAs:
afghanistan_20c <- gapminder[gapminder$country == "Afghanistan", ]
afghanistan_20c[afghanistan_20c$year < 2007, "year"] <- NA
head(afghanistan_20c)

# Remove NAs
na.omit(afghanistan_20c)
```
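
Rows can be removed by any logical condition in the same way. The following is a sketch on a small made-up data frame (`toy` and its values are hypothetical and chosen only for illustration):

```{r}
# A small, made-up data frame for illustration only
toy <- data.frame(
  id = 1:4,
  value = c(2.5, NA, 7.1, 4.0)
)

# Keep rows where value is not missing; for this data frame
# this selects the same rows as na.omit(toy)
no_missing <- toy[!is.na(toy$value), ]
no_missing

# Remove rows that meet a condition by negating it with !
small_only <- no_missing[!(no_missing$value > 5), ]
small_only
```

Filtering out the missing values first avoids `NA` results in the second comparison, which would otherwise produce rows filled with `NA`.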


## Factors

Here is another thing to look out for: in a `factor`, each different value