# Do-It-Yourself {#sec-open-science-DIY .unnumbered}

This section is all about you taking charge of the steering wheel and choosing your own adventure. For this block, we are going to use what we've learnt before to take a look at a dataset of casualties in the war in Afghanistan. The data was originally released by Wikileaks, and the version we will use is published by The Guardian.

You can read a bit more about the data at The Guardian's data blog.

::: {.panel-tabset group="language"}
## R

```{r message=FALSE}
library(tidyverse)
```

## Python

```{python}
import pandas
```
:::

## Data preparation

Before you can set off on your data journey, the dataset needs to be read, and there are a couple of details we will get out of the way so that it is easier for you to start working.

The data are published on a [Google Sheet](https://docs.google.com/spreadsheets/d/1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/edit?hl=en#gid=1).

As you will see, each row records the casualties for one month, split into four groups: Taliban, Civilians, Afghan forces, and NATO.

Let's read it into an R or Python session:

::: {.panel-tabset group="language"}
## R

```{r}
# Specify the URL of the CSV file
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRa7OIBiz7-yqmgwUEn4V5Wm1TO8rGow_wQVS1PWp--UTCAKqNUhtifECO5ZR9XrMd6Ddq9NxQwf1ll/pub?gid=0&single=true&output=csv"

# Read the data from the URL into a data frame
data <- read.csv(url)

# see the data
head(data)
```

## Python

```{python}
# Specify the URL of the CSV file
url = ("https://docs.google.com/spreadsheets/d/1EAx8_ksSCmoWW_SlhFyq2QrRn0FNNhcg1TtDFJzZRgc/export?format=csv&gid=1")

# Read the data from the URL into a DataFrame
data = pandas.read_csv(url, skiprows=[0, -1], thousands=",")

# see the data
data.head()
```
:::

This allows us to read the data straight into a `DataFrame`, as we have done in the previous session.

Now we are good to go!

## Tasks

Now, the challenge is to put to work what we have learnt in this block. The suggestion is that you analyse the Afghan Logs in much the same way as we looked at population composition in Liverpool. These are of course very different datasets reflecting immensely different realities. Their structure, however, is broadly parallel: both capture counts aggregated by a unit, spatial (neighbourhood) in one case and temporal (month) in the other, and each count is split into a few categories.

Try to answer the following questions:

1.  What is the minimum number of civilian casualties, and in which month did it occur?
2.  How many NATO casualties were registered in August 2008?
3.  Which month had the highest total number of casualties?

**Tip**: You will first need to create a column with total counts.
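
If you are unsure how to build such a column, here is a minimal sketch on a small made-up table, not on the Afghan Logs themselves. The numbers are invented, and the column names (`Month`, `Taliban`, `Civilians`, `Afghan forces`, `Nato`) and the `toy` and `groups` variables are assumptions for illustration only; check the actual column names in `data` before adapting it.

```python
import pandas

# Toy table with invented counts; the column names are assumptions,
# not necessarily those in the real dataset
toy = pandas.DataFrame(
    {
        "Month": ["January", "February", "March"],
        "Taliban": [10, 4, 7],
        "Civilians": [3, 8, 2],
        "Afghan forces": [1, 0, 5],
        "Nato": [2, 1, 1],
    }
)

# Columns that hold the per-group counts
groups = ["Taliban", "Civilians", "Afghan forces", "Nato"]

# Add the group columns up row by row (axis=1) to get a total per month
toy["Total"] = toy[groups].sum(axis=1)

# idxmax returns the row label of the largest total;
# .loc then retrieves that whole row
busiest = toy.loc[toy["Total"].idxmax()]
print(busiest["Month"], busiest["Total"])
```

The same pattern with `idxmin` on a single column is one possible way to locate the month with the fewest casualties in a given group.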