Skip to content

Latest commit

 

History

History
229 lines (204 loc) · 11.3 KB

File metadata and controls

229 lines (204 loc) · 11.3 KB

GAM smoothing vs GPS models

Overview

There are two parts that are covered:

  • Weekly average of PM10 Measurement using GAM
  • Understanding GPS data

Part 1: PM10 Measurement using GAM

According to the Datacamp course, a generalized additive model (GAM) is "a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions."

1. Personal rationale

It is worthwhile to use this approach because the smoothing functions of geo-environmental elements (e.g. DEM, XY coordinates, land use, buildings) might explain the complexity of pollution flow. In R CRAN, there are two packages called called MAPGAM and mgcv. MAPGAM (Mapping Smoothed Effect Estimates from Individual-Level Data) uses GAM for mapping odds ratios, other effect estimates using individual-level data such as case-control study data, while mgcv focuses on the mathematical perspective of GAM. After trying both, I decided to use mgcv for its miscellaneous applications e.g. convert to raster, analyse GIS files without pre-processing effort (MAPGAM does need some effort to bring shapefiles).

2. GAM Formula

Wiki says...

The model relates a univariate response variable, Y, to some predictor variables, xi. An exponential family distribution is specified for Y (for example normal, binomial or Poisson distributions) along with a link function g (for example the identity or log functions) relating the expected value of Y to the predictor variables via a structure such as

gam

The functions fi may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for f1(x1), and then use a factor model for f2(x2). This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor, provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.

https://en.wikipedia.org/wiki/Generalized_additive_model

3. Data preparation

Here is the weekly Average data shown in a data table. Click to enlarge image
WeeklyAvPM10

Note that pm2013_XX refers to:

  • pm2013_30: 2013 July week 4
  • pm2013_31: 2013 August week 1
  • pm2013_32: 2013 August week 2
  • pm2013_33: 2013 August week 3
  • pm2013_34: 2013 August week 4
  • pm2013_35: 2013 September week 1
  • pm2013_36: 2013 September week 2
  • pm2013_37: 2013 September week 3
  • pm2013_38: 2013 September week 4
  • pm2013_39: 2013 September week 5

Here is a map of 46 stations considered in this study Seoul Pollution Station

4. GAM calculation

I tested the GAM equation by using gam() function. The basic approach is to put the smoothing argument s() to the relevant variables. In my case, I put the s(X,Y) XY coordinates together as an interaction factor, and add a DEM smoother s(DEM).

The "method = REML" is a mathematical option that is used as default. From stackoverflow, REML is an acronym of REstricted Maximum Likelihood. Here is one of the online user's comments:

Indeed, the REML estimator of a variance component is usually (approximately) unbiased, while the ML estimator is negatively biased. However, the ML estimator usually has lower mean-squared error (MSE) than the REML estimator. So, if you want to be right on average, go with REML, but you pay for this with larger variability in the estimates. If you want to be closer to the true value on average, go with ML, but you pay for this with negative bias.

Since we have only 46 stations, I will leave the method as REML.

jul.w4 <- gam(pm2013_30 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
aug.w1 <- gam(pm2013_31 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
aug.w2 <- gam(pm2013_32 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
aug.w3 <- gam(pm2013_33 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
aug.w4 <- gam(pm2013_34 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w1 <- gam(pm2013_35 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w2 <- gam(pm2013_36 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w3 <- gam(pm2013_37 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w4 <- gam(pm2013_38 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w5 <- gam(pm2013_39 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")

Here is the R2 of each model:

  • July w4: .25
  • August w1: .23
  • August w2: .204
  • August w3: .077
  • August w4: .232
  • September w1: .16
  • September w2: .292
  • September w3: .208
  • September w4: .346
  • September w5: .327

The default model used gaussian distribution for model fitting. However, we can change the options as mentioned below:

  • binomial(link = "logit")
  • gaussian(link = "identity")
  • Gamma(link = "inverse")
  • inverse.gaussian(link = "1/mu^2")
  • poisson(link = "log")
  • quasi(link = "identity", variance = "constant")
  • quasibinomial(link = "logit")
  • quasipoisson(link = "log")

All options need further investigation, but so far, inverse.gaussian(link = "1/mu^2") got the highest R2 value. For example, September w5 was increased to .392.

5. GAM on a spatial grid

Now we got all the GAM models ready. But how will the model be translated to a map? The answer is to build a grid. As a start, I used 200 x 200 x 200 grid area of X, Y, Z(DEM).

According to Simon Wood's book "Generalized Additive Model 2nd Edition 2017", he noted that the model containing more than one smoothing function(e.g. f(x), f(y)) are only estimable within the additive constant. I will come back to that later, but so far, adding DEM into the model increased the R2. Click to enlarge image

Rplot05

Part 2: Understanding GPS data

Here is an example of the GPS data. As can be seen below, there are 13 columns in total and all measurements are taken every minute.

period who group loc1 loc1_1 time pm10 pm2_5 pm1 long lat X Y
2 1 2 1 32 2013-07-25 20 57 104.6 97 89.7 126.9822844 37.31476048 198429.6307
2 1 2 1 32 2013-07-25 20 48 106.6 83 78.2 126.9822908 37.31488669 198430.1956

Here is a brief code info. I removed data from the pilot study.

  • Period
    • Pilot: 1
    • Summer: 2
    • Winter: 3
  • who: researcher 1-5
  • group: cluster 1-9
  • loc1(location category)
    • indoors(no movement): 1
    • outdoors(no movement): 2
    • movement: 3
  • loc1_1(specific location)
    • Restaurant: 10
    • Coffee shop: 11
    • Bbq grill: 12
    • Bar: 13
    • Office: 14
    • Traditional market: 15
    • Superstore: 16
    • Department store: 17
    • Shopping complex: 18
    • Other shops(including convenience store): 19
    • Workplace: 20
    • Bank: 21
    • School: 22
    • Academy(educational): 23
    • Bookshop: 24
    • Rest centre for seniors: 25
    • Stroll: 26
    • Walking: 27
    • Bus: 28
    • Subway: 29
    • Taxi: 30
    • Personal vehicle: 31
    • At home: 32
    • Missing data: 999

Plots and Stats

This GPS track plot is derived from one researcher who commutes from Suwon to Seoul (1.5 hour distance). Click to enlarge image.
그림1

  • July (count by minutes)
PM10  < 50 < 100 < 150 < 200 > 200
  0    147   113    40     0     0
  1    115    96    28     0     1
  2    120   102    18     0     0
  3    140    87    12     0     1
  4    122    60    58     0     0
  5    124    59    57     0     0
  6    121    92    27     0     0
  7    112    68    60     0     0
  8     88    89    63     0     0
  9     96   121    19     2     2
  10    85   148     5     0     0
  11    63   151     7     3     0
  12   116    90    20     1     0
  13    81    69     9     0     0
  14    26   126     6     0     0
  15    60   118     2     0     0
  16    69    67     0     0     0
  17   120    60     0     0     0
  18   155    37     1     0     0
  19   167    10     2     0     1
  20   140    66     3     0     0
  21   119   121    25     5     0
  22   136   117    47     0     0
  23   161   116    23     0     0
  • August (count by minutes)
PM10  < 50 < 100 < 150 < 200 > 200
  0    134     4     8    31     3
  1    190     1     0    15     0
  2    180     0     0     0     0
  3    197     1     0     0     1
  4    198     2     0    24    16
  5    180     0    39    20     1
  6    159     9    54    16     2
  7    136    25     6    62     0
  8    169    18    19    34     0
  9    124    64     0     0     0
  10   120    60     0     0     0
  11   110    63     1     0     0
  12    81    95     0     0     1
  13   156    47     4     0     0
  14   176    42     4     1     1
  15   236     4     0     0     0
  16   221    17     1     1     0
  17   222    13     2     1     0
  18   236     4     0     0     0
  19   165    33    10     1    12
  20   159    18     2     1     0
  21   141     9    16    24    19
  22   120     0     6    10    44
  23   120     0     9    18    33
  • September (count by minutes)
PM10  < 50 < 100 < 150 < 200 > 200
  0    134     4     8    31     3
  1    190     1     0    15     0
  2    180     0     0     0     0
  3    197     1     0     0     1
  4    198     2     0    24    16
  5    180     0    39    20     1
  6    159     9    54    16     2
  7    136    25     6    62     0
  8    169    18    19    34     0
  9    124    64     0     0     0
  10   120    60     0     0     0
  11   110    63     1     0     0
  12    81    95     0     0     1
  13   156    47     4     0     0
  14   176    42     4     1     1
  15   236     4     0     0     0
  16   221    17     1     1     0
  17   222    13     2     1     0
  18   236     4     0     0     0
  19   165    33    10     1    12
  20   159    18     2     1     0
  21   141     9    16    24    19
  22   120     0     6    10    44
  23   120     0     9    18    33