adv-r/Profiling.rmd at master · gisdarcon/adv-r · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
---
title: Profiling and benchmarking
layout: default
---

```{r, echo = FALSE}
source("code/microbenchmark.R")
source("_plugins/png.R")
```

# Optimising code {#profiling}

> "Programmers waste enormous amounts of time thinking about, or worrying
> about, the speed of noncritical parts of their programs, and these attempts
> at efficiency actually have a strong negative impact when debugging and
> maintenance are considered. We should forget about small efficiencies, say
> about 97% of the time: premature optimization is the root of all evil"
> --- Donald Knuth.

Optimising code to make it run faster is an iterative process:

1. Find the biggest bottleneck, the slowest part of your code.
1. Eliminate the bottleneck.
1. Repeat until your code is fast enough.

This process is simple, but not easy. Firstly, your intuition for bottlenecks is likely to be bad. Even experienced programmers have a hard time identifying bottlenecks from code because there are so many layers between R code and the processor. Instead of trying to guess where performance bottlenecks are, it's better to __profile__ code, running it on realistic inputs and timining how long each individual operation takes.  This is the subject of the first part of this chapter, [profiling](#measure-perf). Optimising code before you've figured out what's actually slow is called premature optimisation.

Once you've identified a specific bottleneck you need to rewrite so it runs faster. It's difficult to provide general advice on how to do this. In [improving performance](#improve-perf) you'll learn five techniques that can be applied to many different problems. The focus of this chapter is improving performance within R. In [Rcpp](#rcpp), you'll learn another powerful technique for improving performance: re-writing R bottlenecks in C++. You'll also learn a general strategy for tackling bottlenecks that helps ensure you make your code faster without making it incorrect. As computers get faster and R is optimised, your code will get faster all by itself. Your code is never going to automatically become correct or elegant if it is not already.

The bottleneck metaphor is useful because code performance is similar to fluid flowing through a pipe. Constrictions in a pipe cause energy loses and reduce flow. If you want liquid to flow through a pipe faster, you should first widen the narrowest parts. Similarly with code, as soon as you eliminate one bottleneck, a new bottleneck will arise. For this reason, you need to identify how fast the code needs to be before you start. Premature optimisation corresponds to making pipes wider without knowing which are the narrowest.

It's important to differentiate between absolute and relative speed, and fast vs fast enough. Be very wary of only looking at relative differences. One approach may be 10x faster than another, but it might only save 1 ms. Optimisation is also not free. You need to consider the costs of your time vs. computer time. You want to spend hours of your time to save days of computing time, not seconds.

##### Prerequisites

In this chapter we'll be using the lineprof to understand performance of R code, so make sure you've installed it before continuing:

```{r, eval = FALSE}
devtools::install_github(c("wch/shiny-slickgrid", "hadley/lineprof"))
```

## Measuring performance {#measure-perf}

To understand performance you use a profiler. There are a number of types of profiler, but R uses a fairly simple sort called a sampling or statistical profiler. A sampling profiler stops code every few milliseconds and records which function is currently being called, along with the function that called that function and so on. For example, consider the following function `f()`

```{r, eval = FALSE}
library(lineprof)
f <- function() {
  pause(0.1)
  g()
  h()
}
g <- function() {
  pause(0.1)
  h()
}
h <- function() {
  pause(0.1)
}

tmp <- tempfile()
Rprof(tmp, interval = 0.1)
f()
Rprof(NULL)
```

(Note that we're using a `lineprof::pause()` instead of `Sys.sleep()` because it's implemented in such away that it doesn't appear in profiling output.)

Conceptually, the profiler produces output like this:

```
f()
f() > g()
f() > g() > h()
f() > h()
```

Each line represents one "tick" of the profiler, 0.1s in this case. You can see the code first spends 0.1 in f, then 0.1 in g (called from inside f).

In the real world, you're unlikely to get a result this nice. That's because profiling is hard to do accurately without slowing your code down many orders of magnitude. `RProf()` uses a very well established approximation technique: sampling! Basically, R stops itself every `interval` seconds and inspects what the current call is. There's some variability in the how accurate timer is, and how much time each operation takes so each time you profile code you'll get a slightly different answer. Fortunately, pinpoint accuracy is not needed to identify the slowest parts of your code.

Instead of looking at the individual records, we'll aggregate and display them using the lineprof package. There are a number of ways to visualise this data. For example, `summaryRprof()`, the proftools and the profr package. These are useful, but sophisticated tools. I wrote the lineprof package as a simpler way of visualising profiling data. It's less powerful than the alternatives, but it's easier to get started with because it shows you performance in the context of your code. As the name suggests, the fundamental unit of analysis in `lineprof()` is a line of code. This makes lineprof less precise than the alternatives (because a line of code might contain multiple function calls), but it's easier to see the context.

To use lineprof, we'll first save the code in file: this way we can easily find out line numbers and see performance in context. We then use `lineprof()` to run our function and capture the timing output. Printing this object shows some basic information. For now, we'll just focus on the time column which estimates how long each line took to run, and the ref, which tells us which line of code was run (you'll learn about the other columns in [memory profiling](#memory-profiling)). The estimates aren't perfect, but the ratios look about right.

```{r, eval = FALSE}
library(lineprof)
source("code/profiling.R")
l <- lineprof(f())
l
#>    time alloc release dups           ref     src
#> 1 0.074 0.001       0    0 profiling.R#2 f/pause
#> 2 0.143 0.002       0    0 profiling.R#3 f/g
#> 3 0.071 0.000       0    0 profiling.R#4 f/h
```

Lineprof does provide some tools to navigate through this data structure (like `focus()`), but they're a bit clumsy. Instead, we'll start an interactive explorer built using shiny. `shine(l)` will  open a new web page (or if you're using RStudio, a new pane) that shows your source code annotated with information about how long each line took to run. `shine()` starts a shiny app which "blocks" your R session. To exit, you'll need to stop the process using escape or ctrl + c.

```{r, echo = FALSE}
png("profiling-lineprof-f.png", dpi = 220)
```

The `t` column visualises how much time is spent on each line. It's not precise, but allows you to spot bottlenecks (if you want precise numbers you can hover over the bar). You can see that twice as much time is spent in `g()` as `h()`, so that it would make sense to drill down into `g()` for more details. Click `g()` to drill down:

```{r, echo = FALSE}
png("profiling-lineprof-g.png", dpi = 220)
```

Then `h()`:

```{r, echo = FALSE}
png("profiling-lineprof-h.png", dpi = 220)
```

For your own code, this should allow you to quickly navigate to the slowest parts of your code in order to start optimising them.

### Limitations

There are some other limitations to profiling:

* Profiling does not cross the boundary to C code - you can only see if your
  R code calls C/C++ code, not what C/C++ functions are called inside of that.
  Unfortunately tools for profiling compiled code are beyond the scope of
  this book (i.e. I have no idea how to do it).

* Similarly, you can't profile primitive functions or byte code compiled code.

* If you're doing a lot of functional programming with anonymous functions,
  it can be hard to figure out exactly which function is being called.
  The easiest way to work around this is to name your functions.

* Lazy evaluation mean that arguments are often evaluated inside another
  function. For example, in the following code, profiling would make it seem
  like `i()` was called by `j()` because the argument isn't evaluated until it's
  needed inside of `j().

    ```{r, eval = FALSE}
    i <- function() {
      pause(0.1)
      10
    }
    j <- function(x) {
      x + 10
    }
    j(i())
    ```

    If this is too confusing, you might want to create temporary variables that
    force computation to happen earlier.

### Exercises

1 `Rprof()` doesn't very accurately track time spend in `Sys.sleep()`
  (presumably because it's not actually doing any computation.)

## Improving performance {#improve-perf}

Once you've used profiling to identify a bottleneck, you need to make it faster. The following sections introduce you to a number of techniques that I've found broadly useful:

1. Look for existing solutions
1. Do less work
1. Vectorise
1. Parallelise
1. Avoid copies
1. Byte-code compile

A final technique is to rewrite in a faster language, like C++. This is a big topic and is convered in the next chapter, [Rcpp](#rcpp).

Before we get into the specific techniques, first I'll describe a general strategy and organisation style when working on performance. As always, remember that clarity and readibility are more important than speed. Your intution for bottlenecks is likely to be bad, so don't sacrifice readability for performance unless you _know_ it will have a significant impact on run-time.

### Code organisation

There are two traps that it's easy to fall into when making your code faster:

1. Making it faster by making it incorrect.
1. Thinking you've made it faster, when you've actually make it slower.

You can avoid these pitfalls with a good strategy, as outlined below. In this trivial example, we'll compare two approaches to computing the mean.

When tackling a bottleneck, you're going to come up with multiple approaches. Start by writing a function for each approach. The function should encapsulate all relevant behaviour, making it easy to check that it returns the correct result and time how long it takes to run. For our example, computing the mean, two approaches come to mind:

```{r}
mean1 <- function(x) mean(x)
mean2 <- function(x) sum(x) / length(x)
```

I recommend that you record everything you try, even approaches are slower: if you come back to problem in the future, it's useful to see both successes and failures. I often work in a Rmarkdown file to make it easy to intermingle code and explanation.

Next, generate a representative test case. The test case should be big enough to capture the essence of your problem, but small enough that it only takes a few seconds to run with your variations. You don't want it to take too long because you're going to run it many times as you compare approaches. Equally, you don't want it to be too small, otherwise the results might not generalise to the real problem. If you're interested in the performance so different sized inputs, you may need to generate more than one test case.

```{r}
x <- runif(100)
```

Use this test case to quickly check that all variants return the same result with `stopifnot()` and `all.equal()`. For real problems that have fewer possible outputs, you'll need more tests to make sure that an appraoch doesn't accidentally return the correct answer.

```{r}
stopifnot(all.equal(mean1(x), mean2(x)))
```

Finally, use the microbenchmark package to precisely compare how long each variation takes to run. For bigger problems, reduce `times` paramater so that it only takes a couple of seconds to run the benchmark. Remember to focus on the median time, and use the upper and lower quartiles to gauge the varibility of the measurement.

```{r}
microbenchmark(
  mean1(x),
  mean2(x)
)
```

(You might be surprised by the results here - `mean(x)` is considerably slower than `sum(x) / length(x)` because it makes two passes over the vector in order to be more accurate.)

Before you start experimenting, also note your target speed. How fast does a new approach need to be so that it's no longer the bottleneck? You want to know this so that you don't spend time optimising code that's no longer a bottleneck.

If you'd like to see this strategy in action, I've used it a few times on stackoverflow.

* http://stackoverflow.com/questions/22515525#22518603
* http://stackoverflow.com/questions/22515175#22515856
* http://stackoverflow.com/questions/3476015#22511936

### Has someone already solved the problem?

Once you've organised your code and captured the variations you've already thought of, it's natural to see what others have down. You are part of a large community, and it's quite possible that someone else has had the same problem as you. If your bottleneck is in function in a package, it's worth looking for other packages that do that same thing. Two good places to start looking at:

* [CRAN task views](http://cran.rstudio.com/web/views/). If there is a
  CRAN task view related to your problem domain, it's worth looking for
  alternative packages.

* Reverse dependencies of Rcpp, as listed on its
  [CRAN page](http://cran.r-project.org/web/packages/Rcpp). Since these
  packages use C++, it's possible they have implemented your bottleneck
  in a higher performance language.

Otherwise the challenge is describing your bottleneck in a way that helps you search for related problems and solutions. You'll find this much easier if you know the name of the  problem is and some common synonyms. It's hard to search for this knowledge (because you don't know what it's called!) but you can build it up over time by reading broadly about statistics and algorithms, or you can ask others. Talk to your colleagues and brainstorm some possible names, then search on google and stackoverflow. It's often helpful to restrict your search to R related pages. For google, try [rseek](http://www.rseek.org/); for stackoverflow, restrict your searches to the R tag by including `[R]` in your search term.

Record all solutions that you find, not just those that are already faster. Some solutions might be initially slower, but easier to optimise, so end up faster. You may also be able to combine the fastest parts of different approaches. If you've found a solution that's fast enough, congratulations! Otherwise, read on to learn ways to make your R code faster.

### Exercises

1.  What faster alternatives to `lm` are available? What alternatives are
    specifically designed to work with larger datasets?

1.  What package implements a version of `match()` that's faster for
    repeated look ups? How much faster is it?

1.  List four functions (not just in base R) that convert a string into a
    date time object? What are their strengths and weaknesses?

1.  How many different ways can you compute a 1d density estimate in R?

1.  What packages provide the ability to compute a rolling mean?

1.  What alternatives to `optim()` are available?

### Do as little as possible

Given a function, the easiest way to make it faster is to make it do less work. Sometimes you can replace an existing component with a faster, more specific function:

* `vapply()` is faster than `sapply()`, because you pre-specify the output
  type.

* `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()` are faster than
  the equivalent `apply()` invocations because they are vectorised (the topic
  of the next section).

* If you want to see if a vector contains a single value, `any(x == 10)`
  is much faster than `10 %in% x` because testing equality is simpler than
  testing for inclusion in a set.

Having these specific functions at your fingertips is a matter of having a good R [vocabulary](#vocabulary). The best way to expand your vocabulary over time is to regularly regularly read R code, like on R-help or on [stackoverflow](http://stackoverflow.com/questions/tagged/r).

Other functions will do less work if you give them more information about the problem. It's always worthwhile to carefully read the documentation and experiment with different arguments. Some examples that I've discovered in the past are:

* `read.csv()`: specify known the columns types with `colClasses`

* `factor()`: specify known levels with `levels`

* `cut()`: don't generate labels with `labels = FALSE` if you don't need them
  (and even better use `findInterval()` as mentioned in the see also section of
  the documentation.)

* `interaction()`: if you only need combinations that exist in the data, use
  `drop = TRUE`

Sometimes you can make a function faster by avoiding method dispatch. As we've seen ([Extreme dynamism](#extreme-dynamism)) method dispatch in R can be costly, so if you're calling a method in a tight loop, you can avoid some of the cost by doing method lookup once. For S3, you can do this by calling `generic.class()` instead of `generic()`. For S4, you can use `findMethod()` to find the method, save it to a variable, and then call that function. For example, calling `mean.default()` instead of `mean()` is quite a bit faster for small vectors:

```{r}
x <- runif(1e2)

microbenchmark(
  mean(x),
  mean.default(x)
)
```

This optimisation is a little risky: `mean.default()` is almost twice as fast, but will fail in surprising ways if `x` is not a vector. You should only use it if you know that the input will be a numeric vector.

Other times, if you know you're dealing with a specific type of input, you can come up with a faster strategy. For example, `as.data.frame()` is quite slow because it first coerces each element to a data frame and then `rbind()`s them together. If have a named list with vectors of equal lengths, you can turn it into a data frame directly. If you make the strong assumption that you have a list with the correct assumption you can write a method that's about 20x faster than the default.

```{r}
quickdf <- function(l) {
  class(l) <- "data.frame"
  attr(l, "row.names") <- .set_row_names(length(l[[1]]))
  l
}

l <- lapply(1:26, function(i) runif(1e3))
names(l) <- letters

microbenchmark(
  quickdf(l),
  as.data.frame.list(l),
  as.data.frame(l)
)
```

Again, note the tradeoff. This method is fast because it's dangerous, and if you give it bad inputs you'll get a corrupt data frame:

```{r}
quickdf(list(x = 1, y = 1:2))
```

To come up with this minimal method, I carefully read through then rewrote the source code for `as.data.frame.list()` and `data.frame()`. I made many small changes, each time checking that I hadn't broken existing behaviour. After several hours work, I'd isolated the minimal code shown above. This is a very useful technique. Most base R functions are written for flexiblity and functionality, not performance, and often rewriting for your specific need can yield substantial speed ups. To do this, you'll need to read the source code. It will often be complex and confusing, but don't give up!

The following example shows a progressive simplification of the `diff()` function for the special case of computing differences between adjacent values in a vector. At each step, I replace one arguments with a specific case, then check that the function still works. The initial function is long and complicated, but by restricting the arguments I not only make it around twice as fast, I also make it easier to understand.

```{r}
# The original function, reformatted after typing diff
diff1 <- function (x, lag = 1L, differences = 1L) {
  ismat <- is.matrix(x)
  xlen <- if (ismat) dim(x)[1L] else length(x)
  if (length(lag) > 1L || length(differences) > 1L || lag < 1L || differences < 1L)
    stop("'lag' and 'differences' must be integers >= 1")

  if (lag * differences >= xlen) {
    return(x[0L])
  }

  r <- unclass(x)
  i1 <- -seq_len(lag)
  if (ismat) {
    for (i in seq_len(differences)) {
      r <- r[i1, , drop = FALSE] - r[-nrow(r):-(nrow(r) - lag + 1L), ,
        drop = FALSE]
    }
  } else {
    for (i in seq_len(differences)) {
      r <- r[i1] - r[-length(r):-(length(r) - lag + 1L)]
    }
  }
  class(r) <- oldClass(x)
  r
}

# Step 1: Assume vector input. This allows me to remove the is.matrix()
# test and the method that use matrix subsetting.
diff2 <- function (x, lag = 1L, differences = 1L) {
  xlen <- length(x)
  if (length(lag) > 1L || length(differences) > 1L || lag < 1L || differences < 1L)
    stop("'lag' and 'differences' must be integers >= 1")

  if (lag * differences >= xlen) {
    return(x[0L])
  }

  i1 <- -seq_len(lag)
  for (i in seq_len(differences)) {
    x <- x[i1] - x[-length(x):-(length(x) - lag + 1L)]
  }
  x
}
diff2(cumsum(0:10))

# Step 2: assume difference = 1L. This simplifies input checking
# and eliminates the for loop
diff3 <- function (x, lag = 1L) {
  xlen <- length(x)
  if (length(lag) > 1L || lag < 1L)
    stop("'lag' must be integer >= 1")

  if (lag >= xlen) {
    return(x[0L])
  }

  i1 <- -seq_len(lag)
  x[i1] - x[-length(x):-(length(x) - lag + 1L)]
}
diff3(cumsum(0:10))

# Step 3: assume lag = 1L. This eliminates input checking and simplifies
# subsetting.
diff4 <- function (x) {
  xlen <- length(x)
  if (xlen <= 1) return(x[0L])

  x[-1] - x[-xlen]
}
diff4(cumsum(0:10))

x <- runif(100)
microbenchmark(
  diff1(x),
  diff2(x),
  diff3(x),
  diff4(x)
)
```

Once you've read [Rcpp](#rcpp) you'll be able to make `diff()` even faster for this special case.

A final example of doing less work is to work with a simpler data structure. For example, when working with a rows from a data frame, it's often much faster to work with row indices than data frames. For example, if you wanted to compute a bootstrap estimate of the correlation between two columns in a data frame, there are two basic approaches: you can either work with the whole data frame or with the individual vectors. The following example shows that working with vectors is about twice as fast.

```{r}
sample_rows <- function(df, i) sample.int(nrow(df), i, replace = TRUE)

# Generate a new data frame containing randomly selected rows
boot_cor1 <- function(df, i) {
  sub <- df[sample_rows(df, i), , drop = FALSE]
  cor(sub$x, sub$y)
}

# Generate new vectors from random rows
boot_cor2 <- function(df, i ) {
  idx <- sample_rows(df, i)
  cor(df$x[idx], df$y[idx])
}

df <- data.frame(x = runif(100), y = runif(100))
microbenchmark(
  boot_cor1(df, 10),
  boot_cor2(df, 10)
)
```

### Exercises

1.  How do the results change if you compare `mean()` and `mean.default()`
    on 10,000 observations, rather than on 100?

1.  Make a faster version of `chisq.test()` that only computes the Chi-square
    test statistic when the input is two numeric vectors with no missing
    values. You can either start from `chisq.test()` and make it simpler,
    or trying starting from
    [the definition](http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).

1.  Can you make a faster version of `table()` for the special case of
    two integer input vectors with no missing values? Can you use it to
    speed up your Chi-square test?

### Vectorise

If you've used R for any length of time, you've probably heard the admonishment to "vectorise your code". But what does that mean? Vectorising your code is not just about avoiding for loops (although that's often a step), but is more using functions that work with a complete vector, rather than the individual components. There are two key attributes of a vectorised function:

* It works with vectors. The most important feature of vectorised code is that
  it makes many problems simpler, because instead of thinking about the
  individual components of a vector, you think about working with the complete
  vector.

* Most vectorised function still involve a for loop, but the for loops is
  written in C instead of R. For loops in C are much faster because they have
  much less overhead, and as much work as possible is done upfront when R
  is compiled.

[Functionals](#functionals) stressed the importance of vectorised code as a higher level abstraction. Vectorisation is also important for writing fast R code. It doesn't mean using `apply()` or `lapply()` or even `Vectorise()`. Those just might improve the interface of the function, but they won't fundamentally change performance. Using vectorisation for performance reason is a matter of finding the existing R function that's implemented in C that most closely applies to your problem.

Vectorised functions that apply to many performance bottlenecks are:

* `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()` are vectorised
  matrix functions are will always be faster than `apply()`. You can sometimes
  use these functions to implement other vectorised function.

    ```{r}
    rowAny <- function(x) rowSums(x) > 0
    rowAll <- function(x) rowSums(x) == ncol(x)
    ```

* Vectorising subsetting can lead to big speed ups. Remember the techniques
  of lookup tables ([lookup tables](#lookup-tables)) and matching and
  merging by hand ([matching and merging by hand](#matching-merging)).
  Remember that you can use subsetting assignment to replace many values
  in one step: `x[is.na(x)] <- 0` will replace all missing values in `x` with
  0 if `x` is a vector, matrix or data frame.

* If you're converting continuous to categorical values make sure you know
  how to use `cut()` and `findInterval()`.

* Be aware of vectorised functions like `cumsum()` and `diff()`.

Vectorisation is challenging because it makes it harder to predict how different operations will scale up. The following example measures how long it takes to use character subsetting to lookup 1, 10 and 100 elements from a list. You might expect that looking up 10 elements would take 10x as long as looking up one, and looking up 100 elements would take 10x longer than looking up 10. In fact, the following example shows that it only takes about 8 times longer to lookup 100 elements than it takes to lookup 1.

```{r}
lookup <- setNames(as.list(sample(100, 26)), letters)

x1 <- "j"
x10 <- sample(letters, 10)
x100 <- sample(letters, 100, replace = TRUE)

microbenchmark(
  lookup[x1],
  lookup[x10],
  lookup[x100]
)
```

A special case of vectorisation is matrix algebra, where the loops are done by highly tuned external libraries like BLAS. If you can figure out a way to use matrix algebra to solve a problem, it will often be very fast. Unfortunately using matrix algebra is often recognising a know trick, so I don't know how to teach it here.

Vectorisation won't solve every problem, and rather than torturing an existing algorithm into one that uses a vectorised approach, you're often better off writing your own vectorised function in C++. You'll learn how to write your own vectorised functions in [Rcpp](#rcpp).

### Exercises

* How can you use `crossprod()` to compute a weighted sum? How much faster is
  it than the naive `sum(x * w)`?

### Avoid copies

A pernicious cause of slow R code is growing an object in a loop. Whenever you use `c()`, `append()`, `cbind()`, `rbind()` or `paste()` to create a bigger object, R must allocate space for the new object then copy the old object to its new home. This is Circle 2 in the [R inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) can be quite expensive if you're doing it many times in a for loop.

Here's a little benchmark that illustrates the difference. We first generate some random strings, and then combine them either iteratively with a loop with `collapse()`, or once with `paste()`. Note that the peformance of `collapse()` get relatively worse as the number of strings grows: combining 100 strings takes almost 30 times longer than combining 10 strings.

```{r}
random_string <- function() {
  paste(sample(letters, 50, replace = TRUE), collapse = "")
}
strings10 <- replicate(10, random_string())
strings100 <- replicate(100, random_string())

collapse <- function(xs) {
  out <- ""
  for (x in xs) {
    out <- paste0(out, x)
  }
  out
}

microbenchmark(
  collapse(strings10),
  collapse(strings100),
  paste(strings10, collapse = ""),
  paste(strings100, collapse = "")
)
```

Modifying an object in a loop, e.g. `x[i] <- y`, can also create copies, depending on the class of `x`. [Modification in place]{#modification-in-place} discussed this problem in more depth and gives you tools to determine when it looks like you're modifying an object in place, but instead you're actually copying multiple times. Data frames are particularly common offenders.

Note that vectorised code avoids both of these potential problems by only modifying or creating an object once.

### Byte code compilation

R 2.13.0 introduced a new byte code compiler which can increase the speed of certain types of code. Using the compiler is an easy way to get speed ups - it's easy to use, and if it doesn't work well for your function, then you haven't invested a lot of time in it, and so you haven't lost much. The following example shows the pure R version of `lapply()` from [functionals](#lapply). Compiling it gives a considerable speedup, although it's still not quite as fast as the C version provided by base R.

```{r}
lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

lapply2_c <- compiler::cmpfun(lapply2)

x <- list(1:10, letters, c(F, T), NULL)
microbenchmark(
  lapply2(x, is.null),
  lapply2_c(x, is.null),
  lapply(x, is.null)
)
```

This is a relatively good example for byte code compiling. In most cases you're more like to get a 5-10% speedup. This example optimises well because it uses a for-loop, something that is generally rare in R.

All base R functions are byte code compiled by default.

### Exercises

1.  Imagine you want to compute the boostrap distribution of a sample
    correlation. In other words you have data like in the example below, and
    you want to run `cor_df()` many times. How can you make this code faster?
    (Hint: the function does three things that you can speed up.)

    ```{r, eval = FALSE}
    n <- 1e6
    df <- data.frame(a = rnorm(n), b = rnorm(n))

    cor_df <- function(i) {
      i <- sample(seq(n), n * 0.01)
      cor(q[i, , drop = FALSE])[2,1]
    }
    ```

    Is there a way to vectorise this procedure?

### Case study: t-test

The following case study explores how to make t-tests faster by doing the minimum and exploiting vectorised functions. This case study is based on an example in [Computing thousands of test statistics simultaneously in R](http://stat-computing.org/newsletter/issues/scgn-18-1.pdf) by Holger Schwender and Tina Müller, and I thorougly recommend reading the full paper to see the same ideas applied to other tests.

Imagine we have run 1000 experiments, each of which collected data on 50 individuals. The first 25 individuals in each experiement were assigned to group 1 and the others to group 2. We'll generate some random data to represent this data.

```{r}
m <- 1000
n <- 50
X <- matrix(rnorm(m * n, mean = 10, sd = 3), nrow = m)
grp <- rep(1:2, length = n)
```

For data in this form, there are two basic ways to use `t.test()` to perform a t-test. We can either use the formula interface, to provide two vectors, one for each group. Timing these reveals that the formula interface is considerably slower.

```{r, cache = TRUE}
system.time(for(i in 1:m) t.test(X[i, ] ~ grp)$stat)
system.time(for(i in 1:m) t.test(X[i, grp == 1], X[i, grp == 2])$stat)
```

Of course, a for loop just computes the values, but doesn't save them, so we probably actually want to use `apply()`. This adds little overhead:

```{r}
compT <- function(x, grp){
  t.test(x[grp == 1], x[grp == 2])$stat
}
system.time(apply(X, 1, compT, grp = grp))
```

How can we make this faster? First, we could try doing less work. If you look at the source code, `stats:::t.test.default()`, you can see it does a lot more than just computing the t-statistic. It also computes the p-value and creates nice output for printing. Maybe we can make our code faster by stripping out those pieces.

```{r}
my_t <- function(x, grp) {
  t_stat <- function(x) {
    m <- mean(x)
    length <- length(x)
    var <- sum((x - m) ^ 2) / (n - 1)

    list(m = m, n = n, var = var)
  }

  g1 <- t_stat(x[grp == 1])
  g2 <- t_stat(x[grp == 2])

  pooled_se <- sqrt(g1$var / g1$n + g2$var / g2$n)
  (g1$m - g2$m) / pooled_se
}
system.time(apply(X, 1, my_t, grp = grp))
```

That gives us about a 5x speed up.

Now that we have a fairly simple function, we can make it faster still by vectorising it. Instead of looping over the array outside the funtion, we vectorise the function, modifying `t_stat()` to work with a matrix of values instead of a vector: `mean()` becomes `rowMeans()`, `length()` becomes `ncol()`, and `sum()` becomes `rowSums()`. The rest of the code stays the same.

```{r}
rowtstat <- function(X, grp){
  t_stat <- function(X) {
    m <- rowMeans(X)
    n <- ncol(X)
    var <- rowSums((X - m) ^ 2) / (n - 1)

    list(m = m, n = n, var = var)
  }

  g1 <- t_stat(X[, grp == 1])
  g2 <- t_stat(X[, grp == 2])

  pooled_se <- sqrt(g1$var / g1$n + g2$var / g2$n)
  (g1$m - g2$m) / pooled_se
}
system.time(rowtstat(X, grp))
```

That's much faster! It's at least 40x faster than our previous effort, and around 1000x faster than where we started.

Finally, we could try using byte code compilation. Here we'll need to use `microbenchmark()` instead of `system.time()` in order to get enough accuracy to see the difference:

```{r}
rowtstat_bc <- compiler::cmpfun(rowtstat)

microbenchmark(
  rowtstat(X, grp),
  rowtstat_bc(X, grp)
)
```

For this example, it doesn't help at all.

### Parallelise

Making code parallel doesn't save the computer time, but it does save your time, because multiple computers work on different parts of the problem at the same time. Parallel computing is a complex field, and there's no way to cover it in depth here. Some resources I recommend are:

* [Parallel R](http://amzn.com/B005Z29QT4) by Q. Ethan McCallum and Stephen Weston.

* [Parallel computing for data science](http://heather.cs.ucdavis.edu/paralleldatasci.pdf), by
  Norm Matloff.

What I want to focus on is an simple application of parallel computing to what is called "trivially" parallelisable problems. If you're problem consists of many simple problems that can be solved independently it's very easy to spread the computation over multiple cores of your computer. For example, if you have a for loop or equivalent `lapply()` you can easily run each operation in parallel.  This is particularly easy on linux and the mac because you can easily substitute `mclapply()` for `lapply()`. The following snippet of code runs a trivial (but slow) function on all cores provided by the current machine.

```{r}
library(parallel)
cores <- parallel::detectCores()
cores

pause <- function(i) {
  function(x) Sys.sleep(i)
}

system.time(lapply(1:10, pause(0.25)))
system.time(mclapply(1:10, pause(0.25), mc.cores = cores))
```

Life is a bit harder on windows. You need to first set up a local compute cluster and then use `parLapply()`:

```{r}
cluster <- parallel::makePSOCKcluster(cores)
system.time(parLapply(cluster, 1:10, function(i) Sys.sleep(1)))
```

The main difference between `mclapply()` and `makePSOCKcluster()` is that the individual processes generated by `mclapply()` inherit from the current process, while the processes generated by `makePSOCKcluster()` start a fresh session. This means that most real code will need some setup. Use  `clusterEvalQ()` run arbitrary code on each cluster and load needed packages, and `clusterExport()` to copy objects in the current session to all the remote sessions.

```{r, error = TRUE}
x <- 10
psock <- parallel::makePSOCKcluster(1L)
clusterEvalQ(psock, x)

clusterExport(psock, "x")
clusterEvalQ(psock, x)
```

Note there is some communication overhead to parallel computing. If the subproblems are very fast, then parallelisation might hurt rather than helping.  It's also possible distribute computation over larger networks of computers (not just cores on your local computer). That's beyond the scope of this book since it gets increasingly complicated to balance computation and communication costs. A good place to start in the [High performance computing](http://cran.r-project.org/web/views/HighPerformanceComputing.html) CRAN task view.

### Other techniques

Writing fast R code is part of the general task of becoming a better R programmer. As well as a the specific hints in this chapter, if you want to write fast R code, you'll need to generally improve your programming skills. Some ways to do this are to:

You can also reach out to the community for help. Stackoverflow can be a useful place to ask, but you'll need to put some effort into creating an example that captures the salient features of your problem while being easily digestible. If it's too complex few people will have the time and motivation to attempt a solution. If it's too simple, you'll get answers that solve the toy problem, not the real problem. If you also try to answer questions on stackoverflow, you'll quickly get a feel for what makes a good question.

* [Read R blogs](http://www.r-bloggers.com/) to see what performance
  problems other people have struggled with, and how they have made their
  code faster.

* Read other R programming books, like
  [The Art of R Programming](http://amzn.com/1593273843). Read the
  [R inferno](http://www.burns-stat.com/documents/books/the-r-inferno/) to
  learn about common traps.

* Take an algorithms and data structure course to learn some theory and
  well known ways of tackling certain classes of problems. I have heard
  good things about Princeton's
  [Algorithms](https://www.coursera.org/course/algs4partI) course offered by
  coursera.

* Read general books about optimisation like
  [Mature optimisation](http://carlos.bueno.org/optimization/mature-optimization.pdf)
  by Carlos Bueno, or the [Pragmatic Programmer](http://amzn.com/020161622X) by
  Andrew Hunt and David Thomas.