-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path11-practice.Rmd
More file actions
309 lines (231 loc) · 12.4 KB
/
Copy path11-practice.Rmd
File metadata and controls
309 lines (231 loc) · 12.4 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
# Practice
In this section, we'll work on annotating and understanding unfamiliar code.
The code examples may look intimidating! But remember, we're less concerned here about understanding every detail about what the code is doing, and more about using what we've learned about programming logic as a lens through which to begin interpreting code.
## New types of conditions (in SQL)
Consider the following SQL command:
```SQL
SELECT * FROM OceanBuoys
WHERE Ocean = 'Atlantic' AND (BuoyName LIKE 'S%' OR BuoyName LIKE 'K%');
```
One option for navigating unfamiliar code is to break down the code into its
component parts. Using that approach, you can differentiate functions separate from variables and names to get a better understanding of what type of data you're working with, and what the code is trying to do.
__Exercise__: Using what we've learned in prior sections, try to answer the following questions.
- What does the `OceanBuoys` term most likely refer to - variable or function?
- Tip: Variables will contain a data object. Functions perform an action or task.
- Assuming that `SELECT` is a function, what might it do?
- The `*` character is new. What might `*` indicate, in combination with `SELECT`?
- Parsing the second line that starts with `WHERE`, can you make an educated guess about what `Ocean` and `BuoyName` refer to, in terms of variables?
- Can you guess (or feel free to use your favorite search engine) what `LIKE 'S%'` and `LIKE 'K%'` would indicate?
- Note the `;` at the end of the line starting with `WHERE`. What might this mean?
<details>
<summary>Do your answers match these?</summary>
- The `OceanBuoys` term here refers to a table or dataframe or matrix (really, any kind of tabular format).
- The `SELECT` function selects columns from the `OceanBuoys` table.
- The `*` is a wildcard matching symbol. It indicates that we want to return _all_ columns from the table, no matter what their column names are.
- The `OceanBuoys` table must have columns named `Ocean` and `BuoyName`.
- If we wanted to return only these columns, the syntax could be: `SELECT Ocean, BuoyName FROM OceanBuoys`
- `LIKE` is used for pattern matching. The symbol `%` is a wildcard character in SQL. This code is matching values that start with "S" or with "K".
- The conditions after `WHERE` are the row-based conditions for what will be returned, analogous to `SELECT` for the column-specific condition.
- The `;` is used to indicate the end of a statement in SQL. It tells the computer that the 'thought' is finished, and the action of _doing_ the thought can commence.
</details>\
**Exercise, continued:** Now, can you write out a narrative of what you see the code is trying to do?
<details>
<summary>Example written narrative</summary>
This code is returning a subset of data from the `OceanBuoys` table. Starting from the full `OceanBuoys` table, return all the columns from this table, but keep only rows where `Ocean` is equal to "Atlantic" _and_ where the `BuoyName` value starts with "S" or "K". The returned data should be tabular, with the same number of columns as the full `OceanBuoys` table, but likely with less rows (only those that met the condition).
</details>\
<hr>
## New syntax and terms (in R)
While the logic underlying programming languages stays consistent, a central challenge is that different languages often have their own special syntax, which can take a while to get used to. Don't panic! Familiarity comes with experience and in the meantime, Google is your friend (well, the search engine of your choice. Some of us prefer DuckDuckGo).\
Here is an example of R code:
```r
1. df_new <- df %>%
2. select(-respondent_name) %>%
3. mutate(identifier = paste(respondent_id, survey_wave, sep = "_")) %>%
4. mutate(survey_type =
5. ifelse(survey_wave %in% c("first", "second"), "phone", "in person"))
```
__Exercise__: Go line by line and annotate, in your own words, what that line of code is doing. Then, combine these into a written narrative of what this code is doing.\
Each line has a new take on the same logic we've covered in prior sections; a breakdown of new terms is below to guide your annotation.
- Line 1
- What is `df_new <- df` doing here?
- Hint: refer to 8.1.3, Common Variable Names
- What does the `%>%` symbol indicate?
- Line 2
- What is the `select()` function likely doing?
- What might it mean that the argument in this function is preceded by `-`?
- Line 3
- Using context clues (and your favorite search engine) what do you think the `mutate()` function does?
- The new `paste()` function has three arguments (`paste(arg1, arg2, arg3)`). What do you think the arguments are for?
- Line 4 & 5 (note: new line not strictly necessary, only to fit in page width without needing scroll)
- This `ifelse()` statement has a different format than we've covered so far, but the concept is the same. Assuming this `ifelse()` statement has three arguments (`ifelse(condition, mystery1, mystery2)`) what might the two mystery arguments be specifying?
- Looking at the section `survey_wave %in% c("first", "second")`, what do you think this would translate to, as a written explanation of the task here?
- Finally, it is always important to understand the data type and structure of the data being acted upon. Keep in mind, based on the questions above, what is the structure of the data being used here?
<details>
<summary>Example narrative</summary>
Starting from the table/dataframe called `df`, we want to keep (`select`) all columns _except_ the column named `respondent_name`.\
Then, make a new column called `identifier`. This new column is created by pasting together the value in the `respondent_id` column and the `survey_wave` column, separated by an underscore, `_`.\
Then, make another new column called `survey_type`. The values in this column are determined by an `ifelse` statement: if the value in the `survey_wave` column is any of the values specified in the list `("first", "second")` (so if the value is `first` or `second`), then the value in the `survey_type` column will be `phone`. Otherwise, the value will be `in person`.\
</details>\
<hr>
## New functions in Stata
This challenge uses Stata, a proprietary statistical analysis platform. Stata has a number of unique features and syntax that can make it challenging to interpret. (At least in this instructor's experience, Stata is not user-friendly, but your mileage may vary!)
Relying on what we've learned so far and context clues, what is the code below doing?
```stata
replace variable = variable[_n-1] if missing(variable)
```
<details>
<summary>What is this code doing?</summary>
This is a bit of code to replace any missing values of `variable` with with previous (`n-1`) value of `variable`.
There are [many ways to replace missing values in Stata](https://www.stata.com/support/faqs/data-management/replacing-missing-values/), this is one.
</details>
<hr>
## C++
This is a C++ file named `testScratchDoc.cpp`
Review the file then answer the following questions, annotating as you see fit.
- How is the code similar to some of the code that we've seen so far?
- What do you think the code does (generally)?
- What questions do you have about it?
- What are some suggestions to make it easier to read?
```c
#include <iostream>
#include <stdio.h>
#include <cstring>
int main() {
printf("Hello World!\n");
// std::cout << "Hello, World!" << std::endl;
// mbr (a,b,c,d)
int a=6;
int b=3;
int c=8; //c=b+1; inner/outer test c=4
int d=5;
//mbr (i,j,k,l)
int i=1; // i=n; inner/outer test i=1
int j=2;
int k=3; // k=j+1; inner/outer test k=3
int l=4;
bool intersect_x = false;
bool intersect_y = false;
printf("\nMBR1[%d,%d,%d,%d]\n",a,b,c,d);
printf("MBR1[%d,%d,%d,%d]\n",i,j,k,l);
if(!((c<i) ||(k<a) ))
{
printf("x intersects!\n");
intersect_x = true;
printf("intersect_x is %d\n", intersect_x);
} else{
printf("x does not intersect!\n");
}
if(!((d<j) ||(l<b) ))
{
printf("y intersects!\n");
intersect_y = true;
printf("intersect_y is %d\n", intersect_y);
} else{
printf("y does not intercept!\n");
}
if(intersect_x&&intersect_y)
printf("MBR1 intersects with MBR2\n");
else
printf("MBR1 does not intersect MBR2\n");
/*
// testing writing to memory using pointers
int page = 1;
int *pData; // pointer to data
pData = &page;
printf("\npdata is %s; \n&pData is %p; \n*pdata is %d",pData, &pData, *pData);
int* newP; // new pointer to data
int a = 2;
int b = 3;
int m = sizeof(b);
newP = pData+m;
memcpy(pData,&a,sizeof(int));
printf("\npdata is %s; \n&pData is %p; \n*pdata is %d",pData, &pData, *pData);
memcpy(pData+m,&b,sizeof(int));
// newP = pData+sizeof(int);
printf("\nnewP is %s; \n&newP is %p; \n*newP is %d",newP, &newP, *newP);
*/ // end testing writing to memory using pointers
// testing whether you can initialize a structure with an outside variable
// not really, you can with an outside constant
/* const int b = 100;
const int c = 7;
const int a = (int)(b/c);
int i;
struct test {
int test[a];
};
test mytest;
for(i=0; i<a; i++)
{
mytest.test[i]=i;
}
for(i=0; i<a; i++)
{
printf("\nmytest[%d]=%d",i,mytest.test[i]);
}*/
return 0;
}
/*
#include <stdio.h>
main() {
printf("Hello World!\n");
char sentence []="test your 12 7 42";
char str[20];
int a, b, c;
sscanf(sentence,"%s %s %d %d %d", str, str, &a, &b, &c);
printf("%d %d %d\n", a,b,c);
printf("%s\n", str);
sscanf(sentence,"%s your %d 7 %d", str, &a, &b);
printf("%d %d %d\n", a,b,c);
printf("%s\n", str);
printf("%s\n", sentence);
}*/
```
<hr>
## MATLAB
This is a MATLAB file named `assignment01_leaf.m`
Review the file then answer the following questions, annotating as you see fit.
- How is the code similar to some of the code that we've seen so far?
- What do you think starts a comment line?
- What do you think the code does (generally)?
- What questions do you have about it?
- What are some suggestions to make it easier to read?
```octave
function assignment01_leaf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% (C) Student Name Here %%%%%%%%%%%%%%%%%%%%%%%%%
TRAIN = load('FaultsNNA_csv'); % Only these two lines need to be changed to test a different dataset. %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TRAIN_class_labels = TRAIN(:,1); % Pull out the class labels.
TRAIN(:,1) = []; % Remove class labels from training set.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Compute and display Default Rate and other Basic Information
%
[class,count] = mode(TRAIN_class_labels(:,1));
default_error_rate=1-(count/size(TRAIN_class_labels(:,1),1));
disp(['The dataset you tested has ', int2str(length(unique(TRAIN_class_labels))), ' classes.']);
disp(['The training set is of size ', int2str(size(TRAIN,1)),'.']);
disp(['The time series are of length ', int2str(size(TRAIN,2)),'.']);
disp(['The dataset''s most common class is ',num2str(class),' with a total of ',num2str(count),' occurances.']);
disp(['The default error rate is ',num2str(default_error_rate),'.']);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Compute and display the k-fold error rate
%
k=length(TRAIN_class_labels);
k_fold_error_rate = k_fold_cross_validation(TRAIN,TRAIN_class_labels,k);
disp(['The normal k-fold error rate is ',num2str(k_fold_error_rate),' where k=',num2str(k),' and length of train set is ',num2str(length(TRAIN_class_labels)),'.']);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Here is a sample classification algorithm, it is the simple (yet very competitive) one-nearest
% neighbor using the Euclidean distance.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function predicted_class = Classification_Algorithm(TRAIN,TRAIN_class_labels,unknown_object)
best_so_far = inf;
for i = 1 : length(TRAIN_class_labels)
compare_to_this_object = TRAIN(i,:);
distance = sqrt(sum((compare_to_this_object - unknown_object).^2)); % Euclidean distance
if distance < best_so_far
predicted_class = TRAIN_class_labels(i);
best_so_far = distance;
end
end;
```