-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathAssignment 5.Rmd
More file actions
885 lines (789 loc) · 53.5 KB
/
Copy pathAssignment 5.Rmd
File metadata and controls
885 lines (789 loc) · 53.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
---
title: "Assignment 5"
author: "Janice Luong"
date: "November 17, 2015"
output: pdf_document
---
In this assignment I used the old (original) database because 8.5gig would have taken too long to download. I defined movies as movies.type = 3 and I filtered out the tv series from movies.type = 3. However, sometimes TV shows do occassionally show up, for example in Question 13, when I look up the movies Andrew Garfield has been in, it also counts TV series such as award shows as a "movie". I'm not sure why even though I did filter out the series.idmovies from movies.idmovies.
Resouces: Piazza, STA 141 Lectures Discussion OH, Stacked Overflow, the Web, and sample solutions from past assignments.
```{r, results = 'hide', message=FALSE}
library(RSQLite)
con = dbConnect(SQLite(), "C:/JANICE/UC DAVIS/FALL 2015/STA 141/Assignment 5/imdb_data")
dbListTables(con)
```
###Question 1: How many actors are there in the database? How many movies?
```{r, cache=TRUE}
# get the amount of unique actors from the column idactors. SELECT the column idactors,
# but get only the unique values and COUNT all the values FROM the table actors
dbGetQuery(con, "SELECT DISTINCT COUNT(idactors) FROM actors")
# get the amount of movies. SELECT the column idmovies and COUNT all the non-null values.
# LEFT JOIN series ON series.idmoves IS NULL so that way I can filter out all the TV series.
# I only want the movies.type = 3 because thats where the movies are.
dbGetQuery(con, "SELECT
COUNT(movies.idmovies)
FROM movies
LEFT JOIN series
ON series.idmovies = movies.idmovies
WHERE series.idmovies IS NULL
AND movies.type = 3
LIMIT 1")
```
I took the unique count of actors from the column idactors because some actor names repeat and each actor only has 1 ID associated with them. Counting the unique ID's ensures that I do not double count an actor. There are 3500167 actors.
The number of movies is 1014745. I counted idmovies, where the type was '3' because type 3 are movies and I took out the TV series by doing a LEFT JOIN on series and setting the WHERE clause to series.idmovies IS NULL. This removes any movies that have a series ID (I repeated this step for all problems that ask about movies). Also movies.type = 1 had a lot of adult/rated X films so I filtered those out and movies.type = 2 were just video games.
###Question 2: What time period does the database cover?
```{r, cache=TRUE}
# UNION ALL is the same as UNION but it allows duplicates
# take from table title because in table movie, the min year was 1 which is not possible
# SELECT MAX and MIN years FROM a table that has a column of years, which is taken FROM
# the table movies and the table aka_titles (since they both have a column named years),
# use UNION ALL to combine both columns, allowing duplicates. I filtered the years so it
# was restricted between 1885 and 2016 (exclusive)
dbGetQuery(con, "SELECT
MAX(year) as 'Max Year',
MIN(year) as 'Min Year'
FROM(SELECT
year
FROM movies
WHERE (year > 1885 AND year < 2016)
UNION ALL
SELECT year
FROM aka_titles
WHERE (year > 1885 AND year < 2016))")
```
The time period the database covers between years 1887 and 2015.
I tried to use JOIN; however, FULL JOIN is not supported on SQLite. I wanted a FULL JOIN so that way I can get all the years possible from both year columns from I realized that since I simply only want two columns from two different tables (and they have the same amount of rows), I can used UNION ALL (this allows duplicates) to get the min and max years from the combined columns. I set the restriction where year must be greater than 1885 because the first motion picture ever made was in 1886. Year must also be less than 2016 because it is currenly 2015, so it is not possible to have a movie past 2015. Although IMDB covers upcoming movies (up to year 2025) we cannot be completely sure that the movie will be released, so I stopped at year 2015.
###Question 3: What proportion of the actors are female? male?
```{r, cache=TRUE}
# get the number of females and males
# if gender is 1, then set it to male, else set it to female, we want two
# columns called 'gender' and 'percent'. use comma to seperate the columns.
# COUNT counts the number of lines returned (if gender is NULL then set
# it to 2) / -- division (counts total number of entries in the actors table)
# / -- division 100.0 -- divides the total number of entries in the actors
# table by 100, use as 'percent' means to change the column name to percent.
# Get all this from the table actors and group them by gender
dbGetQuery(con, "SELECT
CASE
WHEN gender = '1'
THEN 'male'
ELSE 'female'
END as 'gender',
COUNT(
IFNULL(gender, 2))/((SELECT COUNT(*) FROM actors)/100.0
) as 'percent'
FROM actors
GROUP BY gender")
```
About 64.6% proportion of actors are males and about 35.4% of actors are females. More males are actors because acting was a prodominately male job until later in the 20th century when more females began acting.
I noticed that when I did COUNT(gender), it would give me back the total number of males, meaning that it would completely ignore the gender rows with N/A as the gender for females. Using the query: SELECT
(SELECT COUNT(gender) FROM actors WHERE (gender = '1'))/((SELECT COUNT(*) FROM actors)/100.0), I kept on getting the female count as '0'. I found on Stacked Overflow: COUNT(SMTH)/((SELECT COUNT(SMTH) FROM SMTH)/100). What the percentage expression does is it divides 100 twice, which is how I end up with a percentage and not a decimal. Another way to look at it is it first divides the demonimator by 100 then takes the result and then does numberator/result.
Originally I had "SELECT CASE WHEN gender = '1' THEN 'male' ELSE 'female' END as 'gender', COUNT(gender) as 'count' FROM actors GROUP BY gender", but I kept on getting females are 0. In the column gender from table actors, 1 are males and 0 are females; however, the 0 was auto set to N/A. To solve this issues, I used CASE (similar to if-else statements), which allows me to go through every row and sort them between 'male' or 'female', where if gender is '1', then it will be set to male, anything else, will be set to female. COUNT(ISNULL(gender,1)), where COUNT is the function and inside it, it 1 paramter, which is ISNULL(). The ISNULL() is a function with parameters gender and 1 gender is the value of the column 1 is the value that is given if gender is null.
###Question 4: What proportion of the entries in the movies table are actual movies and what proportion are television series, etc.?
```{r, cache=TRUE}
# adult/rated X films are type 1, games are type 2, movies/TV series are type 3
# I sorted by using CASE...END and found all the movie type 1 and 2 and set them to
# different variable names, and the last one, which is type 3 will be another name.
# Next I made a column that got the percentages of each type.
dbGetQuery(con, "SELECT
CASE
WHEN type = '1'
THEN 'Adult Movies'
WHEN type = '2'
THEN 'Video Games'
ELSE 'Movies/TV Series'
END as 'type',
COUNT(type) / ((SELECT COUNT(*) FROM movies)/100.0)
as 'percent'
FROM movies
WHERE type IS NOT NULL
GROUP BY type")
```
```{r}
# get the percentage of just movies without the TV series, and percentage of
# TV series without movies. SELECT the COUNTed rows and rename it to 'Count
# of Movies', then from there I made another column which gets the percentage
# of all the COUNT(*) and divides it by the total rows of movies/100. I filtered
# my result so it was only movies by settingseries.idmovies to NULL (so any
# movie with a series ID will be removed) and I only wanted
# movie.type = 3 since type 1 and 2 are rated X files and video games
dbGetQuery(con, "SELECT
COUNT(*) as 'Count of Movies',
(SELECT
COUNT(*)
FROM movies) as 'Total',
COUNT(*) / ((SELECT COUNT(*) FROM movies)/100.0) as 'Percent'
FROM movies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND movies.type = '3'")
```
In the table movies, 11.348795% are adult movies (rated X), 1.184535% are video games, 78.133217% movies/TV series. So this means 72.75766% are just movies and 5.375557% are just TV series. I did not count movies.type = 1 as movies because when I checked what kind of movies they were, they were adult/rated X films, which meant it was most likely not released in theatres so I did not count/view it as a "regular movie".
When I used the query SELECT CASE WHEN type = '1' THEN 'movies' WHEN type = '2' THEN 'games' ELSE 'series' END as 'type', COUNT(type) / (SELECT COUNT(*) FROM movies)/100.0 as 'percent' FROM movies GROUP BY type, I kept on getting two columns of the same name, but with different percentages, which looked like this (output is before I changed the CASE names):
type percent
1 series 0.000000
2 movies 11.348795
3 games 1.184535
4 series 78.133217
When I tried my SQL expression with the CASE...END and replaced it with type, that is when I noticed the row of NA's, which looked like this:
type percent
1 NA 0.000000
2 1 11.348795
3 2 1.184535
4 3 78.133217
Adding "WHERE type IS NOT NULL" removed the row of NA because the percentage was 0.0000 so it was not significant information to have. Adding the query "type IS NOT NULL" does not count the NA's from the table movies.
###Question 5: How many genres are there? What are their names/descriptions?
```{r, cache=TRUE}
# get the all the genres in the database by SELECTing the column genre
# (renamed to 'Genres') FROm the table genres
genres = dbGetQuery(con, "SELECT genre 'Genres' FROM genres")
```
To get all the genres that are in the database, I SELECTed the column genre FROM the table genres. This outputs a table with the column "genre".
There are 32 genres. Their Names and descriptions are:
```{r, echo = FALSE, cache=TRUE}
genres
```
###6: List the 10 most common genres of movies, showing the number of movies in each of these genres.
```{r}
# SELECT the columns idgenres and genres from genres because we want the count,
# but just the count without the name/label of the genre is not helpfull, so
# with both columns it gives us meaningful data. This done through INNER JOINing
# the tables genres and movies_genres. Then COUNT the rows idmovies_genres from
# the table movies_genres and group them by genre (from genres). Filter out the
# TV series by LEFT JOINing series and setting series.idmovies to NULL and get
# only the movies.type = 3. Decending, meaning the next number is smaller,
# and get the first 10
dbGetQuery(con, "SELECT movies_genres.idgenres 'Genre ID', genres.genre 'Genre Discription',
COUNT(*) 'Total'
FROM movies_genres
INNER JOIN genres
ON movies_genres.idgenres = genres.idgenres
INNER JOIN movies
ON movies.idmovies = movies_genres.idgenres
LEFT JOIN series
ON series.idmovies = movies.idmovies
WHERE series.idmovies IS NULL
AND movies.type = 3
GROUP BY genres.genre
ORDER BY
COUNT(*)
DESC
LIMIT 10")
```
Originally, I had: SELECT idgenres, COUNT(idmovies_ genres) FROM movies_genres GROUP BY idgenres ORDER BY COUNT(idmovies _genres) DESC LIMIT 10; but it did not have the genre names/discriptions. In order to get the genre names/discriptions I used INNER JOIN.
###Question 7: Find all movies with the keyword 'space'. How many are there? What are the years these were released? and who were the top 5 actors in each of these movies?
```{r, cache=TRUE}
# get the number of movies with the keywords space (with anything following
# or before the word space). This is done by getting the COUNT of movies ID
# through INNER JOIN of the tables keywords and movies_keywords ON the key
# column idkeywords. Filtered out the movies with a series ID by doing a
# LEFT JOIN on movies and series ON the key column idmovies and
# set series.idmovies to NULL and get only the rows with movies.type = 3.
dbGetQuery(con, "SELECT
COUNT(movies_keywords.idmovies) 'Number Of Movies With Keyword space'
FROM keywords
INNER JOIN movies_keywords
ON keywords.idkeywords = movies_keywords.idkeywords
INNER JOIN movies
ON movies.idmovies = movies_keywords.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND movies.type = 3
AND LOWER(keywords.keyword)
LIKE 'space'")
```
```{r, cache=TRUE}
# get the years, this is done by SELECTing the year column from INNER JOINing
# the tables keywords and movies_keywords on the key column idkeywords. Next
# I do another INNER JOIN on the tables movies and movies_keywords on the key
# column idmovies. This is needed so I can get to the table with the years that
# are also connected to the tables keywords and movies_keywords. Next I filtered
# my result to remove the Tv series and only the movie.type = 3.Next I looked for
# keywords that are LIKE space. Finally I GROUPed BY year so I can get only
# the unique years.
dbGetQuery(con, "SELECT movies.year 'Year'
FROM keywords
INNER JOIN movies_keywords
ON keywords.idkeywords = movies_keywords.idkeywords
INNER JOIN movies
ON movies.idmovies = movies_keywords.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND movies.type = 3
AND year IS NOT NULL
AND LOWER(keywords.keyword)
LIKE 'space'
GROUP BY movies.year")
```
```{r, cache=TRUE}
# get the top 5 actors in each of those movies, this is done by getting
# the relevent columns from the INNER JOINing of the tables actors, movies
# and movies_keywords. This is done to get the relevent columns: lname, fname,
# billing_position, titles and keywords. Next I filtered my results so the I
# only get actors with only a movie ID (no series ID), movie type of 3, have
# been in movies that have a keyword 'space' and have a billing position of 5
# or less. Then I GROUPed by the actor's lname and fname (or else I will
# have repeated actor names) and then I finally ordered by the movie title then
# ordered by the billing position of the actor for that movie.
dbGetQuery(con, "SELECT actors.lname ||', '|| actors.fname 'Last Name, First Name',
acted_in.billing_position 'Billing Postion',
movies.title 'Movie Title', keywords.keyword 'Keyword'
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON movies.idmovies = acted_in.idmovies
INNER JOIN movies_keywords
ON acted_in.idmovies = movies_keywords.idmovies
INNER JOIN keywords
ON keywords.idkeywords = movies_keywords.idkeywords
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND keywords.keyword LIKE 'space'
AND movies.type = 3
AND acted_in.billing_position <= 5
GROUP BY actors.lname ||', '|| actors.fname
ORDER BY movies.title, acted_in.billing_position
LIMIT 10")
```
I get 73 movies with the keyword space.
At first when I found the all the movies with the keyword 'space', I was not sure if the word 'space' is case sensitive in SQLite. After finding the amount of movies with the keyword 'space' I added LOWER() around the column I am looking though, but I still got the same amount of movies with the keyword 'space'. This could mean that either SQLite is not case sentitive or all the movies I found did not have 'space' capitalized. I kept LOWER() in the query just to be safe.
Looking for the top 5 actors for each movie with the keyword 'space' was difficult (learned how to use JOIN starting here, then went back to 6 and did it using JOIN). I first looked at the schema for the tables to see how each table were connected to each other by which columns. I needed to JOIN the tables actors, acted_ in, movies, movies_keywords, and keywords. All together I would need four INNER JOINS. Then I had to decide which imformation was important, so that way I won't have a table clustered with information that I won't be helpful in answering the question.
I needed to filter my results. Using the WHERE field, I want only the movies, which is type 3, with the keyword 'space'. I also need to set the restriction in the billing _postition to be 5 or less because we want the top 5 actors in each movie. I GROUP BY actors.lname ||', '|| actors.fname because there were actor names that were repeated many times with the same billing position, and movie title. It looked something like this (the sample below is before I took in account of removing TV series):
Last Name, First Name Billing Postion Movie Title Keyword
1 Ingram, Jay 1 @discovery.ca space
2 Halliday, Judy 2 @discovery.ca space
3 Deacon, Gillian 3 @discovery.ca space
4 Gilbert, Jane 4 @discovery.ca space
5 Wright, Max 1 ALF space
6 Wright, Max 1 ALF space
As you can see Wright, Max is repeated.
I made the lname and fname columns into 1 column using || and splitting them by a comma then space. I accidentally discover that I can simply put the the name I want to change the column to right next to column by using single quotes, instead of haveing to do something like oldColumn AS 'newColName' (later I saw this on Piazza).
As you can see from the first 10 (I limited it to 10 or else the table would have been several pages long), not all movies have a top 5 actor. So either the movies did not have a top 5 actors for billing or the information was not entered into the data base. Also some movies have multiple of the same position. This likely means that two or more actors in the same movie got paid the same or very similar amounts (hence same "billing" position).
###Question 8: Has the number of movies in each genre changed over time? Plot the overall number of movies in each year over time, and for each genre.
```{r, cache=TRUE}
# get the number of genres per year. I SELECTed the columns year from
# table movies and the column genre from the table genres and INNER JOINed
# them on the key column idmovies. Then I INNER JOINed the column idengres
# from table movies_ genres using the key column idengres from the two tables
# genres and movies_genres. Next I LEFT JOINed series on the tables series
# and movies ON the key column idmovies. Then I set series.idmovies to NULL to
# filter out all the movies/TV shows that have a series ID. Since I want the
# number of movies in each genre changed over time, This means I want the count
# of a movie genre for each year. Using ||, which is the concatinate function
# that combines two strings together, I combined the movies.yearand genres.genre
# columns together, this way I can GROUP BY both year and genres at the same time.
# I counted all the columns because it would have been the same result if I wrote
# COUNT(genres.genre).
dbGetQuery(con, "SELECT movies.year ||' '|| genres.genre 'Movie Year and Genre',
COUNT(*) 'Total'
FROM movies
INNER JOIN movies_genres ON movies.idmovies = movies_genres.idmovies
INNER JOIN genres ON genres.idgenres = movies_genres.idgenres
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND movies.type = '3'
AND movies.year ||' '|| genres.genre IS NOT NULL
GROUP BY movies.year ||' '|| genres.genre LIMIT 10")
```
```{r, cache=TRUE}
# plot overall number of movies in each year over time and for each genre.
# Same query as above but seperated the year and genres. This is done
# so I can plot it.
genresANDYear = dbGetQuery(con, "SELECT movies.year 'Year', genres.genre 'Genre',
COUNT(*) 'Total' FROM movies
INNER JOIN movies_genres
ON movies.idmovies = movies_genres.idmovies
INNER JOIN genres
ON genres.idgenres = movies_genres.idgenres
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
GROUP BY genres.genre")
```
```{r, cache=TRUE, message=FALSE}
#remove the first row because it is NA, so it does not have a genre
# OR year. This does not add information to our plot (also having NA's
# makes it not possible to plot)
genresANDYear = na.omit(genresANDYear)
library(ggplot2)
#plot the genres over time (in years) using a stacked bar plot
question8Plot = ggplot(data = genresANDYear,
aes(x = Year, y = Total, fill = Genre)) +
ggtitle("Question 8. Changes in Genres Over the Years") +
ylab("Total Count") +
geom_bar(position = 'stack', stat = 'identity') +
scale_x_continuous(limits = c(1970, 2015))
```
Originally I my query was "SELECT movies.year ||' '|| genres.genre 'Movie Year and Genre', movies_ genres.idgenres 'Genre ID', COUNT(*) 'Total' FROM movies INNER JOIN movies_ genres ON movies.idmovies = movies_ genres.idmovies INNER JOIN genres ON genres.idgenres = movies_ genres.idgenres GROUP BY movies.year ||' '|| genres.genre LIMIT 10", but then I realized I could remove "movies_ genres.idgenres 'Genre ID'" because that information was not helpful; however, I do still need to keep both the INNER JOINs in my query because the only way to connect the table movies to the table genres is through the table movies_genres.
I made a new query that is very similar to my original query because although it is much easier to read the table with year and genres combined, it is not possible for me to plot the year and genres as a single variable. I seperate them and I GROUPed BY genres.genre. If I were to GROUP BY movie.year, the genres would not be stacked on top of each other by year, which would make my bar plot appear very crowded and hard to understand.
In the stacked bar plot, we can see that there were certain genres that were popular at different times and that the genre changes over time. My plot starts at 1970's because all the years earlier than 1970's did not have enough data (for example, it would have year 1910 with the genre News with a count of 1). The data was too little/not significant enough to be shown up on the plot espically since some genres go up to 10,000 counts. Genres that do not show up on the plot in certain years are assumed to have a count of 0.
###Question 9: Who are the actors that have been in the most movies? List the top 20.
```{r, cache=TRUE}
# To get the top 20 actors in the most amount of movies, I INNER JOINed
# the two tables actors and acted_in ON the key column idactors. Next I
# did a LEFT JOIN on series and movies on the key column idmovies, and
# set the WHERE clause series.idmovies IS NULL so that way I can filter
# out all the tv series from movies (WHERE movie.type = 3). I took
# the columns lname and fname from actors and the column character from
# acted _in. Next I created a COUNT() column that couned the DISTINCT
# acted _in.idmovies values so that way it would only count . I GROUPed BY
# idactors so that way I would not have dubplicate actors. Then I ORDERed
# BY the COUNT() of characters. Since ORDER BY goes from lowest to highest,
# I added the field DESC so it would go from highest to lowest so that
# way I can simply just take the first 20 as the top 20 actors who
# have been in the most movies.
dbGetQuery(con, "SELECT actors.lname ||', '|| actors.fname 'Last Name, First Name',
COUNT(DISTINCT(acted_in.idmovies)) '# of Movies Acted In'
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON movies.idmovies = acted_in.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND actors.lname IS NOT NULL
AND movies.type = 3
GROUP BY acted_in.idactors
ORDER BY COUNT(DISTINCT(acted_in.idmovies)) DESC
LIMIT 20")
```
```{r, cache=TRUE}
#redo in R syntax
releventCol9 = dbGetQuery(con, "SELECT actors.lname, actors.fname,
acted_in.idmovies, movies.type
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON movies.idmovies = acted_in.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL")
```
```{r, cache=TRUE}
# get only the rows where the column type is equal to 3
releventCol9 = subset(releventCol9, releventCol9$type == 3)
# combine column lname and fname rogether seperated by a
# comma followed by a space (found on stack overflow)
releventCol9 = within(releventCol9,
lfname <- paste(releventCol9$lname,
releventCol9$fname, sep = ', '))
# remove the columns lname and fname since we already have
# a combined column
releventCol9$lname = releventCol9$fname = NULL
#remove all duplicates from both the columns lfname and idmovies
releventCol9 = releventCol9[!duplicated(releventCol9[c("lfname","idmovies")]),]
```
```{r, cache=TRUE, message=FALSE}
library(plyr)
# go through every row in releventCol9 and group all the
# unique lfname and count how many times each unique lfname appears
countedRows = ddply(releventCol9,.(lfname), nrow)
# rename the column V1 to totalMovies
countedRows$totalMovies = countedRows$V1
# remove the column V1 by setting it to NULL
countedRows$V1 = NULL
# sort the rows by totalMovies from largest to smallest value
sortedRows = countedRows[order(countedRows$totalMovies, decreasing = TRUE),]
# get the top/first 20, I get the same output as I do in SQL
head(sortedRows, 20)
```
I had to add WHERE...AND actors.lname IS NOT NULL or else I would have a row of NA's that looked like this in my output:
Last Name, First Name # of Movies Acted In
1 Blanc, Mel 1090
2 <NA> 987
If I did put DISTINCT in COUTNT(), it would start showing TV series actors/hosts as the top actors that have been in the most TV shows. This is not right because a TV show is not a movie. However, I did filter my results by doing s LEFT JOIN on series WHERE series.idmovie IS NULL, so I am not sure why it is not filtering it. A sample result below:
Last Name, First Name # of Movies Acted In
1 Vorderman, Carol 5544
2 Whiteley, Richard 4533
3 Lang, Katherine Kelly 3652
I checked a popoular TV series to see what is going on and I noticed that TV series have both a series ID and a movie ID, which should not affect the filtering (since I want to filter any TV show with a movie ID). Here is a sample of what I'm looking at:
idseries idmovies name season number
1 49476 56137 Wheel of Fortune 5 5
I am really confused about what this is occuring when series.idmovies IS NULL worked fine for all the other problems I did. I thought maybe I skipped a table while JOINing (I did this earlier and it messed up my results), but I didn't. I really can't explain what is going on here. But doing DISTINCT COUNT() fixed the issue.
I tried my best to do questions 9-12 in R with as little SQL as possible. I had to do some parts in SQL or else R would return back an error saying that the database is too large. I started out doing the problems with small subsets of each table, but in order to comapre my results with my query I had to take get the entire table, and this is where I started running into problems with memory. I really wished the data base was smaller (for both the original and the new database) because it is too large and I feel that we would learn SQL just as well with a smaller database.
When redoing the problem in R, I used dbGetQuery to help me get my relevent columns and setting a LIMIT of 10000 as a test run. I had to use SQL to help me get the relevent columns I needed because when I attempted to join in R using the function merge(), I would get an error message saying that I am out of memory.
###Question 10: Who are the actors that have had the most number of movies with "top billing", i.e., billed as 1, 2 or 3? For each actor, also show the years these movies spanned?
```{r, cache=TRUE}
# gets the actors that have most number of movies with "top billing".
# First join the tables actors and acted_in to get the columns last
# and first name of the actors and the column billing_position. Next
# join the tables movies and acted_in to get the column years from movies.
# Then COUNT the columns to get number of movies (COUNT(*column*) just
# counts how many non-null values are in that column, but COUNT(*) counts
# every column). After JOINing, I filter my results by using the WHERE clause.
# I only want the top billing positions 1, 2, 3, so billing_position
# must be equal to or less than 3. I do not want any non-null values for
# the actor's name because that does not tell me anything. Finally,
# I want to GROUP BY the actor's full name and ORDER BY the COUNT
# (number of movies) and DESC which goes from highest to lowest, and
# finally, I want to ORDER BY billing position after COUNT(*)
dbGetQuery(con,
"SELECT actors.lname ||', '|| actors.fname 'Last Name, First Name',
COUNT(DISTINCT(movies.title)) '# Of Movies',
MIN(movies.year) || '-' || MAX(movies.year) 'Years Spanned'
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON movies.idmovies = acted_in.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND acted_in.billing_position <= 3
AND actors.lname IS NOT NULL
GROUP BY actors.lname ||', '|| actors.fname
ORDER BY
COUNT(DISTINCT(movies.title))
DESC, acted_in.billing_position
LIMIT 10")
```
```{r, cache=TRUE}
releventCol10 = dbGetQuery(con, "SELECT actors.lname,actors.fname,
movies.year, movies.idmovies, acted_in.billing_position
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON movies.idmovies = acted_in.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL")
```
```{r, cache=TRUE}
#get the rows where billing_position is equal to or less than 3
releventCol10 = subset(releventCol10, releventCol10$billing_position <= 3)
#combine column lname and fname rogether seperated by a comma followed by
#a space (found on stack overflow)
releventCol10 = within(releventCol10,
lfname <- paste(releventCol10$lname,
releventCol10$fname, sep = ', '))
#remove the columns lname and fname since we already have a combined column
releventCol10$lname = releventCol10$fname = NULL
#remove all duplicates from both the columns lfname and idmovies
releventCol10 = releventCol10[!duplicated(releventCol10[c("lfname","idmovies")]),]
#group by lfname and find the minimum year
topActors = aggregate(year ~ lfname, releventCol10, min)
#rename the column year to minYear
topActors$minYear = topActors$year
```
```{r, cache=TRUE}
#remove the column year
topActors$year = NULL
#get the max year while grouping by lfname, set this to a column in topActors
topActors$maxYear = aggregate(year ~ lfname, releventCol10, max)$year
#group by lfname and get the count (length) of years, assign this to the
#column totalMovies
topActors$totalMovies = aggregate(year ~ lfname, releventCol10, length)$year
#sort by the value of totalMovies from largest to smallest
topActors = topActors[order(topActors$totalMovies, decreasing = TRUE),]
#display the top/first 10, I get the same output as in do in SQL
head(topActors, 10)
```
When I did not add in "WHERE actors.lname IS NOT NULL", I kept on getting a row where the actor's name was NA, but he/she had the highest movie count. Here is an example below
actors.lname ||', '|| actors.fname billing_position COUNT(*) MIN(movies.year) || '-' || MAX(movies.year)
1 <NA> 2 54980 5-2017
2 Vorderman, Carol 1 4655 1982-2013
This wouldn't make sense and this does not tell me anything about who has the highest amount of movies and highest billing. The NA's are likely a total of all the people without last names/first names. This means that about 55000 columns did not have an actor last/first name.
I tried changing my query so that I did not have to ORDER BY acted_ in.billing_ position; however, when I took out DESC for acted_ in.billing_position, it cause the NA's values to show up, which is not what I want. Then I tired adding "WHERE acted _in.billing _position IS NOT NULL", I ended up getting billing positions that were 14 or 10, which is too low of a billing position. I took out "ORDER BY COUNT(*)" and kept "ORDER BY acted _in.billing _position"; however, this lead to showing all the billing positions as 1's but it did not show the max number of movies. I realized it was best to just ORDER BY COUNT() and then by billing position.
When redoing this problem in R, I used dbGetQuery to help me get my relevent columns and setting a LIMIT of 10000 as a test run. I could not get the tables itself because after I did, I could not do merge() to help me join tables and filter out the tables, such as billing position, in R. When I did, I got back an error message that I ran out of memory. So I used SQL to help me get my relevent columns by using JOIN.
###Question 11: Who are the 10 actors that performed in the most movies within any given year? What are their names, the year they starred in these movies and the names of the movies?
I had to set the limit to 1 or else my table would be 5 pages long. Output of 1 already fills up an entire page because of all the movie titles. The first output shows that my SQL does display all the movies the actor performed in that year. I showed the top 10 actors below by copy and pasting my output. Writing another query to output it would have taken a longer for R to knit.
```{r, cache=TRUE}
# I SELECTed the columns lname and fname from the table actors.
# I INNER JOINed actors and acted_in ON the key column idactors.
# Although I did not take any columns from acted _in, I still
# needed to INN JOIN actors and acted _in together in order to
# connect/JOIN the table movies. Next I SELECTed the column year
# and title from the table movies and INNER JOINed the table
# acted _in and movies ON the key column idmovies. Now that I have
# the columns I need, I needed to filter my results. I did a
# LEFT JOIN on series by the key column idmovies and I set
# series.idmovies to IS NULL because I want to remove all the tv series
# from the movie table (where type = 3). I used the WHERE clause to only
# get movies that are type 3, and then I grouped by the actor's last name,
# first name because there are likely repeated actor names, but in
# different movies, so GROUPing by actor names will allow me to see
# their total amount performed. I did COUNT(DISTINCT(movies.title)),
# so that way movie title's would not repeat or else the list would be
# very long. COUNting the movies.title only counts the movies that have
# already has the series filtered out. Finally, I ORDERed BY COUNT(DISTINCT(movies.title))
# and I set it to DESC so it goes from highest to lowest.
# That way I can simply LIMIT and take the first 10. I used
# GROUP_ CONCAT() so I could display all the movie titles while also
# GROUPing by the actor's last and first name. I had to use the function
# GROUP_CONCAT() because using GROUP() condenses all the titles and displays
# only one of the many titles that were grouped together, instead of
# displaying all the titles per group.
dbGetQuery(con, "SELECT actors.lname ||', '|| actors.fname 'Last Name, First Name',
movies.year 'Year',
COUNT(DISTINCT(movies.title)) '# Performed',
GROUP_CONCAT(DISTINCT(movies.title)) 'Movie Titles'
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON movies.idmovies = acted_in.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND movies.type = 3
AND movies.year IS NOT NULL
AND actors.lname IS NOT NULL
GROUP BY actors.lname ||', '|| actors.fname, movies.year
ORDER BY COUNT(DISTINCT(movies.title))
DESC
LIMIT 1")
```
```{r, cache=TRUE}
#to get just the top 10 without all the movie titles (because it's too long)
dbGetQuery(con, "SELECT actors.lname ||', '|| actors.fname 'Last Name, First Name',
movies.year 'Year',
COUNT(DISTINCT(movies.title)) '# Performed'
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON movies.idmovies = acted_in.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL
AND movies.type = 3
AND movies.year IS NOT NULL
AND actors.lname IS NOT NULL
GROUP BY actors.lname ||', '|| actors.fname, movies.year
ORDER BY COUNT(DISTINCT(movies.title))
DESC
LIMIT 10")
```
```{r, cache=TRUE}
#get the relevent columns needed to redo in R because when I
# attempt to merge() tables together, R keeps on saying that
# the database is too large
releventCol11 = dbGetQuery(con, "SELECT actors.lname, actors.fname,
movies.year, movies.title, movies.type
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON movies.idmovies = acted_in.idmovies
LEFT JOIN series
ON movies.idmovies = series.idmovies
WHERE series.idmovies IS NULL")
```
```{r, cache=TRUE}
# get years that are between 2016 and 1885 (exclusive)
releventCol11 = subset(releventCol11, releventCol11$year > 1885 | releventCol11$year < 2016)
#get only the movie.type = 3
releventCol11 = subset(releventCol11, releventCol11$type == 3)
# combine column lname and fname rogether seperated by a
# comma followed by a space (found on stack overflow)
releventCol11 = within(releventCol11,
lfname <- paste(releventCol11$lname,
releventCol11$fname, sep = ', '))
# remove the columns lname and fname since we already
# have a combined column
releventCol11$lname = releventCol11$fname = NULL
# remove all duplicates from both the columns lfname and title
releventCol11 = releventCol11[!duplicated(releventCol11[c("lfname","title")]),]
# group by the columns lfname and year, and get the
# length(which is basically the count) of the column title
moviesPerYear = aggregate(title ~ lfname + year, releventCol11, length)
#rename the column title to totalMovies
moviesPerYear$totalMovies = moviesPerYear$title
# remove the column title
moviesPerYear$title = NULL
# sort the rows by the column totalmovies
# from largest to smallest value
moviesPerYear = moviesPerYear[order(moviesPerYear$totalMovies, decreasing = TRUE),]
```
```{r, cache=TRUE}
# filtered down to top 10 to keep tables small
top10ActedYears = head(moviesPerYear, 10)
# aggregate takes likes values and combines them,
# group by lfname and year, title is the column that paste will work on,
# so paste will display all the movie titles
movieTitles = aggregate(title ~ lfname + year, releventCol11, paste,sep=", ")
# merge function will automatically look for matching column
# names and combine rows whome these names match.
# parameters are data frames
top10ActedYearsCombined = merge(movieTitles, top10ActedYears)
#sort rows by the column totalMovies from largest to smallest value
top10ActedYearsCombined = top10ActedYearsCombined[order(top10ActedYearsCombined$totalMovies,
decreasing = TRUE),]
# reorder the columns by index so it looks similar to my SQL
# output and display top10ActedYearsCombined, I get the
# same output as I do in SQL
top10ActedYearsCombined = top10ActedYearsCombined[,c(1,2,4,3)]
head(top10ActedYearsCombined, 1)
top10ActedYearsCombined[,c(1,2,3)]
```
The 10 actors the performed in the most movies are posted above with their corresponding year and movie title.
The output has the same actors, but are organized slightly different. It might be the way I am filtering my result in R is different from how I am filtering it using SQL.
I later added WHERE actors.lname IS NOT NULL. It wouldn't make sense to keep the NA's there because if I do not have the actor's name, it does not tell me anything or answers the question about who the top 10 actors are. This is likely a GROUPed amount of NA's that are in the data set.
When redoing it in R, I used dbGetQuery to help me get my relevent columns and setting a LIMIT of 10000 as a test run. When I tried to merge() the tables, it runs out of memory and R would time out and stop running, so I had to use SQL to JOIN the tables.
###Question 12: Who are the 10 actors that have the most aliases (i.e., see the aka_names table).
```{r, cache=TRUE}
# get the 10 actors with the most amount of aliases (nicknames/aka names).
# I first took the columns lname, fname and name from the tables actors
# and aka_names. I INNER JOINed the two tables actors and aka_names so I
# can get only the actors that have aliases. I made a column that counted
# all the aliases of each actor since I GROUPed by the actor's lname and
# fname. Next I ordered by actors with the most amount of aliases
# (highest to lowest value)
dbGetQuery(con, "SELECT actors.lname ||', '|| actors.fname 'Last Name, First Name',
COUNT(aka_names.name) '# Of Aliases'
FROM actors
INNER JOIN aka_names
ON actors.idactors = aka_names.idactors
WHERE actors.lname IS NOT NULL
GROUP BY actors.lname ||', '|| actors.fname
ORDER BY COUNT(aka_names.name)
DESC
LIMIT 10")
```
```{r, cache=TRUE}
# redo in R
releventCol12 = dbGetQuery(con, "SELECT actors.lname,actors.fname, aka_names.idaka_names
FROM actors
INNER JOIN aka_names
ON actors.idactors = aka_names.idactors")
# combine column lname and fname together,
# seperated by a comma followed by a space
releventCol12 = within(releventCol12,
lfname <- paste(releventCol12$lname,
releventCol12$fname, sep = ', '))
#remove the columns lname and fname since we already have a combined column
releventCol12$lname = releventCol12$fname = NULL
```
```{r, cache=TRUE, message=FALSE}
library(plyr)
# get the counts of each unique lfname that repeats,
# do this for all rows (hence nrow). I'm not 100% sure how
# this line of code works. I found it on Stacked Overflow,
# but it does what I want it to do.
countedRows = ddply(releventCol12,.(lfname), nrow)
#rename the column V1 to totalAliases
countedRows$totalAliases = countedRows$V1
#remove the V1 column by setting it to NULL
countedRows$V1 = NULL
#sort the rows by totalAliases from largest to smallest
sortedRows10 = countedRows[order(countedRows$totalAliases, decreasing = TRUE),]
#get the top/first 10, the output is the same as the query I wrote in SQL
head(sortedRows10, 10)
```
To get the top 10 actors with the most number of aliases, I first took the columns lname and fname from the table actors and made a column called COUNT, that COUNTed the amount of aliases. I INNER JOINed the tables aka_names and actors ON the key column idactors. I filtered by results by setting actors.lname IS NOT NULL or else I would get a row of NA's like this:
Last Name, First Name # Of Aliases
1 <NA> 16789
2 Franco, Jess 77
I needed to remove the NA's because NA is not an actor's name and it does not tell me anything about who has the highest number of aliases. I GROUPed BY actors' first and last name (remove duplicates) and I ordered by the number of aliases.
When redoing it in R, I used dbGetQuery to help me get my relevent columns and setting a LIMIT of 10000 as a test run. If I didn't use SQL to help me get the relevent columns I would run out of memory when I attempt to merge() in R.
###Question 13: Networks: Pick a (lead) actor who has been in at least 20 movies. Find all of the other actors that have appeared in a movie with that person. For each of these, find all the people they have appeared in a movie with. Use this to create a network/graph of who has appeared with who.
```{r, cache=TRUE}
# for actor Andrew Garfield, he appeared in 53 movies (movies only
# excluding TV series); however, IMDB also includes award shows such
# as Much Music Video Music Awards andFilm Independent SPirit Awards
# as a movie
# To get all the second and third degree matches for Andrew Garfield,
# I had to first find all the movies that he has been in. Next I find
# all the people who have been in the same movies as him and this can be
# done by finding all the actors who have been in the same movie ID (easier
# than finding the actors by movie names/title). This is done by checking the
# same tables again and getting the first and last names of the actors. Since
# I cannot INNER JOIN the same table names repeated in a single query, I
# renamed the tables to actors2, acted_in2, series2 and movies2. To find
# people who are three degrees away from Andrew Garfield, I do the same
# thing as I did for finding the people who are two degrees from Andrew
# Garfield, instead I change all the 2's to 3's. Next I filtered my results
# by removing all the movies/TV shows with a series ID and only took actors
# that had a billing position of 1, 2 or 3. To remove NA's from my table,
# I set the actor's lname to IS NOT NULL. Then I ordered my table my movie ID.
actedWithDF = dbGetQuery(con, "SELECT actors.lname ||', '||actors.fname 'First Degree',
actors2.lname ||', '||actors2.fname 'Second Degree',
actors3.lname ||', '||actors3.fname 'Third Degree'
FROM actors
INNER JOIN acted_in
ON actors.idactors = acted_in.idactors
INNER JOIN movies
ON acted_in.idmovies = movies.idmovies
LEFT JOIN series
ON series.idmovies = movies.idmovies
INNER JOIN actors actors2
ON acted_in2.idactors = actors2.idactors
INNER JOIN acted_in acted_in2
ON movies.idmovies = acted_in2.idmovies
INNER JOIN movies movies2
ON acted_in2.idmovies = movies2.idmovies
LEFT JOIN series series2
ON series2.idmovies = movies2.idmovies
INNER JOIN actors actors3
ON acted_in3.idactors = actors3.idactors
INNER JOIN acted_in acted_in3
ON movies2.idmovies = acted_in3.idmovies
WHERE series.idmovies IS NULL
AND series2.idmovies IS NULL
AND acted_in2.billing_position <= 3
AND acted_in3.billing_position <= 3
AND LOWER(actors.fname) = 'andrew'
AND LOWER(actors.lname) = 'garfield'
AND actors2.lname IS NOT NULL
AND actors3.lname IS NOT NULL
ORDER BY movies.idmovies")
```
```{r, cache=TRUE}
# get all the matches with the first degree and second degree, so remove
# the third column by setting it to null
firstDegree = actedWithDF
firstDegree$`Third Degree` = NULL
# get all the matches with the second degree and third degree,
# so remove the first column by setting it to null
secondDegree = actedWithDF
secondDegree$`First Degree` = NULL
# rename the first columns in the data frame firstDegree
# and secondDegree to 'first' in order to rbind() them
colnames(firstDegree)[1] = colnames(secondDegree)[1] = 'first'
# rename the second columns in the data frame firstDegree and
# secondDegree to 'second' in order to rbind() them
colnames(firstDegree)[2] = colnames(secondDegree)[2] = 'second'
```
```{r, cache=TRUE, message=FALSE}
library(igraph)
#row bind (stack them from tail to head/head to tail) them so
# now we only have two columns, but have all the possible pairs
# between first & second degree and second & third degree
getAllMatches = graph.data.frame(rbind(firstDegree, secondDegree), directed = FALSE)
# use the function simplify to remove all duplicate matches
question11Graph = plot(simplify(getAllMatches))
```
Since igraph only plots matches by pairs, it completely ignored my third column (three degrees away from Garfield). So in order for igraph to plot all the matches, I created two data frames with the first & second degrees and second & third degrees. Then I renamed the columns so both data frames had the same column names or else I can't rbind() them. Finally stacked these from tail to head/head to tail by using rbind() and now I have all the matches in two columns.
There are many duplicate matches when I run rbind(firstDegree, secondDegree) because, for example, let's say we have an actor called Main and in the second degree matches, we have Samual Jackson and Jackson has been in a movie with Scarlett Johansson and Chris Evans, which are three degrees away from Main. So this means we will have many matches that look like this: Main -- Samual Jackson -- Scarlett Johansson; Main -- Samual Jackson -- Chris Evans. So this means we will have many first and second degree matches that look like: Main -- Samual Jackson; Main -- Samual Jackson because of all the different third degree matches. Using simplify() will help remove multiple matches/lines on the graph and will help make the graph look cleaner and less convoluted with lines that go back and forth between the same two people.
In my igraph, Garfield is in the center since he is first degree (according to Wiki, a Bacon number of 0). We can see all the second degrees (Bacon Number 1) connection Garfield has and some of his second degree connections are connected to his other connections, which are likely Garfield's third degree (Bacon Number 2) connection. Since I filtered my result to only have actors with billing position to 1, 2 or 3 (Duncan said on Piazza we can do this), not all the billing information was stored in IMDB (or there was no billing info) so this is why it may appear like the igraph is small even though there are much more actors in the movies than seen on the igraph. There is a giant cluster of names all group together, I am guessing it is because most of those movie actors are connected to each other in a way and are all connected to Garfield.
###Question 14: What are the 10 television series that have the most number of movie stars appearing in the shows?
Here is my attempt at 14. It doesn't work. When I ran the query, I let it run for 6 hours and it was still going, so I stopped it. I had to comment it out or else it would not knit (or rather it would take forever to knit).
```{r, cache=TRUE}
# dbGetQuery(con, "SELECT name,
# COUNT(DISTINCT(idactors))
# FROM (SELECT series.name, acted_in.idactors
# FROM acted_in
# INNER JOIN series
# ON series.idseries = series.idseries
# INNER JOIN acted_in acted_in_movies
# ON acted_in.idactors = acted_in_movies.idactors
# INNER JOIN movies
# ON acted_in_movies.idmovies = movies.idmovies
# LEFT JOIN series series_movies
# ON series_movies.idmovies = movies.idmovies
# WHERE series_movies.idmovies IS NULL
# GROUP BY series.idseries, acted_in.idactors)
# GROUP BY name
# ORDER BY COUNT(DISTINCT(idactors))
# DESC
# LIMIT 10")
```
I think some possible TV series that may have the most number of movie stars appear in the shows are: Saturday Night Live are Celebrity Name Game