-
-
Notifications
You must be signed in to change notification settings - Fork 50
Expand file tree
/
Copy path40-report.Rmd
More file actions
1113 lines (832 loc) · 65.4 KB
/
Copy path40-report.Rmd
File metadata and controls
1113 lines (832 loc) · 65.4 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Reporting Your Analysis
This chapter provides a **concise, reproducible workflow** for reporting a data analysis in business and social science contexts. We demonstrate end-to-end reporting elements you'd typically see in a professional manuscript (marketing, finance, management), complete with diagnostic checks, robust/clustered standard errors, model comparison, and high-quality tables/figures.
------------------------------------------------------------------------
## Recommended Structure
### Phase 1: Exploratory Data Analysis (EDA)
**Understanding Your Data Landscape**
Before embarking on any modeling endeavor, immerse yourself thoroughly in your data. Exploratory data analysis serves as the foundation upon which all subsequent analysis rests. This critical phase allows you to develop intuition about your dataset, identify potential challenges, and formulate preliminary hypotheses that will guide your modeling decisions.
**Visual Exploration and Data Visualization**
Begin by creating a comprehensive suite of visualizations that reveal the character and structure of your data. Univariate plots such as histograms, density plots, and boxplots illuminate the distribution of individual variables, revealing whether they follow normal, skewed, bimodal, or other distribution patterns. These visualizations immediately expose the presence of extreme values and help you understand the central tendency and spread of each variable.
For continuous variables, construct detailed histograms with appropriate bin widths to capture the true shape of the distribution. Overlay kernel density estimates to smooth out the discrete nature of histograms and reveal underlying patterns. Complement these with boxplots that succinctly display the five-number summary while making outliers immediately visible.
For categorical variables, develop bar charts and frequency tables that show the distribution of observations across categories. Pay particular attention to class imbalance, as severely imbalanced categories can create challenges for certain modeling approaches and may require special handling techniques such as stratified sampling or synthetic minority oversampling.
Transition next to bivariate and multivariate visualizations that expose relationships between variables. Scatter plots reveal correlations, non-linear relationships, and interaction effects between continuous variables. When examining the relationship between a continuous outcome and categorical predictors, construct side-by-side boxplots or violin plots that simultaneously display distribution shape and central tendency across groups.
Correlation matrices presented as heatmaps provide an at-a-glance understanding of linear relationships among all continuous variables in your dataset. Use color gradients thoughtfully to make strong positive and negative correlations immediately apparent. Augment simple correlation coefficients with scatter plot matrices that allow you to visually inspect the nature of each pairwise relationship.
For more complex multivariate patterns, consider dimension reduction techniques such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE). While these methods will be explored more rigorously later, preliminary visualizations in reduced dimensional space can reveal clustering, separation between groups, or other high-dimensional structure that would otherwise remain hidden.
**Preliminary Statistical Results**
Complement your visual exploration with descriptive statistics that quantify the properties you've observed graphically. Calculate measures of central tendency including means, medians, and modes for each variable. Assess spread through standard deviations, interquartile ranges, and ranges. For skewed distributions, report robust statistics that are less sensitive to extreme values.
Construct detailed contingency tables for categorical variables, including both counts and proportions. Calculate marginal and conditional distributions to understand how categories relate to one another. For key relationships of interest, compute preliminary effect sizes or correlation coefficients to quantify the strength of associations.
Perform initial hypothesis tests where appropriate, but interpret these exploratory results with appropriate caution. At this stage, you are generating hypotheses rather than testing pre-specified ones, so traditional significance thresholds should be applied conservatively. Consider adjusting for multiple comparisons if you conduct numerous exploratory tests, or better yet, clearly distinguish between confirmatory and exploratory findings in your narrative.
**Identifying Interesting Patterns, Structure, and Features**
As you explore your data, remain vigilant for unexpected patterns that might inform your modeling strategy or reveal important substantive insights. Look for evidence of subgroups or clusters within your data that might suggest the need for hierarchical models, mixture models, or stratified analyses. Notice whether relationships between variables appear consistent across the full range of the data or if they change in character at certain thresholds.
Temporal patterns deserve special attention if your data have any time-series component. Plot variables across time to identify trends, seasonality, or structural breaks that might violate independence assumptions or require specialized time-series modeling approaches. Even in cross-sectional data, consider whether unobserved temporal factors might have introduced systematic patterns.
Geographic or spatial patterns should similarly be explored if your data have spatial attributes. Map-based visualizations can reveal spatial autocorrelation or clustering that standard models might miss. If present, such patterns may necessitate spatial statistical methods that explicitly model dependence structures.
Pay attention to the relationship between variance and mean across groups or conditions. Heteroscedasticity, where the variability of your outcome changes systematically with predictor values, will violate key assumptions of many standard models and may require variance-stabilizing transformations or more flexible modeling frameworks.
**Outlier Detection and Characterization**
Devote substantial attention to identifying and understanding outliers, which are observations that differ markedly from the overall pattern in your data. Begin with univariate outlier detection using methods such as the $1.5 \times IQR$ rule for boxplots, which flags points falling more than 1.5 times the interquartile range beyond the first or third quartile. For normally distributed data, consider threshold rules based on standard deviations, such as flagging observations more than three standard deviations from the mean.
Extend your outlier analysis to the multivariate space, where observations that appear unremarkable in any single dimension may nonetheless be anomalous in their combination of values. Mahalanobis distance measures how far each observation lies from the center of the multivariate distribution, accounting for correlations between variables. Cook's distance and other influence diagnostics, while typically associated with model diagnostics, can also be calculated at this exploratory stage to identify observations that might exert disproportionate influence on subsequent analyses.
Crucially, resist the temptation to automatically discard outliers. Instead, investigate each carefully to understand its origin and nature.
- Is it a data entry error that should be corrected?
- Is it a legitimate but rare event that contains valuable information?
- Does it represent a different population that should be analyzed separately?
Document your decisions transparently, presenting results both with and without questionable observations when appropriate, so readers can assess the robustness of your conclusions.
Consider the domain context when evaluating outliers. In some fields, extreme values may be the most scientifically interesting observations, while in others they may represent measurement errors or irrelevant anomalies. Consult with subject matter experts to properly interpret unusual observations and make informed decisions about their treatment.
### Phase 2: Model Selection and Specification
**Articulating Model Assumptions**
Every statistical model rests on a foundation of assumptions, and making these explicit is essential for proper interpretation and assessment of your results. Begin by clearly stating the distributional assumptions your model makes about the outcome variable. Does your model assume normally distributed errors, or are you working within a generalized linear model framework that allows for binomial, Poisson, or other distributional families?
Detail the assumptions about the relationship between predictors and outcome. Most commonly, models assume linearity in parameters, meaning that the expected outcome changes by a constant amount for each unit change in a predictor (possibly after appropriate transformation or link function). If your model permits non-linear relationships through polynomial terms, splines, or other flexible forms, explain the functional form you've adopted and why.
Independence assumptions warrant careful consideration. Standard regression assumes that observations are independent of one another, but this is frequently violated in practice by clustering (students within schools, measurements within individuals), spatial dependence, or temporal autocorrelation. If such dependencies exist in your data structure, acknowledge them explicitly and describe how your model accounts for them, whether through mixed effects, robust standard errors, or specialized correlation structures.
Homoscedasticity, the assumption of constant error variance, should be stated and later verified. Many standard inferential procedures assume that the variance of your outcome does not depend on predictor values or fitted values, though weighted regression or generalized linear models can accommodate heteroscedastic errors when this assumption is untenable.
Additional assumptions relevant to specific methods should be documented. For causal inference, state clearly what identification assumptions are necessary for causal interpretation, such as ignorability, no unmeasured confounding, or valid instrumental variables. For time series models, describe stationarity assumptions. For machine learning approaches, discuss assumptions about the relationship between training and test data distributions.
**Justifying Your Modeling Approach**
After articulating assumptions, provide a compelling rationale for why your chosen model is the most appropriate tool for addressing your research question. Connect the model selection directly to your scientific objectives. If your goal is prediction, emphasize the model's predictive performance and its ability to generalize to new data. If your goal is inference about specific parameters, justify how the model structure allows for valid and efficient estimation of those parameters.
Consider the nature of your outcome variable in justifying your approach. Continuous outcomes measured on an interval or ratio scale typically call for linear regression or its extensions, while binary outcomes necessitate logistic regression or other classification methods. Count data often require Poisson or negative binomial regression, while time-to-event data demand survival analysis techniques. Ordinal outcomes merit specialized methods that respect the ordered nature of categories.
Discuss how your model handles the specific challenges present in your data. If you have high-dimensional data with more predictors than observations, explain your choice of regularization method such as ridge, lasso, or elastic net regression. If multicollinearity is a concern, describe how your approach mitigates its effects, whether through variable selection, principal component regression, or Bayesian methods with informative priors.
Address computational considerations when relevant. Some modeling approaches that are theoretically ideal may be computationally intractable for large datasets, while others scale efficiently. If you've made tradeoffs between statistical optimality and computational feasibility, acknowledge this transparently and describe any steps taken to validate that the chosen approach provides adequate performance.
Compare your chosen model to reasonable alternatives, explaining why you've selected one approach over others. This comparative discussion demonstrates that you've thoughtfully considered multiple options rather than defaulting to a familiar method. You might compare parametric versus non-parametric approaches, frequentist versus Bayesian frameworks, or simple versus complex model structures, weighing their relative advantages and limitations in your specific context.
**Considering Interactions, Collinearity, and Dependence**
Interaction effects represent situations where the effect of one predictor on the outcome depends on the value of another predictor. During model specification, consider whether substantive theory suggests important interactions, and explore whether your exploratory analysis revealed evidence of effect modification. Interaction terms can substantially improve model fit and provide crucial scientific insights, but they also increase model complexity and can make interpretation challenging.
When including interactions, think carefully about whether to also include the constituent main effects (you almost always should, to maintain the principle of marginality), and consider centering continuous variables before forming interaction terms to reduce collinearity and aid interpretation. Visualize predicted values across different combinations of interacting variables to help readers understand these complex relationships.
Multicollinearity, the presence of strong linear relationships among predictors, can create serious problems for parameter estimation and interpretation. Severely collinear predictors lead to unstable coefficient estimates with inflated standard errors, making it difficult to isolate the individual effect of any single predictor. Assess collinearity using variance inflation factors (VIF), with values exceeding 5 or 10 typically indicating problematic levels.
When high collinearity is detected, several remedial strategies exist. You might remove one of a highly correlated pair of predictors based on theoretical considerations or measurement quality. Alternatively, combine collinear predictors into composite scores or indices that capture their shared information. Regularization methods such as ridge regression explicitly address collinearity by shrinking coefficient estimates. In some cases, severe collinearity simply reflects reality and must be acknowledged as a limitation, particularly when you need to include certain predictors for theoretical completeness despite their intercorrelation.
Dependence structures in your data require special modeling approaches. For clustered data, where observations are nested within groups, mixed effects (multilevel or hierarchical) models partition variance into within-group and between-group components and account for the correlation among observations from the same cluster. Specify both fixed effects that represent average relationships and random effects that allow these relationships to vary across clusters.
For longitudinal data with repeated measurements on the same units, consider growth curve models, generalized estimating equations (GEE), or transition models depending on your research question. Each approach handles the correlation among repeated measures differently and allows for different types of inference, so select the framework that best matches your substantive goals.
Spatial or network dependence calls for specialized models that explicitly represent connections between observations. Spatial autoregressive models, geographically weighted regression, or network autocorrelation models may be appropriate depending on the structure of spatial or social relationships in your data.
### Phase 3: Model Fitting and Diagnostic Assessment
**Evaluating Overall Model Fit**
After estimating your model, systematically evaluate how well it fits the observed data. Begin with summary statistics that quantify the proportion of variance explained. For linear models, the coefficient of determination ($R^2$) indicates what fraction of outcome variance is captured by your predictors, while adjusted $R^2$ penalizes model complexity to discourage overfitting. Recognize that while $R^2$ provides a useful summary, it doesn't tell the whole story about model adequacy, and even low $R^2$ values can be scientifically important if they represent relationships that are difficult to predict.
For generalized linear models, report appropriate pseudo-$R^2$ measures such as McFadden's, Nagelkerke's, or Tjur's $R^2$, keeping in mind that these lack the direct interpretation of classical $R^2$. Log-likelihood values and deviance statistics provide information about how well the model's probability distribution matches the data, with comparisons to null or saturated models offering context for interpretation.
Information criteria including Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) balance goodness of fit against model complexity, rewarding fit while penalizing the inclusion of additional parameters. These are particularly valuable for comparing non-nested models, though differences of less than 2-3 units are generally considered negligible. BIC penalizes complexity more heavily than AIC and tends to favor simpler models, especially with large sample sizes.
For models intended for prediction, assess predictive performance using metrics appropriate to your outcome type. For continuous outcomes, examine mean squared error, root mean squared error, or mean absolute error. For binary outcomes, consider accuracy, sensitivity, specificity, positive and negative predictive values, area under the ROC curve (AUC), and calibration metrics. Critically, evaluate predictive performance on held-out data not used for model training to obtain honest estimates of generalization performance.
Conduct formal goodness-of-fit tests where appropriate. The Hosmer-Lemeshow test for logistic regression, the deviance test for generalized linear models, or omnibus tests for model specification each provide statistical assessments of model adequacy, though remember that with large sample sizes, these tests may reject even models that fit adequately for practical purposes.
**Verifying Model Assumptions Through Residual Analysis**
Residual analysis forms the cornerstone of model diagnostics, as residuals (i.e., the differences between observed and fitted values) should exhibit certain properties if model assumptions hold. If your model is correctly specified and assumptions are satisfied, residuals should appear as random noise without systematic patterns.
Begin with residual plots that display residuals against fitted values. In a well-fitting model, this plot should show a random cloud of points with no discernible pattern, constant spread across the range of fitted values, and no systematic curvature. A funnel shape, where spread increases or decreases with fitted values, suggests heteroscedasticity. Curved patterns indicate that the assumed functional form may be incorrect and that transformations or additional predictors might improve the model.
For generalized linear models, use appropriate residuals such as deviance, Pearson, or quantile residuals rather than raw residuals, as these better approximate the expected properties under the model assumptions. Deviance residuals are particularly useful for assessing overall model fit, while Pearson residuals help evaluate the variance assumption.
Construct residual plots against each predictor variable to identify whether any individual predictor's relationship with the outcome is misspecified. Non-random patterns in these plots suggest that the predictor may require transformation, that its effect may be non-linear, or that it may interact with other variables.
Q-Q (quantile-quantile) plots compare the distribution of residuals to the theoretical distribution assumed by your model, typically the normal distribution for linear regression. Points should fall approximately along a straight diagonal line if the distributional assumption is satisfied. Systematic departures from linearity, particularly in the tails, indicate non-normality. Light-tailed distributions (fewer extreme values than expected under normality) produce S-shaped patterns, while heavy-tailed distributions (more extreme values) create inversely S-shaped patterns.
For time series or spatially structured data, examine residual autocorrelation through autocorrelation function (ACF) plots or spatial correlograms. Significant autocorrelation in residuals indicates that your model has failed to account for temporal or spatial dependence, suggesting the need for more sophisticated modeling approaches that explicitly model correlation structures.
Identify influential observations using diagnostic measures such as Cook's distance, DFBETAS, DFFITS, and leverage values. Influential points are those whose inclusion or exclusion would substantially alter model estimates or predictions. High leverage points have unusual predictor values that give them the potential for influence, while high influence points actually do substantially affect the fitted model. Investigate influential observations carefully, determining whether they represent errors, exceptional cases worthy of separate analysis, or legitimate data that should be retained.
Assess the variance inflation in parameter estimates due to collinearity by examining condition indices or variance decomposition proportions in addition to variance inflation factors. These diagnostics help you understand which specific parameters are most affected by collinearity and whether the instability is severe enough to warrant remedial action.
Test for heteroscedasticity formally using the Breusch-Pagan test, White test, or other appropriate diagnostics depending on your model type. If heteroscedasticity is detected, consider whether variance-stabilizing transformations, weighted least squares, or robust standard error estimators are appropriate remedies.
For mixed effects models, examine residuals at each level of the hierarchy. Inspect level-1 (within-group) residuals for the usual regression diagnostics, and additionally examine level-2 (group-level) residuals and random effects to assess whether higher-level assumptions are satisfied and to identify outlying clusters.
When assumption violations are detected, consider their practical severity carefully. Minor violations may have negligible impact on inference, particularly with large samples where central limit theorem properties provide robustness. Severe violations require remedy through data transformation, alternative modeling approaches, robust methods, or explicit acknowledgment as a limitation.
### Phase 4: Inference and Prediction
**Drawing Valid Statistical Inferences**
With a well-fitting model in hand, turn your attention to statistical inference about parameters of interest and the relationships they represent. Begin by reporting point estimates for all relevant parameters, including regression coefficients, odds ratios, hazard ratios, or other effect measures appropriate to your model type. Present these with appropriate measures of uncertainty, typically confidence intervals and p-values from hypothesis tests.
Interpret each parameter estimate in the context of your research question and in language accessible to your intended audience. For linear regression coefficients, explain the expected change in the outcome associated with a one-unit change in the predictor, holding other variables constant. For logistic regression, interpret odds ratios or convert to more intuitive probability scales for specific covariate values. For survival models, explain hazard ratios in terms of relative risk over time.
Attend carefully to the distinction between statistical significance and practical significance. Statistically significant effects may be too small to matter in practice, particularly with large samples, while non-significant effects may still be substantively important, especially when confidence intervals are wide due to limited power. Report and discuss both the magnitude and precision of estimates rather than focusing exclusively on whether p-values fall below arbitrary thresholds.
Consider the multiple testing problem if you're conducting numerous hypothesis tests. When testing many hypotheses simultaneously, some will appear significant purely by chance. Address this through appropriate multiple testing corrections such as Bonferroni, Holm, or false discovery rate (FDR) methods, or through a hierarchical testing strategy that prioritizes certain comparisons. Alternatively, distinguish clearly between confirmatory tests of pre-specified hypotheses and exploratory analyses that generate hypotheses for future research.
For predictive models, generate predictions for new observations or for specific covariate profiles of interest. Provide prediction intervals that appropriately capture uncertainty, recognizing that prediction uncertainty includes both estimation uncertainty about parameters and inherent residual variation in individual observations. Visualize predictions across the range of key predictors to help readers understand model implications.
**Exploring Alternative Approaches to Support Inference**
Strengthen your inferences by demonstrating robustness through alternative analytical approaches. A finding that persists across multiple reasonable modeling strategies is more credible than one that depends critically on specific modeling choices. This triangulation of evidence provides readers with greater confidence in your conclusions.
Conduct sensitivity analyses that explore how results change under different assumptions. Fit variants of your model that include or exclude potential confounders, use different functional forms for continuous predictors, apply different transformations to the outcome, or employ alternative link functions. If conclusions remain substantively similar across these variations, you can be more confident in their validity. If results are sensitive to specific modeling choices, acknowledge this and discuss which specification is most defensible based on theory and empirical evidence.
For causal inference questions, implement multiple analytical strategies if possible. Combine regression adjustment with propensity score methods, instrumental variables, difference-in-differences, or regression discontinuity designs depending on your data structure and research design. Agreement across methods that rely on different identifying assumptions substantially strengthens causal claims.
Employ resampling methods such as bootstrap or permutation tests to validate your inferential conclusions, particularly when sample sizes are modest or distributional assumptions are questionable. The bootstrap provides a way to estimate sampling distributions and standard errors without relying on parametric assumptions, while permutation tests offer exact significance tests for certain hypotheses.
Conduct subgroup analyses to examine whether relationships are consistent across different populations or contexts within your data. While these are exploratory and should be interpreted cautiously due to reduced power and multiple testing concerns, they can reveal important heterogeneity in effects and generate hypotheses about effect moderation that deserve investigation in future studies.
Implement cross-validation or other hold-out validation procedures for predictive models to honestly assess generalization performance. K-fold cross-validation, leave-one-out cross-validation, or train-test splits allow you to evaluate how well your model performs on data it hasn't seen during training. This is essential for claims about predictive utility and for comparing the predictive performance of different modeling approaches.
If you have access to multiple datasets addressing similar questions, consider replication analyses that fit your model to independent data. Successful replication provides the strongest possible evidence for the robustness and generalizability of your findings, while failures to replicate may indicate that initial results were sample-specific or resulted from chance variation.
For Bayesian analyses, conduct prior sensitivity analyses that examine how posterior inferences change under different prior specifications. If conclusions are similar under a range of reasonable priors, inference is robust to prior specification. If posteriors are highly sensitive to prior choice, either collect more data to allow the likelihood to dominate or acknowledge that definitive conclusions require stronger prior information.
### Phase 5: Conclusions and Recommendations
**Synthesizing Findings into Actionable Recommendations**
In concluding your analysis, synthesize your findings into clear, actionable recommendations that directly address the original research questions or practical problems that motivated the investigation. Avoid simply restating results; instead, interpret their meaning and implications for theory, policy, or practice.
Connect your statistical findings back to the substantive domain, explaining what your results mean for real-world phenomena. If you've found that a particular intervention has a significant positive effect, discuss what decision-makers should do with this information. If you've built a predictive model, explain how it should be deployed and what level of performance users can expect in practice.
Prioritize your recommendations by importance and strength of evidence. Some findings will be central to your research questions and supported by robust evidence across multiple analyses, while others may be more peripheral or tentative. Help readers understand which conclusions are most secure and which require additional confirmation before being acted upon.
Acknowledge uncertainty in your recommendations. Statistical analysis rarely provides absolute certainty, and honest acknowledgment of uncertainty better serves decision-makers than false precision. Describe the range of plausible effects indicated by confidence intervals and discuss how remaining uncertainty might affect decisions.
If your analysis revealed unexpected findings, discuss their potential significance and implications for existing theory or practice. Surprising results often represent the most important scientific contributions, but they also require more scrutiny and replication before being accepted with high confidence.
Consider differential implications for different stakeholders or contexts. A finding that suggests one course of action for one group might have different implications for another, and careful analysis should recognize this heterogeneity in drawing conclusions.
**Acknowledging Limitations with Specificity and Candor**
Every analysis has limitations, and comprehensive acknowledgment of these limitations actually strengthens rather than weakens your work by demonstrating careful scientific reasoning and helping readers appropriately calibrate their confidence in your conclusions. Move beyond generic limitations to provide specific, honest assessment of factors that may limit the validity or generalizability of your findings.
Discuss limitations related to your data source and sampling. Is your sample representative of the population to which you wish to generalize, or might selection bias limit external validity? Are there important subgroups underrepresented or absent from your data? Does non-response or attrition introduce potential bias? Are key variables measured with error or missing for substantial proportions of observations?
Address methodological limitations in your analytical approach. Which assumptions of your chosen model are most questionable in your particular application? Are there known alternatives that might have advantages you couldn't exploit due to data constraints or computational limitations? Does the observational nature of your data limit causal inference, even if you've attempted to address confounding through statistical adjustment?
Consider limitations in measurement and operationalization. Do your variables capture the theoretical constructs of interest with high fidelity, or are they imperfect proxies? Are there important dimensions of concepts that your measures don't capture? Would different but equally defensible operationalizations lead to different conclusions?
Acknowledge temporal limitations. For cross-sectional data, note that you observe relationships at a single time point and cannot make claims about causal ordering or temporal dynamics. For longitudinal data, discuss whether your observation period is long enough to capture relevant changes and whether patterns might differ over longer time horizons.
Discuss limitations related to model complexity and specification. Have you potentially omitted important confounders or moderators due to data unavailability? Does your model impose functional form assumptions that, while reasonable, may not perfectly capture reality? Have you prioritized interpretability over predictive performance, or vice versa, and how might this choice limit certain uses of your findings?
For predictive models, clearly delineate the conditions under which predictions should be trusted and situations where the model may perform poorly. Discuss the training data's representativeness and how concept drift or distribution shift might affect performance when the model is deployed in different contexts or time periods.
Address limitations in statistical power if applicable. Underpowered studies may fail to detect truly important effects, and confidence intervals may be too wide to provide useful guidance. Non-significant findings in underpowered studies should be interpreted as inconclusive rather than as evidence of null effects.
**Charting a Path Forward: Future Research Directions**
Conclude by outlining specific steps that could address the limitations you've identified and advance understanding beyond what your current analysis achieved. This forward-looking discussion demonstrates scientific maturity and provides a roadmap for continuing research on important questions.
For data-related limitations, describe what improved data collection efforts would look like. Should future studies employ different sampling strategies to improve representativeness? Would longitudinal designs that track individuals over time provide stronger evidence than cross-sectional data? Are there key variables that should be measured but weren't available in your data? Would larger sample sizes enable detection of more subtle effects or more complex modeling?
Recommend methodological innovations or alternative analytical approaches that might overcome current limitations. Are there emerging statistical methods that would better address the particular challenges your data present? Would experimental or quasi-experimental designs provide stronger causal evidence? Could different modeling frameworks accommodate complexities that your current approach handles imperfectly?
Suggest directions for extending your findings. What related research questions naturally follow from your results? Are there important moderators or boundary conditions that should be explored? Would replication in different populations or contexts test the generalizability of your findings? Are there theoretical mechanisms linking your variables that require further investigation?
For applied work, discuss how implementation research could assess the effectiveness of your recommendations in practice. Statistical findings that seem promising in analysis may encounter challenges when deployed in real-world contexts, and careful evaluation of implementation is crucial for evidence-based practice.
Consider interdisciplinary connections that might enrich future investigation of your research questions. Would combining your quantitative approach with qualitative methods provide richer understanding? Could insights from other disciplines inform better model specification or theoretical development?
If your work identified measurement limitations, suggest how instrument development or validation studies could improve future research. Better measurement is often the key to scientific progress, and acknowledging measurement challenges while proposing solutions contributes meaningfully to your field.
Discuss how emerging data sources or technologies might enable future research that wasn't possible for your current analysis. Could sensor data, administrative records, natural language processing of text data, or other innovations provide new windows into your research questions?
Finally, contextualize your work within the broader scientific enterprise. Position your analysis as one contribution within an accumulating body of evidence, acknowledging what remains to be learned and how the field should collectively move forward to advance understanding.
This expanded structure provides a comprehensive framework for conducting and presenting rigorous statistical analysis, emphasizing transparency, methodological awareness, and careful reasoning at every stage of the research process.
------------------------------------------------------------------------
## One Summary Table (Packages)
**Summary Statistics Table**
- [qwraps2](https://cran.r-project.org/web/packages/qwraps2/vignettes/summary-statistics.html)
- [vtable](https://cran.r-project.org/web/packages/vtable/vignettes/sumtable.html)
- [gtsummary](http://www.danieldsjoberg.com/gtsummary/)
- [apaTables](https://cran.r-project.org/web/packages/apaTables/apaTables.pdf)
- [stargazer](https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf)
**Regression Table**
- [gtsummary](http://www.danieldsjoberg.com/gtsummary/)
- [sjPlot, sjmisc, sjlabelled](https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html)
- [stargazer](https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf): recommended ([Example](https://www.jakeruss.com/cheatsheets/stargazer/))
- [modelsummary](https://github.com/vincentarelbundock/modelsummary#a-simple-example)
```{r, message=FALSE, warning=FALSE}
# Core packages we will use throughout the chapter
library(jtools)
library(broom)
library(dplyr)
library(ggplot2)
library(lmtest)
library(sandwich)
set.seed(123)
```
## Exploratory Analysis
We begin with EDA to understand variable distributions, relationships, and potential data issues (outliers, missingness, skew)[^40-report-1]. The `jtools::movies` dataset offers a realistic setting with continuous and discrete variables relevant to business/creative outcomes.
[^40-report-1]: For further details on exploratory analysis, see the next chapter.
Key steps:
- Inspect distributions (histograms/densities)
- Examine pairwise relationships (scatterplots, correlation)
- Flag outliers and influential observations
```{r, message=FALSE, warning=FALSE}
data(movies, package = "jtools")
# Minimal wrangling for illustration
movies_small <- movies %>%
select(metascore, budget, us_gross, year, runtime) %>%
filter(complete.cases(.))
summary(movies_small)
```
Figure \@ref(fig:report-eda-distributions) shows the marginal distributions of the four key variables in the movies sample.
```{r report-eda-distributions, fig.cap="Marginal distributions of metascore, budget, US gross, and runtime in the movies sample.", message=FALSE}
# Distribution plots (log scale for highly skewed financials)
library(tidyr)
movies_long <- movies_small %>%
pivot_longer(cols = c(metascore, budget, us_gross, runtime),
names_to = "variable", values_to = "value")
ggplot(movies_long, aes(value)) +
facet_wrap(~ variable, scales = "free") +
geom_histogram(bins = 30, fill = "#3c8dbc", color = "white") +
scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
labs(title = "Distributions of Key Variables",
x = NULL, y = "Count") +
theme_bw(base_size = 12)
```
```{r}
# Pairwise relationships: simple scatter matrix
if (requireNamespace("GGally", quietly = TRUE)) {
GGally::ggpairs(
movies_small %>% mutate(across(c(budget, us_gross), log1p)),
columns = c("metascore","budget","us_gross","runtime","year"),
upper = list(continuous = GGally::wrap("cor", size = 3)),
lower = list(continuous = GGally::wrap("points", alpha = .5, size = .7)),
diag = list(continuous = GGally::wrap("barDiag", bins = 20))
) + theme_bw(base_size = 10)
}
```
```{r}
# Quick correlation table (with log transforms for skewed $ variables)
cor_mat <- movies_small %>%
mutate(across(c(budget, us_gross), log1p)) %>%
select(metascore, budget, us_gross, runtime, year) %>%
cor(use = "pairwise.complete.obs")
round(cor_mat, 3)
```
```{r}
# Outlier & influence screening (pre-model)
base_fit <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = movies_small)
infl <- influence.measures(base_fit)
summary(infl)
# Flag observations with large Cook's distance or high leverage
diag_df <- tibble(
.cooksd = cooks.distance(base_fit),
.hat = hatvalues(base_fit),
.resid = rstandard(base_fit)
) %>% mutate(id = row_number())
head(diag_df[order(-diag_df$.cooksd),], 10)
```
## Model
We illustrate a multiple linear regression model appropriate for continuous outcomes:
$$
y_i = \beta_0+ \beta_1 \log(\mathrm{budget}_i + 1)+ \beta_2 \log(\mathrm{us\_gross}_i + 1)+ \beta_3 \mathrm{runtime}_i+ \beta_4 \mathrm{year}_i+ \varepsilon_i .
$$
### Assumptions
- Linearity in parameters and approximately linear relationships after transformation
- Independence of errors (or appropriately modeled dependence/clustered SEs)
- Homoskedasticity of errors (or robust SEs)
- Approximately normal errors for t tests/intervals (large-sample robustness often sufficient)
### Why this model?
- `metascore` is continuous; OLS is a natural baseline.
- Financial variables are **skewed**; log-transform helps stabilize variance and linearize relationships.
- `year` and `runtime` capture secular trends and content length.
### Considerations
- Interactions: e.g., does budget effectiveness differ by year?
- Collinearity: budget and gross can be correlated; check VIF.
- Dependence: panel or clustered structures (franchise, studio) may require cluster-robust SEs.
```{r}
fit <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = movies_small)
# Core jtools summary (unstandardized)
summ(fit)
```
```{r}
# Enhanced jtools summary: standardized coefs, VIFs, semi-partial correlations, CIs
summ(
fit,
scale = TRUE, # standardized betas
vifs = TRUE, # collinearity diagnostics
part.corr = TRUE, # semi-partial (part) correlations
confint = TRUE,
pvals = TRUE
)
```
```{r}
# Interactions: visualize and test
fit_int <- lm(metascore ~ log1p(budget)*year + log1p(us_gross) + runtime, data = movies_small)
# Create the transformed variable in your dataset
movies_small$log_budget <- log1p(movies_small$budget)
# Fit the model with the new variable
fit_int <- lm(metascore ~ log_budget*year + log1p(us_gross) + runtime,
data = movies_small)
# Now visualize
interactions::interact_plot(fit_int,
pred = year,
modx = log_budget,
interval = TRUE,
plot.points = TRUE) +
theme_bw(base_size = 12)
```
```{r}
# Collinearity (car::vif)
car::vif(fit)
```
### Model Fit
We report $R^2$, adjusted $R^2$, residual standard error (RSE), and information criteria (AIC/BIC). For nested models, we can use ANOVA; otherwise, compare AIC/BIC, cross-validated error, or out-of-sample performance.
We then examine residuals for normality, heteroskedasticity, and influential points. If assumptions are questionable, prefer **heteroskedasticity-robust** or **cluster-robust** standard errors.
```{r}
broom::glance(fit) %>%
select(r.squared, adj.r.squared, sigma, AIC, BIC, df, nobs)
```
Figure \@ref(fig:report-residual-diagnostics) shows the standard four-panel residual diagnostic plots for the fitted OLS model.
```{r report-residual-diagnostics, fig.cap="Four-panel residual diagnostics for the OLS model: residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage.", message=FALSE}
# Base residual diagnostics (four-panel plots)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```
```{r}
# Formal tests (use judiciously; they're sensitive to n)
if (requireNamespace("car", quietly = TRUE)) {
car::ncvTest(fit) # non-constant variance test
car::durbinWatsonTest(fit) # autocorrelation test (time ordering needed)
}
lmtest::bptest(fit) # Breusch-Pagan heteroskedasticity test
```
```{r}
# Partial residual plots
if (requireNamespace("car", quietly = TRUE)) {
car::crPlots(fit)
}
```
### Cluster-Robust Standard Errors
When heteroskedasticity or clustering is present, use **sandwich** estimators:
- HC0/HC1/HC2/HC3: robust to heteroskedasticity
- Cluster-robust (by firm, school, market, etc.) when errors correlate within clusters
Below: examples using `jtools::summ()` and `lmtest::coeftest()` with `sandwich`.
```{r}
# Heteroskedasticity-robust SEs for fit
lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC3"))
# jtools in one line
summ(fit, robust = "HC3", confint = TRUE)
```
```{r}
# Cluster-robust SEs using example data from 'sandwich'
data("PetersenCL", package = "sandwich")
fit2 <- lm(y ~ x, data = PetersenCL)
# Cluster on 'firm'
summ(fit2, robust = "HC3", cluster = "firm")
```
See Table \@ref(tab:report-sandwich-variants) for a quick reference:
| Type | Applicable | Usage | Notes/References |
|:---------|:-----------|:----------------|:-------------------------------------------------|
| const | iid | Homoskedastic | Assumes constant variance |
| HC/HC0 | vcovHC | Heteroskedastic | White's estimator [@white1980heteroskedasticity] |
| HC1 | vcovHC | Heteroskedastic | DoF correction [@mackinnon1985some] |
| HC2 | vcovHC | Heteroskedastic | Hat-matrix adjustment |
| HC3 | vcovHC | Heteroskedastic | Preferred for small $n$; hat-matrix adjustment |
| HC4/4m/5 | vcovHC | Heteroskedastic | Leverage-adaptive |
Table: (\#tab:report-sandwich-variants) Sandwich variants.
<!--# See vignette: https://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-CL.pdf -->
```{r}
# Another quick demo on built-in data; package= is needed because
# caret also exports a `cars` dataset with different columns.
data(cars, package = "datasets")
model <- lm(speed ~ dist, data = cars)
summary(model)
lmtest::coeftest(model, vcov. = sandwich::vcovHC(model, type = "HC1"))
```
### Model to Equation
Use `equatiomatic` to extract LaTeX-ready equations. If unavailable, we provide a fallback to print the fitted equation with estimated coefficients.
```{r, eval=FALSE}
# install.packages("equatiomatic") # not available for some R versions (e.g., R 4.2)
fit_eq <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = movies_small)
# Show the theoretical model
equatiomatic::extract_eq(fit_eq)
# Display the actual coefficients
equatiomatic::extract_eq(fit_eq, use_coefs = TRUE)
```
```{r}
# Fallback: build a simple equation string with estimated coefs
print_lm_equation <- function(mod) {
co <- coef(mod)
terms <- names(co)
terms_fmt <- ifelse(terms == "(Intercept)",
sprintf("%.3f", co[terms]),
paste0(sprintf("%.3f", co[terms]), " \\cdot ", terms))
rhs <- paste(terms_fmt, collapse = " + ")
asis <- paste0("$\\hat{y} = ", rhs, "$")
cat(asis, "\n")
}
print_lm_equation(fit)
```
## Model Comparison
Compare nested models via ANOVA (F-test), and non-nested via AIC/BIC, cross-validation, or predictive performance metrics. `jtools::export_summs()` offers attractive comparison tables.
```{r}
fit_a <- lm(metascore ~ log1p(budget), data = movies_small)
fit_b <- lm(metascore ~ log1p(budget) + log1p(us_gross), data = movies_small)
fit_c <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime, data = movies_small)
coef_names <- c("Budget" = "log1p(budget)",
"US Gross" = "log1p(us_gross)",
"Runtime (Hours)" = "runtime",
"Constant" = "(Intercept)")
export_summs(fit_a, fit_b, fit_c, robust = "HC3", coefs = coef_names)
```
```{r}
# AIC/BIC comparison
broom::glance(fit_a) %>% select(AIC, BIC, adj.r.squared) %>% mutate(model = "fit_a")
broom::glance(fit_b) %>% select(AIC, BIC, adj.r.squared) %>% mutate(model = "fit_b")
broom::glance(fit_c) %>% select(AIC, BIC, adj.r.squared) %>% mutate(model = "fit_c")
```
```{r}
# Nested model ANOVA (fit_a ⊂ fit_b ⊂ fit_c)
anova(fit_a, fit_b, fit_c)
```
## Changes in an Estimate
Visualize how coefficient estimates shift when adding controls. This is especially useful to show robustness to omitted variable bias concerns.
```{r}
coef_names_plot <- coef_names[1:3] # Dropping intercept for plots
plot_summs(fit_a, fit_b, fit_c, robust = "HC3", coefs = coef_names_plot)
```
```{r}
plot_summs(
fit_c,
robust = "HC3",
coefs = coef_names_plot,
plot.distributions = TRUE
)
```
### Coefficient Uncertainty and Distribution
Visualize uncertainty with either frequentist or Bayesian approaches. With frequentist OLS, we can simulate coefficient draws from the asymptotic sampling distribution using the estimated variance-covariance matrix and then plot with `ggplot2`. Figure \@ref(fig:report-coef-sampling-densities) shows the simulated sampling densities for selected coefficients.
```{r report-coef-sampling-densities, fig.cap="Simulated HC3-robust sampling densities for selected coefficients, drawn from a multivariate normal approximation to the asymptotic distribution.", message=FALSE}
# Simulate coefficient draws (multivariate normal approx)
if (requireNamespace("MASS", quietly = TRUE)) {
V <- vcovHC(fit_c, type = "HC3")
b <- coef(fit_c)
draws <- MASS::mvrnorm(n = 5000, mu = b, Sigma = V)
draws_df <- as.data.frame(draws) %>%
select(`log1p(budget)`, `log1p(us_gross)`, runtime)
draws_long <- tidyr::pivot_longer(draws_df, everything(), names_to = "term", values_to = "beta")
ggplot(draws_long, aes(beta)) +
facet_wrap(~ term, scales = "free") +
geom_density(fill = "#6baed6", alpha = 0.6) +
geom_vline(xintercept = 0, linetype = 2) +
labs(title = "Sampling Distributions of Selected Coefficients (HC3)",
x = "Coefficient value", y = "Density") +
theme_bw(base_size = 12)
}
```
## Descriptive Tables
Produce journal-quality descriptives and regression tables. Below are multiple options to match target outlet styles (APA, AER, WSJ, etc.)
```{r}
# Example with gtsummary: one-table overview
if (requireNamespace("gtsummary", quietly = TRUE)) {
library(gtsummary)
movies_small %>%
mutate(
across(c(budget, us_gross), log1p),
score_group = cut(metascore,
breaks = quantile(metascore, probs = c(0, .5, 1)),
include.lowest = TRUE,
labels = c("Low score", "High score"))
) %>%
select(metascore, budget, us_gross, runtime, year, score_group) %>%
tbl_summary(
by = score_group,
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = all_continuous() ~ 1
) %>%
add_overall() %>%
add_p() %>%
bold_labels()
}
```
```{r}
# modelsummary example
if (requireNamespace("modelsummary", quietly = TRUE)) {
library(modelsummary)
lm_mod <- lm(mpg ~ wt + hp + cyl, mtcars)
msummary(lm_mod, vcov = c("iid","robust","HC4"))
modelplot(lm_mod, vcov = c("iid","robust","HC4"))
}
```
The chunk below reports stargazer-formatted regression and correlation output for the `attitude` dataset.
```{r report-stargazer-attitude, message=FALSE}
# stargazer examples, including correlation and ASCII output
if (requireNamespace("stargazer", quietly = TRUE)) {
library(stargazer)
stargazer(attitude)
linear.1 <- lm(rating ~ complaints + privileges + learning + raises + critical, data = attitude)
linear.2 <- lm(rating ~ complaints + privileges + learning, data = attitude)
attitude$high.rating <- (attitude$rating > 70)
probit.model <- glm(high.rating ~ learning + critical + advance,
data = attitude,
family = binomial(link = "probit"))
stargazer(linear.1, linear.2, probit.model,
title = "Results",
align = TRUE)
# ASCII text output with CI
stargazer(
linear.1,
linear.2,
type = "text",
title = "Regression Results",
dep.var.labels = c("Overall Rating", "High Rating"),
covariate.labels = c(
"Handling of Complaints",
"No Special Privileges",
"Opportunity to Learn",
"Performance-Based Raises",
"Too Critical",
"Advancement"
),
omit.stat = c("LL", "ser", "f"),
ci = TRUE,
ci.level = 0.90,
single.row = TRUE
)
# Correlation table
correlation.matrix <- cor(attitude[, c("rating", "complaints", "privileges")])
stargazer(correlation.matrix, title = "Correlation Matrix")
}
```
The template below shows the LaTeX-formatted stargazer template that can be enabled when producing a typeset manuscript.
```{r report-stargazer-latex, eval=FALSE}
# LaTeX output (uncomment to use)
stargazer(
linear.1,
linear.2,
probit.model,
title = "Regression Results",
align = TRUE,
dep.var.labels = c("Overall Rating", "High Rating"),
covariate.labels = c(
"Handling of Complaints",
"No Special Privileges",
"Opportunity to Learn",
"Performance-Based Raises",
"Too Critical",
"Advancement"
),
omit.stat = c("LL", "ser", "f"),
no.space = TRUE
)
```
### Export APA theme (flextable)
The chunk below demonstrates an APA-styled flextable rendering of a five-row, five-column slice of `mtcars`.
```{r report-apa-flextable, eval = FALSE}
data("mtcars")
library(flextable)
theme_apa(flextable(mtcars[1:5,1:5]))
```
You can export data frames to LaTeX via `xtable` and ready-made styles via `stargazer`. (Ensure output directory exists.) The snippet below shows the xtable-and-stargazer export idiom.
```{r report-export-latex, eval=FALSE}
print(xtable::xtable(mtcars, type = "latex"),
file = file.path(getwd(), "output", "mtcars_xtable.tex"))
# American Economic Review style
stargazer::stargazer(
mtcars,
title = "Testing",
style = "aer",
out = file.path(getwd(), "output", "mtcars_stargazer.tex")
)
```
However, some exporters don't play well with **table notes**. Below is a custom function following AMA-style notes placement.
```{r, eval = FALSE}
ama_tbl <- function(data, caption, label, note, output_path) {
library(tidyverse)
library(xtable)
# Function to determine column alignment
get_column_alignment <- function(data) {
# xtable align requires length ncol + 1; first is for rownames
alignment <- c("l", "l")
for (col in seq_len(ncol(data))[-1]) {
if (is.numeric(data[[col]])) {
alignment <- c(alignment, "r")
} else {
alignment <- c(alignment, "c")
}
}
alignment
}
data %>%
# bold + left align first column
rename_with(~paste0("\\\\multicolumn{1}{l}{\\\\textbf{", ., "}}"), 1) %>%
# bold + center align all other columns
`colnames<-`(ifelse(colnames(.) != colnames(.)[1],
paste0("\\\\multicolumn{1}{c}{\\\\textbf{", colnames(.), "}}"),
colnames(.))) %>%
xtable(caption = caption,
label = label,
align = get_column_alignment(data),
auto = TRUE) %>%
print(
include.rownames = FALSE,
caption.placement = "top",
hline.after = c(-1, 0),
add.to.row = list(
pos = list(nrow(data)),
command = c(
paste0("\\\\hline \n \\\\multicolumn{", ncol(data),
"}{l}{ \n \\\\begin{tabular}{@{}p{0.9\\\\linewidth}@{}} \n",
"Note: ", note, "\n \\\\end{tabular} } \n")
)
),
sanitize.colnames.function = identity,
table.placement = "h",
file = output_path
)
}
```
```{r, eval = FALSE}
ama_tbl(
mtcars,
caption = "This is caption",
label = "tab:this_is_label",
note = "this is note",
output_path = file.path(getwd(), "output", "mtcars_custom_ama.tex")
)
```
## Visualizations & Plots
Customize plots to match target journal aesthetics. Below we provide an **American Marketing Association--ready** theme and examples. (Change fonts on your system as needed.)
```{r}
# Base ggplot setup
library(ggplot2)
```
```{r}
# AMA-inspired theme (serif base, clean grid)
amatheme <- theme_bw(base_size = 14, base_family = "serif") +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
line = element_line(),
text = element_text(),
legend.title = element_text(size = rel(0.6), face = "bold"),
legend.text = element_text(size = rel(0.6)),
legend.background = element_rect(color = "black"),
plot.title = element_text(
size = rel(1.2),
face = "bold",
hjust = 0.5,
margin = margin(b = 15)
),
plot.margin = unit(c(1, 1, 1, 1), "cm"),
axis.line = element_line(colour = "black", linewidth = .8),
axis.ticks = element_line(),
axis.title.x = element_text(size = rel(1.2), face = "bold"),
axis.title.y = element_text(size = rel(1.2), face = "bold"),
axis.text.y = element_text(size = rel(1)),
axis.text.x = element_text(size = rel(1))
)
```
Figure \@ref(fig:report-amatheme-npg) shows the example mtcars scatterplot rendered with the custom AMA-style theme and NPG palette.
```{r report-amatheme-npg, fig.cap="MPG against cylinders by gear, styled with the custom AMA theme and the NPG color palette.", message=FALSE}
# Example plot
library(tidyverse)
library(ggsci)
data("mtcars")
yourplot <- mtcars %>%
select(mpg, cyl, gear) %>%
ggplot(aes(x = mpg, y = cyl, color = factor(gear))) +
geom_point(size = 2, alpha = .8) +
labs(title = "Example Plot", x = "MPG", y = "Cylinders", color = "Gears")
yourplot + amatheme + scale_color_npg()
```
```{r}
yourplot + amatheme + scale_color_viridis_d()
```
```{r}
# Other pre-specified themes
library(ggthemes)
yourplot + theme_stata()
yourplot + theme_economist()
yourplot + theme_economist_white()
yourplot + theme_wsj()
# APA-like theme from jtools
jtools::theme_apa(
legend.font.size = 12,
x.font.size = 12,
y.font.size = 12
)
yourplot + jtools::theme_apa()
```
## One-Table Summary
A pragmatic approach for business audiences: show **descriptives** and **key regression** side-by-side (or sequentially) in a single, coherent section. This can be achieved with `gtsummary::tbl_merge()` combining `tbl_summary()` and `tbl_regression()`.
```{r}
if (requireNamespace("gtsummary", quietly = TRUE) &&
requireNamespace("gt", quietly = TRUE)) {
library(gtsummary)
library(gt)
mod_for_tbl <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = movies_small)
tbl1 <- movies_small %>%
mutate(across(c(budget, us_gross), log1p)) %>%
select(metascore, budget, us_gross, runtime, year) %>%
tbl_summary(
statistic = all_continuous() ~ "{mean} ({sd})",
digits = all_continuous() ~ 1
) %>%
bold_labels()
tbl2 <- tbl_regression(mod_for_tbl, exponentiate = FALSE) %>%
add_glance_table(include = c(r.squared, adj.r.squared, AIC, BIC), label = list(r.squared ~ "R^2",
adj.r.squared ~ "Adj. R^2"))
tbl_merge(
tbls = list(tbl1, tbl2),
tab_spanner = c("**Descriptive Statistics**", "**Regression Results**")
) %>%
as_gt() %>%
tab_source_note(source_note = md("SEs are HC3-robust unless otherwise noted."))
}
```
## Inference / Prediction
Beyond classical t-tests:
- Bootstrap CIs (resample rows)
- Permutation tests (randomize treatment/feature under $H_0$)
- Cross-validation for predictive accuracy (RMSE/MAE)
- Prediction intervals for new observations
```{r}
# Bootstrap coefficient CIs (percentile) using boot
if (requireNamespace("boot", quietly = TRUE)) {
library(boot)
boot_fun <- function(data, idx) {
m <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = data[idx, , drop = FALSE])
coef(m)
}
bt <- boot(movies_small, statistic = boot_fun, R = 1000)
# Percentile CI for log1p(budget)
boot.ci(bt, type = "perc", index = which(names(coef(fit)) == "log1p(budget)"))
}
```
```{r}
# Simple K-fold CV (caret) for RMSE
if (requireNamespace("caret", quietly = TRUE)) {
library(caret)
ctrl <- trainControl(method = "cv", number = 5)
cv_fit <- train(
metascore ~ log1p(budget) + log1p(us_gross) + runtime + year,
data = movies_small,
method = "lm",
trControl = ctrl
)
cv_fit
}
```
```{r}
# Prediction intervals for new data
newdat <- tibble(
budget = median(movies_small$budget, na.rm = TRUE),
us_gross = median(movies_small$us_gross, na.rm = TRUE),
runtime = median(movies_small$runtime, na.rm = TRUE),
year = median(movies_small$year, na.rm = TRUE)
)
predict(fit, newdata = newdat, interval = "prediction", level = 0.95)
```
------------------------------------------------------------------------
## Appendix: Reproducible Snippets
```{r}
# jtools: baseline summary again for traceability
data(movies)
fit <- lm(metascore ~ budget + us_gross + year, data = movies)
summ(fit)
summ(
fit,
scale = TRUE,
vifs = TRUE,
part.corr = TRUE,
confint = TRUE,
pvals = FALSE
) # notice that scale here is TRUE
# obtain cluster-robust SE
data("PetersenCL", package = "sandwich")
fit2 <- lm(y ~ x, data = PetersenCL)
summ(fit2, robust = "HC3", cluster = "firm")
```
```{r, eval = FALSE}
# Model to Equation via equatiomatic (if available)
# install.packages("equatiomatic") # not available for R 4.2
fit <- lm(metascore ~ budget + us_gross + year, data = movies)
# show the theoretical model