UKBAnalytica is a high-performance R package for working with UK Biobank Research Analysis Platform (RAP) data inside approved RAP projects. It focuses on standardized phenotyping, survival-ready datasets, scalable preprocessing, and downstream analysis.
Attention: The package does not ship UK Biobank participant-level source records; examples use field IDs, simulated toy data, or user-provided tables that remain within RAP-controlled storage.
For details, see the full documentation for UKBAnalytica.
You can install the development version of UKBAnalytica from GitHub with:
# install.packages("devtools")
devtools::install_github("Hinna0818/UKBAnalytica")

library(UKBAnalytica)
library(data.table)
## suppose a participant-level table is available within your approved RAP project
ukb_data <- fread("population.csv")
diseases <- get_predefined_diseases()[
  c("AA", "Hypertension", "Diabetes")
]
analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "Death"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  primary_disease = "AA",
  show_flow = TRUE,
  dt_threads = 8
)

UKBAnalytica builds disease phenotypes by taking the earliest valid evidence
from multiple UK Biobank sources. Predefined disease definitions are available
through get_predefined_diseases(), and custom definitions can be created with
create_disease_definition().
Supported sources include:
- ICD10: hospital inpatient diagnosis codes.
- ICD9: historical hospital inpatient diagnosis codes.
- Self-report: touchscreen/verbal interview illness codes.
- Death: primary or contributory death-cause ICD-10 codes.
- OPCS4: hospital operative procedure codes.
- CancerRegistry: cancer registry diagnosis dates and ICD-10 morphology information.
- FirstOccurrence: UKB first-occurrence date fields derived from linked health records.
- Algorithm: UKB algorithmically-defined outcomes, such as myocardial infarction, stroke, dementia, asthma, COPD, and Parkinson's disease.

If a selected source is not defined for a disease, it is ignored automatically.
For example, cancer registry evidence is used only when the disease definition
has a cancer_icd10_pattern; procedure evidence is used only when
opcs4_pattern is defined.
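
The "earliest valid evidence" idea can be sketched on simulated data in base R. This is a toy illustration under stated assumptions, not UKBAnalytica's internal implementation: the helper earliest_evidence() and the per-source date columns are hypothetical names introduced here, and a source without a column (OPCS4 below) stands in for a source that is not defined for the disease.

```r
# Toy illustration only: not UKBAnalytica's internal implementation.
# Each column holds the first date a simulated participant met the
# definition in one source; NA means no evidence in that source.
toy <- data.frame(
  id          = 1:4,
  ICD10       = as.Date(c("2012-05-01", NA, "2015-03-10", NA)),
  Self_report = as.Date(c("2009-01-15", NA, NA, "2008-07-01")),
  Death       = as.Date(c(NA, NA, "2016-01-01", NA))
)

# Hypothetical helper: earliest non-missing date across the selected sources.
# Requested sources that have no column in `df` are skipped silently.
earliest_evidence <- function(df, sources) {
  available <- intersect(sources, names(df))
  if (length(available) == 0L) {
    return(rep(as.Date(NA), nrow(df)))
  }
  # Compare numeric day counts so pmin() can drop NAs row by row;
  # a row with no evidence in any source stays NA.
  day_counts <- unname(lapply(df[available], as.numeric))
  earliest <- do.call(pmin, c(day_counts, list(na.rm = TRUE)))
  as.Date(earliest, origin = "1970-01-01")
}

# "OPCS4" is requested but absent from the toy table, so it is ignored.
toy$first_date <- earliest_evidence(
  toy, c("ICD10", "Self_report", "Death", "OPCS4")
)
print(toy)
```

Participant 1 has both hospital and self-report evidence, so the earlier self-report date is kept; participant 2 has no evidence in any source and keeps a missing date.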
Run the following inside a UK Biobank RAP R session. Participant-level data should remain inside approved RAP projects and RAP-controlled storage.
library(UKBAnalytica)
dataset <- rap_find_dataset()
fields <- rap_list_fields()
meta <- ukb_metadata_setup(fields_df = fields)
ids <- get_variable_set("clinical_core", output = "field_id")
dt <- ukb_extract_fields(
  field_id = ids,
  metadata = meta,
  mode = "sync",
  strip_entity_prefix = FALSE
)
dt <- ukb_decode(dt, metadata = meta)
dt <- ukb_clean_missing(dt, action = "na")
ukb_snapshot(dt, "Clinical core extracted and cleaned")

disease_defs <- get_predefined_diseases()[c("Hypertension", "Diabetes", "Stroke")]
analysis_dt <- build_survival_dataset(
  dt = analysis_input,
  disease_definitions = disease_defs,
  primary_disease = "Stroke",
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "Death", "FirstOccurrence"),
  outcome_sources = c("ICD10", "ICD9", "Death", "FirstOccurrence"),
  baseline_col = "p53_i0",
  show_flow = TRUE
)

Here analysis_input should already contain baseline variables and the
diagnosis/date columns required by the selected sources. The output contains one
*_history column per disease, one *_incident column per disease, and for the
primary disease a standard survival endpoint: outcome_status and
outcome_surv_time.
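
To make the relationship between these output columns concrete, here is a minimal base R sketch on simulated dates. The rules used are assumptions for illustration, not the package's documented behaviour: evidence on or before baseline counts as history, evidence after baseline counts as incident, follow-up time is measured in years, and censor_date is a hypothetical end of follow-up. The Stroke_ prefix simply follows the *_history / *_incident naming pattern described above.

```r
# Toy illustration only; the classification and censoring rules below are
# assumptions, not UKBAnalytica's documented behaviour.
toy <- data.frame(
  id         = 1:4,
  baseline   = as.Date(c("2008-03-01", "2009-06-15", "2007-11-20", "2010-01-05")),
  first_date = as.Date(c("2005-02-10", "2014-09-01", NA, "2010-01-05"))
)
censor_date <- as.Date("2020-12-31")  # hypothetical end of follow-up

has_event <- !is.na(toy$first_date)

# History (prevalent): first evidence on or before the baseline date.
toy$Stroke_history <- as.integer(has_event & toy$first_date <= toy$baseline)
# Incident: first evidence strictly after the baseline date.
toy$Stroke_incident <- as.integer(has_event & toy$first_date > toy$baseline)

# Survival endpoint for the primary disease: event indicator plus time.
toy$outcome_status <- toy$Stroke_incident

# Follow-up ends at the event date for incident cases, else at censoring.
end_date <- rep(censor_date, nrow(toy))
is_incident <- toy$outcome_status == 1L
end_date[is_incident] <- toy$first_date[is_incident]

# Date differences are in days; convert to years.
toy$outcome_surv_time <- as.numeric(end_date - toy$baseline) / 365.25

print(toy)
```

In this sketch prevalent cases (participants 1 and 4) are kept with outcome_status 0; an analysis would typically exclude them before fitting a survival model.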
UKBAnalytica ships a curated set of AI agent skills under
inst/skills/UKBAnalytica_skills/. Each skill is a self-contained prompt
document that a Claude Code agent (or any compatible AI assistant) can load to
generate RAP-ready R scripts, plan analysis workflows, and interpret aggregate
outputs without ever seeing real participant-level data.
| Skill | Phase | Coverage |
|---|---|---|
| ukbsci-rap-extract | P2 | Discover UKB fields and execute extractions via dx extract_dataset (sync) or table-exporter (async) |
| ukbsci-cohort | P2 | Define disease phenotypes from ICD-10/9, self-report, death, OPCS4, cancer registry, and build Cox-ready survival datasets |
| ukbsci-workflow | P2 | End-to-end study orchestrator: produces a phased plan and calls sub-skills in order |
| ukbsci-regression | P3 | Batch linear / logistic / Cox / GLM / negative-binomial / GAM regression; unified run_regression() interface; PH diagnostics; lag sensitivity; Fine-Gray competing risks; trend tests |
| ukbsci-survival | P3 | Kaplan-Meier curves with risk tables and log-rank p-values |
| ukbsci-baseline | P3 | Stratified Table 1 (baseline characteristics) via tableone |
| ukbsci-propensity | P4 | Propensity scores, PSM, IPTW (ATE/ATT/ATC), balance diagnostics, weighted regression |
| ukbsci-mediation | P4 | Causal mediation analysis (4-way decomposition); single and multi-mediator with sensitivity |
| ukbsci-subgroup-sensitivity | P4 | Subgroup × interaction tests across Cox / logistic / linear / GLM / negbin; complete-case and early-event sensitivity filters |
| ukbsci-imputation | P4 | Multiple imputation (mice) and Rubin's-rules pooling (mitools) with FMI diagnostics |
| ukbsci-metabolomics | P5 | Nightingale NMR metabolite ORA: name mapping, classification, custom or MetaboAnalystR backend, dot/bar plot visualization |
| ukbsci-proteomics | P5 | UKB Olink / UKB-PPP: ID mapping, GO/KEGG ORA, STRING PPI, community detection |
| ukbsci-ml | P5 | End-to-end ML workflows (classification + survival): split, feature select, tune, fit, evaluate, SHAP |
| ukbsci-preprocess | P5 | Variable cleaning, composite-variable builders (BP, air pollution, diet score), variable-set catalog |
| ukbsci-plot | P6 | Manuscript figures: forest, volcano, calibration; shared neutral theme/palettes; multi-format save helper |

All skills enforce a strict script-generation-only boundary:
- What the agent receives: column names, variable roles, intended analysis design, and aggregate outputs (flow counts, coefficient tables, model metrics, enrichment results, non-identifying figures).
- What the agent must never receive: real UKB participant rows, eid values, exact dates, raw RAP fields, row-level predictions, SHAP matrices, screenshots, or log excerpts containing row-level values.

The workflow is: describe your schema → the agent generates an R script → you run the script inside RAP → you share only aggregate results back for interpretation. Real participant-level data should remain inside the approved RAP project and RAP-controlled storage at all times.
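
As a rough sketch of what "describe your schema" and "share only aggregate results" might look like in practice, the base R snippet below builds a column-level schema summary and an aggregate count table from a simulated table. It is an illustration under assumptions, not a function provided by UKBAnalytica: the toy table, its column names, and the objects schema and event_counts are all hypothetical.

```r
# Toy stand-in for a participant-level table; all values are simulated.
toy <- data.frame(
  eid            = 1:6,
  age            = c(52, 61, 47, 58, 66, 49),
  sex            = factor(c("F", "M", "F", "M", "F", "M")),
  outcome_status = c(0L, 1L, 0L, 0L, 1L, 0L)
)

# Schema summary: column names and types only, no row-level values.
schema <- data.frame(
  column = names(toy),
  class  = unname(vapply(toy, function(x) class(x)[1], character(1))),
  stringsAsFactors = FALSE
)

# Aggregate output: event counts by sex, with no identifiers attached.
event_counts <- aggregate(outcome_status ~ sex, data = toy, FUN = sum)

print(schema)
print(event_counts)
```

Only objects like schema and event_counts would be passed to the agent; the toy table itself, standing in for real participant rows, would stay inside RAP.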
Here we provide some learning materials for UK Biobank in which you may be interested:

