Hinna0818/UKBAnalytica

UKBAnalytica: Scalable Phenotyping and Statistical Pipeline for UK Biobank RAP Data Analysis

License: MIT

UKBAnalytica is a high-performance R package for working with UK Biobank Research Analysis Platform (RAP) data inside approved RAP projects. It focuses on standardized phenotyping, survival-ready datasets, scalable preprocessing, and downstream analysis.

Attention: The package does not ship UK Biobank participant-level source records; examples use field IDs, simulated toy data, or user-provided tables that remain within RAP-controlled storage.

For details, see the full documentation for UKBAnalytica.

Installation

You can install the development version of UKBAnalytica from GitHub with:

# install.packages("devtools")
devtools::install_github("Hinna0818/UKBAnalytica")

Quick start

library(UKBAnalytica)
library(data.table)

## suppose a participant-level table is available within your approved RAP project
ukb_data <- fread("population.csv")

diseases <- get_predefined_diseases()[
  c("AA", "Hypertension", "Diabetes")
]

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "Death"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  primary_disease = "AA",
  show_flow = TRUE,
  dt_threads = 8
)

Disease Definition Sources

UKBAnalytica builds disease phenotypes by taking the earliest valid evidence from multiple UK Biobank sources. Predefined disease definitions are available through get_predefined_diseases(), and custom definitions can be created with create_disease_definition().

Supported sources include:

  • ICD10: hospital inpatient diagnosis codes.
  • ICD9: historical hospital inpatient diagnosis codes.
  • Self-report: touchscreen/verbal interview illness codes.
  • Death: primary or contributory death-cause ICD-10 codes.
  • OPCS4: hospital operative procedure codes.
  • CancerRegistry: cancer registry diagnosis dates and ICD-10 morphology information.
  • FirstOccurrence: UKB first-occurrence date fields derived from linked health records.
  • Algorithm: UKB algorithmically-defined outcomes, such as myocardial infarction, stroke, dementia, asthma, COPD, and Parkinson's disease.

If a selected source is not defined for a disease, it is ignored automatically. For example, cancer registry evidence is used only when the disease definition has a cancer_icd10_pattern; procedure evidence is used only when opcs4_pattern is defined.
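The "earliest valid evidence" idea can be illustrated with a minimal base-R sketch on simulated data. This is not the package's implementation: the helper `earliest_evidence()`, the long-format `evidence` table, and its column names are hypothetical and exist only for this example. A selected source with no matching rows (here `CancerRegistry`) simply contributes nothing, which mirrors the "ignored automatically" behaviour described above.

```r
# Simulated long-format evidence: one row per (participant, source) hit.
# All IDs and dates are made up; no UK Biobank data is involved.
evidence <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  source = c("ICD10", "Self-report", "ICD10", "Death", "OPCS4"),
  date   = as.Date(c("2012-05-01", "2009-03-15", "2015-07-20",
                     "2018-01-02", "2011-11-11")),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of UKBAnalytica): keep the selected
# sources, then take the earliest date per participant.
earliest_evidence <- function(evidence, sources) {
  kept <- evidence[evidence$source %in% sources, , drop = FALSE]
  if (nrow(kept) == 0) {
    return(data.frame(id = numeric(0), first_date = as.Date(character(0))))
  }
  # Work on the numeric day count so the minimum is well defined,
  # then convert back to Date.
  first_num <- tapply(as.numeric(kept$date), kept$id, min)
  data.frame(
    id         = as.numeric(names(first_num)),
    first_date = as.Date(as.numeric(first_num), origin = "1970-01-01")
  )
}

# "CancerRegistry" has no rows in the toy table, so it has no effect.
res <- earliest_evidence(evidence, c("ICD10", "Self-report", "CancerRegistry"))
print(res)
```

Participant 3 has only `OPCS4` evidence, which was not selected, so that participant does not appear in the result.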

Minimal RAP Extraction Workflow

Run the following inside a UK Biobank RAP R session. Participant-level data should remain inside approved RAP projects and RAP-controlled storage.

library(UKBAnalytica)

dataset <- rap_find_dataset()
fields <- rap_list_fields()

meta <- ukb_metadata_setup(fields_df = fields)

ids <- get_variable_set("clinical_core", output = "field_id")

dt <- ukb_extract_fields(
  field_id = ids,
  metadata = meta,
  mode = "sync",
  strip_entity_prefix = FALSE
)

dt <- ukb_decode(dt, metadata = meta)
dt <- ukb_clean_missing(dt, action = "na")

ukb_snapshot(dt, "Clinical core extracted and cleaned")

Basic Survival Dataset

disease_defs <- get_predefined_diseases()[c("Hypertension", "Diabetes", "Stroke")]

analysis_dt <- build_survival_dataset(
  dt = analysis_input,
  disease_definitions = disease_defs,
  primary_disease = "Stroke",
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "Death", "FirstOccurrence"),
  outcome_sources = c("ICD10", "ICD9", "Death", "FirstOccurrence"),
  baseline_col = "p53_i0",
  show_flow = TRUE
)

Here analysis_input should already contain baseline variables and the diagnosis/date columns required by the selected sources. The output contains one *_history column per disease, one *_incident column per disease, and for the primary disease a standard survival endpoint: outcome_status and outcome_surv_time.
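To make the meaning of these output columns concrete, here is a minimal sketch on simulated toy data. It is not how build_survival_dataset() is implemented: the helper `derive_endpoint()`, the censoring date, the rule that evidence dated on or before baseline counts as history, and the use of years as the time unit are all assumptions made for illustration.

```r
# Simulated toy input: baseline date and earliest diagnosis date
# (NA means no evidence was found). No UK Biobank data is involved.
toy <- data.frame(
  id         = 1:4,
  baseline   = as.Date(c("2008-01-10", "2009-06-01", "2010-03-20", "2007-09-05")),
  first_date = as.Date(c("2005-02-01", "2014-06-01", NA, "2007-09-05")),
  stringsAsFactors = FALSE
)
censor_date <- as.Date("2020-12-31")  # hypothetical administrative censoring date

# Hypothetical helper (not part of UKBAnalytica).
derive_endpoint <- function(d, censor_date) {
  has_dx <- !is.na(d$first_date)
  # Assumption: evidence on or before baseline is prevalent (history).
  d$Stroke_history  <- as.integer(has_dx & d$first_date <= d$baseline)
  # Assumption: evidence after baseline and up to censoring is incident.
  d$Stroke_incident <- as.integer(has_dx & d$first_date > d$baseline &
                                    d$first_date <= censor_date)
  d$outcome_status  <- d$Stroke_incident
  # Follow-up ends at the event date for incident cases, else at censoring.
  end_num <- ifelse(d$Stroke_incident == 1,
                    as.numeric(d$first_date),
                    as.numeric(censor_date))
  # Assumption: follow-up time is expressed in years.
  d$outcome_surv_time <- (end_num - as.numeric(d$baseline)) / 365.25
  d
}

out <- derive_endpoint(toy, censor_date)
print(out[, c("id", "Stroke_history", "Stroke_incident",
              "outcome_status", "outcome_surv_time")])
```

In this sketch the follow-up time is computed for every row; prevalent cases would normally be excluded before modeling. A status/time pair of this shape can be passed to `survival::Surv(time, event)` when fitting a Cox model.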

AI Agent Skills for UKB data analyses

UKBAnalytica ships a curated set of AI agent skills under inst/skills/UKBAnalytica_skills/. Each skill is a self-contained prompt document that a Claude Code agent (or any compatible AI assistant) can load to generate RAP-ready R scripts, plan analysis workflows, and interpret aggregate outputs, all without ever seeing real participant-level data.

| Skill | Phase | Coverage |
|---|---|---|
| ukbsci-rap-extract | P2 | Discover UKB fields and execute extractions via dx extract_dataset (sync) or table-exporter (async) |
| ukbsci-cohort | P2 | Define disease phenotypes from ICD-10/9, self-report, death, OPCS4, cancer registry, and build Cox-ready survival datasets |
| ukbsci-workflow | P2 | End-to-end study orchestrator: produces a phased plan and calls sub-skills in order |
| ukbsci-regression | P3 | Batch linear / logistic / Cox / GLM / negative-binomial / GAM regression; unified run_regression() interface; PH diagnostics; lag sensitivity; Fine-Gray competing risks; trend tests |
| ukbsci-survival | P3 | Kaplan-Meier curves with risk tables and log-rank p-values |
| ukbsci-baseline | P3 | Stratified Table 1 (baseline characteristics) via tableone |
| ukbsci-propensity | P4 | Propensity scores, PSM, IPTW (ATE/ATT/ATC), balance diagnostics, weighted regression |
| ukbsci-mediation | P4 | Causal mediation analysis (4-way decomposition); single and multi-mediator with sensitivity |
| ukbsci-subgroup-sensitivity | P4 | Subgroup × interaction tests across Cox / logistic / linear / GLM / negbin; complete-case and early-event sensitivity filters |
| ukbsci-imputation | P4 | Multiple imputation (mice) and Rubin's-rules pooling (mitools) with FMI diagnostics |
| ukbsci-metabolomics | P5 | Nightingale NMR metabolite ORA: name mapping, classification, custom or MetaboAnalystR backend, dot/bar plot visualization |
| ukbsci-proteomics | P5 | UKB Olink / UKB-PPP: ID mapping, GO/KEGG ORA, STRING PPI, community detection |
| ukbsci-ml | P5 | End-to-end ML workflows (classification + survival): split, feature select, tune, fit, evaluate, SHAP |
| ukbsci-preprocess | P5 | Variable cleaning, composite-variable builders (BP, air pollution, diet score), variable-set catalog |
| ukbsci-plot | P6 | Manuscript figures: forest, volcano, calibration; shared neutral theme/palettes; multi-format save helper |

Data privacy boundary

All skills enforce a strict script-generation-only boundary:

  • What the agent receives: column names, variable roles, intended analysis design, and aggregate outputs (flow counts, coefficient tables, model metrics, enrichment results, non-identifying figures).
  • What the agent must never receive: real UKB participant rows, eid values, exact dates, raw RAP fields, row-level predictions, SHAP matrices, screenshots, or log excerpts containing row-level values.

The workflow is: describe your schema → the agent generates an R script → you run the script inside RAP → you share only aggregate results back for interpretation. Real participant-level data should remain inside the approved RAP project and RAP-controlled storage at all times.
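As one way to carry out the "describe your schema" step, the sketch below summarizes a table by column name and class only, so that no row-level values leave the session. The helper `describe_schema()` is hypothetical and not part of UKBAnalytica, and the toy table is simulated.

```r
# Simulated stand-in for a participant-level table; all values are made up.
toy <- data.frame(
  eid            = c(1001, 1002, 1003),
  age            = c(54, 61, 47),
  sex            = factor(c("Female", "Male", "Female")),
  outcome_status = c(0L, 1L, 0L)
)

# Hypothetical helper: return column names and classes only.
# No cell values (including eid values) appear in the result.
describe_schema <- function(d) {
  data.frame(
    column = names(d),
    class  = unname(vapply(d, function(x) class(x)[1], character(1))),
    stringsAsFactors = FALSE
  )
}

schema <- describe_schema(toy)
print(schema)
```

The resulting two-column table is the kind of description that can be shared with an agent, together with variable roles and the intended analysis design.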

Supplementary Materials

Here we provide some learning materials for UK Biobank in which you may be interested:
