UKBAnalytica: Scalable Phenotyping and Statistical Pipeline for UK Biobank RAP Data Analysis

UKBAnalytica is a high-performance R package for working with UK Biobank Research Analysis Platform (RAP) data inside approved RAP projects. It focuses on standardized phenotyping, survival-ready datasets, scalable preprocessing, and downstream analysis.

Attention: The package does not ship UK Biobank participant-level source records; examples use field IDs, simulated toy data, or user-provided tables that remain within RAP-controlled storage.

For details, see the full documentation for UKBAnalytica.

Installation

You can install the development version of UKBAnalytica from GitHub with:

# install.packages("devtools")
devtools::install_github("Hinna0818/UKBAnalytica")

Quick start

library(UKBAnalytica)
library(data.table)

## Assume a participant-level table is available within your approved RAP project
ukb_data <- fread("population.csv")

diseases <- get_predefined_diseases()[
  c("AA", "Hypertension", "Diabetes")
]

analysis_dt <- build_survival_dataset(
  dt = ukb_data,
  disease_definitions = diseases,
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "Death"),
  outcome_sources = c("ICD10", "ICD9", "Death"),
  primary_disease = "AA",
  show_flow = TRUE,
  dt_threads = 8
)

Disease Definition Sources

UKBAnalytica builds disease phenotypes by taking the earliest valid evidence from multiple UK Biobank sources. Predefined disease definitions are available through get_predefined_diseases(), and custom definitions can be created with create_disease_definition().

Supported sources include:

ICD10: hospital inpatient diagnosis codes.
ICD9: historical hospital inpatient diagnosis codes.
Self-report: touchscreen/verbal interview illness codes.
Death: primary or contributory death-cause ICD-10 codes.
OPCS4: hospital operative procedure codes.
CancerRegistry: cancer registry diagnosis dates and ICD-10 morphology information.
FirstOccurrence: UKB first-occurrence date fields derived from linked health records.
Algorithm: UKB algorithmically-defined outcomes, such as myocardial infarction, stroke, dementia, asthma, COPD, and Parkinson's disease.

If a selected source is not defined for a disease, it is ignored automatically. For example, cancer registry evidence is used only when the disease definition has a cancer_icd10_pattern; procedure evidence is used only when opcs4_pattern is defined.
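
The "earliest valid evidence" rule and the automatic skipping of undefined sources can be illustrated with a small base R sketch on simulated data. This is not the package's internal implementation: the helper `earliest_evidence()` and the column names (`icd10_date`, `self_date`, `death_date`, `opcs4_date`) are hypothetical and chosen only for illustration.

```r
# Simulated toy data only: three made-up participants, one row each.
# Column names are hypothetical and do not mirror real UKB field names.
toy <- data.frame(
  id         = 1:3,
  icd10_date = as.Date(c("2012-05-01", NA, "2015-03-10")),
  self_date  = as.Date(c("2010-01-15", NA, NA)),
  death_date = as.Date(c(NA, NA, "2014-11-02"))
)

# Hypothetical helper: earliest non-missing date per row across sources.
earliest_evidence <- function(df, source_cols) {
  # Sources without a matching column are silently ignored.
  source_cols <- intersect(source_cols, names(df))
  if (length(source_cols) == 0L) {
    return(rep(as.Date(NA), nrow(df)))
  }
  # Work on numeric day counts so pmin() can drop NA values per row.
  days <- unname(lapply(df[source_cols], as.numeric))
  first_day <- do.call(pmin, c(days, list(na.rm = TRUE)))
  as.Date(first_day, origin = "1970-01-01")
}

# "opcs4_date" is requested but absent from the table, so it is skipped.
toy$first_date <- earliest_evidence(
  toy,
  c("icd10_date", "self_date", "death_date", "opcs4_date")
)
print(toy[, c("id", "first_date")])
```

A participant with no evidence in any source keeps a missing `first_date`, which is what lets later steps treat them as never diagnosed.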

Minimal RAP Extraction Workflow

Run the following inside a UK Biobank RAP R session. Participant-level data should remain inside approved RAP projects and RAP-controlled storage.

library(UKBAnalytica)

dataset <- rap_find_dataset()
fields <- rap_list_fields()

meta <- ukb_metadata_setup(fields_df = fields)

ids <- get_variable_set("clinical_core", output = "field_id")

dt <- ukb_extract_fields(
  field_id = ids,
  metadata = meta,
  mode = "sync",
  strip_entity_prefix = FALSE
)

dt <- ukb_decode(dt, metadata = meta)
dt <- ukb_clean_missing(dt, action = "na")

ukb_snapshot(dt, "Clinical core extracted and cleaned")

Basic Survival Dataset

disease_defs <- get_predefined_diseases()[c("Hypertension", "Diabetes", "Stroke")]

analysis_dt <- build_survival_dataset(
  dt = analysis_input,
  disease_definitions = disease_defs,
  primary_disease = "Stroke",
  prevalent_sources = c("ICD10", "ICD9", "Self-report", "Death", "FirstOccurrence"),
  outcome_sources = c("ICD10", "ICD9", "Death", "FirstOccurrence"),
  baseline_col = "p53_i0",
  show_flow = TRUE
)

Here analysis_input should already contain baseline variables and the diagnosis/date columns required by the selected sources. The output contains one *_history column per disease, one *_incident column per disease, and for the primary disease a standard survival endpoint: outcome_status and outcome_surv_time.
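
To make the output columns concrete, the sketch below derives history, incident, and survival endpoint columns from a baseline date and an earliest-evidence date on simulated data. It is an illustration under stated assumptions, not the logic of build_survival_dataset(): the administrative censoring date, the rule that a diagnosis on the baseline date counts as history, the exclusion of prevalent cases, and the column names `baseline` and `first_date` are all hypothetical.

```r
# Simulated toy data only; all dates are made up.
toy <- data.frame(
  id         = 1:4,
  baseline   = as.Date(c("2008-01-01", "2008-06-01", "2009-01-01", "2009-03-01")),
  first_date = as.Date(c("2005-02-01", "2012-06-01", NA, "2020-03-01"))
)

# Assumption: a single administrative censoring date for everyone.
censor_date <- as.Date("2018-12-31")

has_date <- !is.na(toy$first_date)

# History: earliest evidence on or before baseline (assumed tie rule).
toy$stroke_history <- as.integer(has_date & toy$first_date <= toy$baseline)

# Incident: earliest evidence after baseline and within follow-up.
toy$stroke_incident <- as.integer(
  has_date & toy$first_date > toy$baseline & toy$first_date <= censor_date
)

# Follow-up ends at the event date for incident cases, else at censoring.
toy$end_date <- rep(censor_date, nrow(toy))
is_incident <- toy$stroke_incident == 1L
toy$end_date[is_incident] <- toy$first_date[is_incident]

# Assumption: prevalent cases are dropped before time-to-event analysis.
cohort <- toy[toy$stroke_history == 0L, ]
cohort$outcome_status <- cohort$stroke_incident
cohort$outcome_surv_time <- as.numeric(
  difftime(cohort$end_date, cohort$baseline, units = "days")
) / 365.25  # follow-up time in years

print(cohort[, c("id", "outcome_status", "outcome_surv_time")])
```

The resulting `outcome_status` and `outcome_surv_time` pair has the shape expected by standard time-to-event tools such as `survival::Surv()`.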

AI Agent Skills for UKB data analyses

UKBAnalytica ships a curated set of AI agent skills under inst/skills/UKBAnalytica_skills/. Each skill is a self-contained prompt document that a Claude Code agent (or any compatible AI assistant) can load to generate RAP-ready R scripts, plan analysis workflows, and interpret aggregate outputs — without ever seeing real participant-level data.

Skill	Phase	Coverage
`ukbsci-rap-extract`	P2	Discover UKB fields and execute extractions via `dx extract_dataset` (sync) or table-exporter (async)
`ukbsci-cohort`	P2	Define disease phenotypes from ICD-10/9, self-report, death, OPCS4, cancer registry, and build Cox-ready survival datasets
`ukbsci-workflow`	P2	End-to-end study orchestrator — produces a phased plan and calls sub-skills in order
`ukbsci-regression`	P3	Batch linear / logistic / Cox / GLM / negative-binomial / GAM regression; unified `run_regression()` interface; PH diagnostics; lag sensitivity; Fine-Gray competing risks; trend tests
`ukbsci-survival`	P3	Kaplan-Meier curves with risk tables and log-rank p-values
`ukbsci-baseline`	P3	Stratified Table 1 (baseline characteristics) via `tableone`
`ukbsci-propensity`	P4	Propensity scores, PSM, IPTW (ATE/ATT/ATC), balance diagnostics, weighted regression
`ukbsci-mediation`	P4	Causal mediation analysis (4-way decomposition); single and multi-mediator with sensitivity
`ukbsci-subgroup-sensitivity`	P4	Subgroup × interaction tests across Cox / logistic / linear / GLM / negbin; complete-case and early-event sensitivity filters
`ukbsci-imputation`	P4	Multiple imputation (mice) and Rubin's-rules pooling (mitools) with FMI diagnostics
`ukbsci-metabolomics`	P5	Nightingale NMR metabolite ORA — name mapping, classification, custom or MetaboAnalystR backend, dot/bar plot visualization
`ukbsci-proteomics`	P5	UKB Olink / UKB-PPP: ID mapping, GO/KEGG ORA, STRING PPI, community detection
`ukbsci-ml`	P5	End-to-end ML workflows (classification + survival): split, feature select, tune, fit, evaluate, SHAP
`ukbsci-preprocess`	P5	Variable cleaning, composite-variable builders (BP, air pollution, diet score), variable-set catalog
`ukbsci-plot`	P6	Manuscript figures: forest, volcano, calibration; shared neutral theme/palettes; multi-format save helper

Data privacy boundary

All skills enforce a strict script-generation-only boundary:

What the agent receives: column names, variable roles, intended analysis design, and aggregate outputs (flow counts, coefficient tables, model metrics, enrichment results, non-identifying figures).
What the agent must never receive: real UKB participant rows, eid values, exact dates, raw RAP fields, row-level predictions, SHAP matrices, screenshots, or log excerpts containing row-level values.

The workflow is: describe your schema → the agent generates an R script → you run the script inside RAP → you share only aggregate results back for interpretation. Real participant-level data should remain inside the approved RAP project and RAP-controlled storage at all times.
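
One way to carry out the "describe your schema" step is to summarise column names and classes without including any values. The following is a minimal sketch on simulated data; `describe_schema()` is a hypothetical helper and is not part of UKBAnalytica.

```r
# Simulated toy table standing in for an analysis dataset.
toy <- data.frame(
  age            = c(54, 61, 47),
  sex            = factor(c("F", "M", "F")),
  outcome_status = c(0L, 1L, 0L)
)

# Hypothetical helper: one row per column, names and classes only.
# No cell values are copied into the result.
describe_schema <- function(df) {
  data.frame(
    column = names(df),
    class  = unname(vapply(df, function(x) class(x)[1], character(1))),
    stringsAsFactors = FALSE
  )
}

schema <- describe_schema(toy)
print(schema)
```

Only the `schema` table would be shared with the agent; the underlying rows stay inside the RAP project.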

Supplementary Materials

Here we provide some learning materials for UK Biobank in which you may be interested:
