fix: add contributor churn analysis script by prajeeta15 · Pull Request #112 · hiero-hackers/analytics

prajeeta15 · 2026-04-08T09:40:57Z

Title : Add contributor churn analysis and progression visualization

issue : #76

Description :

This PR introduces a contributor churn analysis pipeline to measure how users progress through contribution difficulty levels:

Good First Issue → Beginner → Intermediate → Advanced

The implementation focuses on quantifying drop-offs, conversion rates, and progression patterns, along with a clear, high-level presentation of the data.

Changes Made :

Added new script:
src/hiero_analytics/run_contributor_churn_analysis.py
Implemented contributor progression tracking:

Identifies each contributor’s starting level and maximum level reached

Computes transition metrics across all levels

Calculates churn and conversion rates (e.g., % stopping at GFI, % reaching Advanced)

Added key metrics:

% of contributors who stop at Good First Issue

% progressing to Beginner, Intermediate, and Advanced

Overall “hit rate” for becoming Advanced contributors

Introduced two high-level visualizations:

Progression Funnel (GFI → Advanced conversion)

Retention Curve (contributors retained vs PR count)

Added a simple prediction framework:

Uses an 80/20 split (train/test) as proposed

Demonstrates how early contributor behavior (PR count, tenure) can predict progression to Advanced

Added fallback to mock data when GitHub API access is unavailable, ensuring the pipeline is always runnable

Impact :

Provides a clear, churn-focused view of contributor progression
Enables measurement of onboarding effectiveness (GFI → Advanced pipeline)
Lays groundwork for future ML-based contributor outcome prediction
Simplifies insights into where contributors drop off and where improvements are needed
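
As a rough illustration of the funnel metrics listed under Changes Made, the sketch below counts how many Good First Issue starters reach each level and what share of starters that is. The input shape (`levels_by_author`), the sample authors, and the function name `funnel` are hypothetical; the script itself works on a pandas DataFrame and may compute these differently.

```python
# Levels in ascending order of difficulty, as listed in the PR description.
LEVELS = ["Good First Issue", "Beginner", "Intermediate", "Advanced"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def funnel(levels_by_author):
    """Return (level, contributors_reaching_it, percent_of_starters) rows.

    Only contributors whose first PR is a Good First Issue are counted.
    """
    starters = [lv for lv in levels_by_author.values()
                if lv and lv[0] == "Good First Issue"]
    total = len(starters)
    # Highest rank each starter ever reached.
    max_ranks = [max(RANK[level] for level in lv) for lv in starters]
    rows = []
    for name in LEVELS:
        reached = sum(1 for r in max_ranks if r >= RANK[name])
        pct = round(100.0 * reached / total, 1) if total else 0.0
        rows.append((name, reached, pct))
    return rows


# Hypothetical sample: chronological PR levels per contributor.
sample = {
    "alice": ["Good First Issue"],
    "bob": ["Good First Issue", "Beginner"],
    "carol": ["Good First Issue", "Beginner", "Intermediate", "Advanced"],
    "dave": ["Beginner", "Advanced"],  # not a GFI starter, so excluded
}

for level, reached, pct in funnel(sample):
    print(f"{level}: {reached} ({pct}%)")
```

The share of starters who stop at Good First Issue is the complement of the Beginner row, since each row counts contributors at or above that level.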

exploreriii

Hi @prajeeta15
Thanks!
I wonder how many of the existing project utils we could reuse to abstract some of the logic in this script, so we have less to maintain over time as the repo scales.
Also, can you push the output PNGs?

Signed-off-by: Prajeeta Pal <prajeetapal@gmail.com>

prajeeta15 · 2026-04-13T04:52:40Z

New changes made:

I have updated the script ./src/hiero_analytics/run_contributor_churn_analysis.py to remove the mock data generation.
The script now exits with an error if it is run without a GITHUB_TOKEN, so any PNGs generated in future are based entirely on real data.
To generate the real-data versions of these charts, the script just needs to be run once a GITHUB_TOKEN is provided in the .env file or the environment.
80/20 split: the prediction model keeps an 80/20 train/test split and now focuses on predicting "Advanced" status from real contributor characteristics such as PR frequency and contributor tenure.
Replaced the custom matplotlib code with established project primitives.
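
The fail-fast behaviour described above could look roughly like the following. This is a sketch under assumptions, not the script's actual code: the helper name `require_github_token` and the error message are hypothetical, and it reads only the process environment (loading a .env file would need an extra step that is not shown).

```python
import os
import sys


def require_github_token(env=None):
    """Return the GitHub token, or exit with an error when it is missing."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("GITHUB_TOKEN is not set; refusing to run without real data.")
    return token


# Typical use at the top of an entrypoint (commented out so that importing
# this sketch never terminates the interpreter):
# token = require_github_token()
```

Passing the environment as a parameter keeps the check testable without touching the real `os.environ`.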

prajeeta15 · 2026-04-13T04:54:11Z

@exploreriii I don't have dev access to generate a GITHUB_TOKEN for this repo. Could you create one and share it with me privately so I can check once? Or could you run the script yourself and let me know if it throws any errors?

coderabbitai · 2026-04-13T04:57:04Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

Runs contributor churn/progression analysis strictly from real GitHub data (requires GITHUB_TOKEN). Fetches merged PRs via GraphQL, maps issue labels to difficulty levels, computes per-author progression and transition metrics, filters Good First Issue starters, and emits CSV plus funnel and retention charts.

Changes

Contributor churn analysis + plotting

Layer / File(s)	Summary
Data fetching & shapes `src/hiero_analytics/run_contributor_churn_analysis.py`	Script now requires `GITHUB_TOKEN`, fetches merged-PR difficulty data via GraphQL with caching, converts results to a pandas DataFrame (`prs_to_dataframe`), and adds a per-PR `level` column computed by `get_contributor_level(issue_labels)`.
Core analysis `src/hiero_analytics/analysis/contributor_churn.py`	Implements `compute_progression_stats(df)` and `compute_transition_metrics(df)`. Both handle empty input, rank difficulty levels (mapping `"Unknown"` → -1), deduplicate to one row per (author, pr_number) choosing highest-ranked entry, and compute per-author levels, first/last seen, pr_count, max_level, start_level, tenure_days, and aggregated transition counts. Removed earlier `run_prediction_analysis`.
Top-level orchestration `src/hiero_analytics/run_contributor_churn_analysis.py`	Orchestrates flow: resolve repo dirs, enforce token, fetch and convert PRs, drop rows missing author/pr_merged_at, sort, compute progression stats, filter `start_level == "Good First Issue"`, compute/print funnel and transition metrics, write `contributor_progression.csv`, and generate plots. Adds module entrypoint calling `run()`.
Plotting (visual constraints) `src/hiero_analytics/plotting/lines.py`, `.../bars.py`, `.../scatter.py`	All plotting functions now clamp y-axis lower bound to 0 via `ax.set_ylim(bottom=0)` (applied in `plot_line`/`plot_multiline`, `plot_bar`/`plot_stacked_bar`, and scatter/regression plotting).
Outputs / Files written `.../data/.../contributor_progression.csv`, `.../plots/contributor_churn_funnel.png`, `.../plots/contributor_retention.png`	Writes GFI-starters progression CSV and generates funnel and retention PNGs from computed funnel counts and retention-by-min-PR thresholds.
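
Two steps in the summary above are easy to misread: collapsing duplicates to one row per (author, pr_number) by keeping the highest-ranked level, and counting retention by minimum PR thresholds. A minimal pure-Python sketch of both follows. Only the `"Unknown"` → -1 mapping comes from the summary; the other rank values, the dict-based row shape, and both function names are assumptions, and the real code operates on a pandas DataFrame.

```python
# "Unknown" -> -1 is taken from the summary; the remaining ranks are assumed.
LEVEL_RANK = {
    "Unknown": -1,
    "Good First Issue": 0,
    "Beginner": 1,
    "Intermediate": 2,
    "Advanced": 3,
}


def dedupe_highest_level(rows):
    """Keep one row per (author, pr_number), preferring the highest-ranked level."""
    best = {}
    for row in rows:
        key = (row["author"], row["pr_number"])
        rank = LEVEL_RANK.get(row["level"], -1)
        if key not in best or rank > LEVEL_RANK.get(best[key]["level"], -1):
            best[key] = row
    return list(best.values())


def retention_by_min_prs(pr_counts, thresholds):
    """For each threshold k, count contributors with at least k merged PRs."""
    return {k: sum(1 for c in pr_counts if c >= k) for k in thresholds}


# Hypothetical rows: PR 1 appears twice with different levels.
rows = [
    {"author": "alice", "pr_number": 1, "level": "Unknown"},
    {"author": "alice", "pr_number": 1, "level": "Beginner"},
    {"author": "alice", "pr_number": 2, "level": "Good First Issue"},
    {"author": "bob", "pr_number": 3, "level": "Unknown"},
]
deduped = dedupe_highest_level(rows)

# PR counts per author after deduplication.
counts = {}
for row in deduped:
    counts[row["author"]] = counts.get(row["author"], 0) + 1

retention = retention_by_min_prs(counts.values(), thresholds=[1, 2, 3])
print(deduped)
print(retention)
```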

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix: add contributor churn analysis script' is somewhat vague and mixes a 'fix' prefix with an additive action. However, it does refer to the primary changeset focus: adding a new contributor churn analysis script. The title is partially related to the main change but uses imprecise framing.
Description check	✅ Passed	The description thoroughly documents the contributor churn analysis pipeline, including objectives, changes made (new script, progression tracking, visualizations, prediction framework), and impact. It is clearly related to the changeset and provides meaningful context about the implementation.
Docstring Coverage	✅ Passed	Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 317249d4-7bea-4225-8a3f-81a07a800a37

📥 Commits

Reviewing files that changed from the base of the PR and between a4e64e3 and 4a63842.

⛔ Files ignored due to path filters (2)

outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_churn_funnel.png is excluded by !**/*.png
outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_retention.png is excluded by !**/*.png

📒 Files selected for processing (1)

src/hiero_analytics/run_contributor_churn_analysis.py

MonaaEid

Nice work overall. I have one concern: the summary counts are based on max_level thresholds, which works for a funnel, but no transition metrics are computed across all levels. Is this the final approach you plan to go with?

prajeeta15 · 2026-04-17T15:40:31Z

Adding transition metrics to capture movement between levels across contributor journeys. This will show how contributors progress, not just how many reach each stage.

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: fede5a09-03a8-4428-9df8-55562f2ae243

📥 Commits

Reviewing files that changed from the base of the PR and between 4a63842 and 1adc053.

📒 Files selected for processing (2)

src/hiero_analytics/analysis/contributor_churn.py
src/hiero_analytics/run_contributor_churn_analysis.py

MonaaEid

lgtm, could you please do a commit with the updated PNGs? thanks

MonaaEid

lgtm. need a second opinion from @exploreriii @Adityarya11 @danielmarv

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: a7a7e511-9a44-4e19-8f6e-8276696cf7f6

📥 Commits

Reviewing files that changed from the base of the PR and between 5563a6c and bf78e40.

⛔ Files ignored due to path filters (2)

outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_churn_funnel.png is excluded by !**/*.png
outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_retention.png is excluded by !**/*.png

📒 Files selected for processing (2)

src/hiero_analytics/plotting/scatter.py
src/hiero_analytics/run_contributor_churn_analysis.py

exploreriii

Good start. Looking at this, I think what you have plotted is fairly accurate!
For this reason we can merge as is, then make some tweaks.

Alternatively:

I would strongly suggest simplifying this.

Right now your code captures:
1- progression (e.g. Beginner -> Advanced)
2- regression (e.g. Advanced -> Beginner)
3- prediction based on a hard-coded rule

I think we can delete the functionality for 2 and 3.
Advanced users with merged PRs have already demonstrated competence at that level. Even if they later move to Beginner or another level, it still shows they completed an Advanced PR and can work at that level.
The prediction can be done formally at a later date, by fitting a model rather than hard-coding one.

Recommended definition of progression that demonstrates the strength of the GFI pipeline:
A contributor progresses when they reach a higher level than any level they have previously reached. (This will include people who never did a GFI.)

or
A GFI starter progresses when, after their first Good First Issue PR, they reach a level with a higher rank than any level they previously reached. (possibly my preference?)

Core metrics to report
GFI starters
Progressed to Beginner+
Progressed to Intermediate+
Progressed to Advanced

You do log output, but it is aggregated, so there is no way to verify that the names being recorded are as expected. I would suggest adding more logs, or writing to a CSV and plotting from that.

What would you like to do?

prajeeta15 · 2026-05-01T14:44:55Z

I agree with simplifying the logic to focus exclusively on progression. Tracking the "highest level reached" is a more robust indicator of contributor growth than handling regressions or using hard-coded predictions.

My recommendation for the next steps:

Adopt the GFI-centric definition: "A GFI starter progresses when, after their first Good First Issue PR, they reach a level with a higher rank than any level they previously reached." This directly measures the pipeline's effectiveness.
Implement CSV Logging: Before aggregation, I would like to write the raw contributor-level transitions to a CSV. This allows us to audit specific usernames and verify that the "Progressed to X" counts are accurate.
Track Max Rank: I will simplify the internal state to only store the max_rank_achieved for each contributor to determine if a new PR constitutes progression.

I would like to proceed with these logic updates and the addition of CSV output for better data transparency.
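
The three steps above can be sketched as follows: track only the highest level reached per contributor, record an event each time a GFI starter exceeds it, and write those events to CSV before any aggregation. This is an illustration under assumptions, not the implemented logic: the tuple shape of `prs`, the function names, the column names, and the reading of "GFI starter" as "first difficulty-labelled PR was a Good First Issue" are all hypothetical.

```python
import csv
import io

RANK = {"Good First Issue": 0, "Beginner": 1, "Intermediate": 2, "Advanced": 3}


def progression_events(prs):
    """Return (author, merged_at, from_level, to_level) for each new maximum.

    `prs` is a list of (author, merged_at, level) tuples sorted by merged_at.
    Levels missing from RANK (for example "Unknown") are skipped.
    """
    seen = set()    # authors whose first ranked PR has been processed
    max_level = {}  # GFI starters only: highest level reached so far
    events = []
    for author, merged_at, level in prs:
        if level not in RANK:
            continue
        if author not in seen:
            seen.add(author)
            if level == "Good First Issue":
                max_level[author] = level
            continue
        # Regressions and repeats of the same level are ignored.
        if author in max_level and RANK[level] > RANK[max_level[author]]:
            events.append((author, merged_at, max_level[author], level))
            max_level[author] = level
    return events


def write_events_csv(events, fh):
    """Write one row per progression event so usernames can be audited."""
    writer = csv.writer(fh)
    writer.writerow(["author", "merged_at", "from_level", "to_level"])
    writer.writerows(events)


# Hypothetical sample data.
prs = [
    ("alice", "2026-01-01", "Good First Issue"),
    ("bob", "2026-01-02", "Beginner"),        # bob is not a GFI starter
    ("alice", "2026-01-05", "Intermediate"),  # new maximum for alice
    ("alice", "2026-01-09", "Beginner"),      # regression, ignored
    ("bob", "2026-01-10", "Advanced"),
    ("alice", "2026-02-01", "Advanced"),      # new maximum for alice
]
events = progression_events(prs)
buffer = io.StringIO()
write_events_csv(events, buffer)
print(buffer.getvalue())
```

Aggregated "Progressed to X" counts can then be derived from the CSV rows, which keeps every count traceable to specific usernames and PR dates.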

exploreriii · 2026-05-01T14:54:46Z

Yep, sounds great!

coderabbitai · 2026-05-04T05:03:49Z

CodeRabbit chat interactions are restricted to organization members for this repository. Ask an organization member to interact with CodeRabbit, or set chat.allow_non_org_members: true in your configuration.

exploreriii

This logged output does not make sense given what you plotted:
--- Contributor Churn Analysis ---
GFI Starters: 132 (100.0%)
Progressed to Beginner+: 39 (29.5%)
Progressed to Intermediate+: 24 (18.2%)
Progressed to Advanced: 6 (4.5%)

--- Level Transition Metrics ---
from              to            count
Beginner          Advanced      1
Beginner          Intermediate  12
Good First Issue  Advanced      1
Good First Issue  Beginner      25
Good First Issue  Intermediate  14
Intermediate      Advanced      10

GFI to advanced = 1, why plot 4?
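
One way to cross-check the logged numbers is to recompute the percentages and compare the funnel against the transition table. The sketch below uses only the figures quoted above; the helper names are hypothetical. Whether the two tables should agree depends on whether the transition metrics are restricted to GFI starters, which the log does not state.

```python
# Figures copied from the logged output above.
funnel = {"GFI Starters": 132, "Beginner+": 39, "Intermediate+": 24, "Advanced": 6}
transitions = [
    ("Beginner", "Advanced", 1),
    ("Beginner", "Intermediate", 12),
    ("Good First Issue", "Advanced", 1),
    ("Good First Issue", "Beginner", 25),
    ("Good First Issue", "Intermediate", 14),
    ("Intermediate", "Advanced", 10),
]


def pct(count, total):
    """Percentage rounded to one decimal place, as in the log."""
    return round(100 * count / total, 1)


def arrivals(rows, level):
    """Total transitions that end at `level`."""
    return sum(count for _, to, count in rows if to == level)


def departures(rows, level):
    """Total transitions that start at `level`."""
    return sum(count for frm, _, count in rows if frm == level)


total = funnel["GFI Starters"]
for name, count in funnel.items():
    print(f"{name}: {count} ({pct(count, total)}%)")

print("Transitions into Advanced:", arrivals(transitions, "Advanced"))
print("Transitions out of Good First Issue:", departures(transitions, "Good First Issue"))
```

The percentages reproduce the log, while the transition table records 12 arrivals at Advanced against 6 in the funnel, so the two outputs are evidently counting different populations or events.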

There are also a few Unknowns:
prajeeta15,"['Good First Issue', 'Unknown', 'Unknown', 'Intermediate', 'Intermediate']",2025-10-22 21:59:46+00:00,2025-12-29 10:26:56+00:00,5,Intermediate,Good First Issue,67
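
A single row like this can be checked for internal consistency. The sketch below parses the row and recomputes the derived fields. The column names and their order are an assumption, inferred from the per-author fields CodeRabbit lists for the progression CSV, and the "first non-Unknown level" rule for the start level is also assumed here.

```python
import ast
import csv
from datetime import datetime

# Assumed column order; the CSV header itself is not shown in this thread.
COLUMNS = ["author", "levels", "first_seen", "last_seen",
           "pr_count", "max_level", "start_level", "tenure_days"]
RANK = {"Unknown": -1, "Good First Issue": 0, "Beginner": 1,
        "Intermediate": 2, "Advanced": 3}

ROW = """prajeeta15,"['Good First Issue', 'Unknown', 'Unknown', 'Intermediate', 'Intermediate']",2025-10-22 21:59:46+00:00,2025-12-29 10:26:56+00:00,5,Intermediate,Good First Issue,67"""

record = dict(zip(COLUMNS, next(csv.reader([ROW]))))
levels = ast.literal_eval(record["levels"])  # list stored as its Python repr

first_seen = datetime.fromisoformat(record["first_seen"])
last_seen = datetime.fromisoformat(record["last_seen"])

known = [level for level in levels if level != "Unknown"]
recomputed = {
    "pr_count": len(levels),
    "unknown_prs": levels.count("Unknown"),
    "start_level": known[0] if known else "Unknown",
    "max_level": max(levels, key=RANK.get),
    "tenure_days": (last_seen - first_seen).days,  # whole days only
}
print(recomputed)
```

For this row the recomputed pr_count, start_level, max_level and tenure_days all match the stored values, and two of the five PRs carry no recognised difficulty label.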

Has exploreriii never completed a GFI?

prajeeta15 · 2026-05-13T13:31:27Z

contributor_transitions.png: shows the specific upward paths contributors took.
avg_tenure_by_level.png: shows how many days contributors stayed active, grouped by their highest level.
You have to run the analysis script with your GITHUB_TOKEN to generate them.
@exploreriii, can you generate the PNGs with the token and let me know if there is any error log or mismatch in the graphs, so I can understand what's happening and look into it?

exploreriii

Thank you very much @prajeeta15
You have added some new functionality and thank you for making the corrections.
I have some questions about how some users go from Beginner -> Advanced, etc., and not GFI -> Advanced, so I would like to learn more about how the pipeline sees their first issue-linked, difficulty-labeled, merged PR, given that the start level is defined as the first non-Unknown level. Or perhaps I'm not fully understanding that chart, as it shows different data from the contributor churn funnel.
In this case, we can merge and then open a new issue to investigate.

exploreriii

Could you please correct the DCO signing? Then this can be merged.

Subsequent issues to investigate, either in this PR or a follow-up PR:
Mounil2005, aceppaluni, emiliyank, exploreriii, Akshat8510 <-- I think these are shown as starting from Beginner/Intermediate but actually started from Unknown (before we had the issue requirement bots, it seems).
It looks like some issues are being labelled after closure, which bypasses the issue guard bots. Maybe we should skip these labels in that case, as it was probably a tidy-up of issues by difficulty.

fix: add contributor churn analysis script

925616d

exploreriii requested a review from Adityarya11 April 8, 2026 09:46

Adityarya11 previously approved these changes Apr 8, 2026

Adityarya11 reviewed Apr 8, 2026

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

exploreriii reviewed Apr 8, 2026

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

prajeeta15 marked this pull request as draft April 9, 2026 05:18

exploreriii requested changes Apr 11, 2026

prajeeta15 added 2 commits April 13, 2026 10:12

Merge remote-tracking branch 'upstream/main' into churn

040e42e

fix: resolving mock data and plots

4a63842

Signed-off-by: Prajeeta Pal <prajeetapal@gmail.com>

prajeeta15 dismissed Adityarya11’s stale review via 4a63842 April 13, 2026 04:45

prajeeta15 marked this pull request as ready for review April 13, 2026 04:54

coderabbitai Bot reviewed Apr 13, 2026

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

exploreriii reviewed Apr 13, 2026

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

exploreriii requested review from Adityarya11 and MonaaEid April 14, 2026 20:15

MonaaEid reviewed Apr 15, 2026

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

MonaaEid reviewed Apr 15, 2026

Comment thread outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_retention.png

prajeeta15 and others added 2 commits April 17, 2026 21:13

Merge branch 'hiero-hackers:main' into churn

0c835ec

fix: refactoring transition metrics

1adc053

coderabbitai Bot reviewed Apr 17, 2026

Comment thread src/hiero_analytics/analysis/contributor_churn.py Outdated

fix: refactor transition metrics

c3ec7be

prajeeta15 requested review from MonaaEid and exploreriii April 22, 2026 04:15

MonaaEid reviewed Apr 24, 2026

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

Merge branch 'hiero-hackers:main' into churn

5563a6c

fix:update contributor churn charts

fd104a9

prajeeta15 requested a review from MonaaEid April 27, 2026 08:51

MonaaEid previously approved these changes Apr 27, 2026

exploreriii reviewed Apr 27, 2026

Comment thread outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_retention.png

exploreriii marked this pull request as draft April 29, 2026 12:20

Update churn analysis + add retention and funnel charts

bf78e40

prajeeta15 dismissed MonaaEid’s stale review via bf78e40 April 30, 2026 05:56

prajeeta15 marked this pull request as ready for review April 30, 2026 05:57

coderabbitai Bot reviewed Apr 30, 2026

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

remove mock data

317b45d

exploreriii reviewed Apr 30, 2026

exploreriii marked this pull request as draft May 1, 2026 12:29

modifying progression logic

bee801e

prajeeta15 marked this pull request as ready for review May 4, 2026 05:00

prajeeta15 requested a review from exploreriii May 4, 2026 05:00

exploreriii requested changes May 5, 2026

exploreriii marked this pull request as draft May 5, 2026 10:17

fix: cleaned transitions

47acd75

prajeeta15 marked this pull request as ready for review May 19, 2026 14:45

exploreriii approved these changes Jun 3, 2026

exploreriii reviewed Jun 3, 2026
