fix: add contributor churn analysis script #112

Open
prajeeta15 wants to merge 12 commits into hiero-hackers:main from prajeeta15:churn

Conversation

@prajeeta15

Contributor

Title : Add contributor churn analysis and progression visualization

issue : #76

Description :

This PR introduces a contributor churn analysis pipeline to measure how users progress through contribution difficulty levels:

Good First Issue → Beginner → Intermediate → Advanced

The implementation focuses on quantifying drop-offs, conversion rates, and progression patterns, along with a clear, high-level presentation of the data.

Changes Made :

  • Added new script:
    src/hiero_analytics/run_contributor_churn_analysis.py
  • Implemented contributor progression tracking:
      • Identifies each contributor’s starting level and maximum level reached
      • Computes transition metrics across all levels
      • Calculates churn and conversion rates (e.g., % stopping at GFI, % reaching Advanced)
  • Added key metrics:
      • % of contributors who stop at Good First Issue
      • % progressing to Beginner, Intermediate, and Advanced
      • Overall “hit rate” for becoming Advanced contributors
  • Introduced two high-level visualizations:
      • Progression Funnel (GFI → Advanced conversion)
      • Retention Curve (contributors retained vs PR count)
  • Added a simple prediction framework:
      • Uses an 80/20 split (train/test) as proposed
      • Demonstrates how early contributor behavior (PR count, tenure) can predict progression to Advanced
  • Added fallback to mock data when GitHub API access is unavailable, ensuring the pipeline is always runnable
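
As a rough illustration of the "contributors retained vs PR count" idea behind the retention curve, here is a minimal sketch. It assumes the input is simply one author login per merged PR; the function name `retention_by_min_prs` and the sample data are hypothetical and not taken from the script in this PR.

```python
from collections import Counter


def retention_by_min_prs(pr_authors, thresholds):
    """Return {n: number of contributors with at least n merged PRs}."""
    pr_counts = Counter(pr_authors)  # author login -> merged PR count
    return {
        n: sum(1 for count in pr_counts.values() if count >= n)
        for n in thresholds
    }


# Hypothetical sample: one entry per merged PR, holding the author's login.
sample_authors = ["alice", "alice", "alice", "bob", "carol", "carol"]
curve = retention_by_min_prs(sample_authors, thresholds=[1, 2, 3])
print(curve)
```

Plotting the resulting mapping (threshold on the x-axis, contributor count on the y-axis) would give one possible form of the retention curve.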

Impact :

  • Provides a clear, churn-focused view of contributor progression
  • Enables measurement of onboarding effectiveness (GFI → Advanced pipeline)
  • Lays groundwork for future ML-based contributor outcome prediction
  • Simplifies insights into where contributors drop off and where improvements are needed

@exploreriii exploreriii requested a review from Adityarya11 April 8, 2026 09:46
Adityarya11
Adityarya11 previously approved these changes Apr 8, 2026
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
@prajeeta15 prajeeta15 marked this pull request as draft April 9, 2026 05:18

@exploreriii exploreriii left a comment

Contributor

Hi @prajeeta15
Thanks!
I wonder how much of the existing project utils we can use to abstract some of the logic in this script, so that we have less to maintain over time as the repo scales.
Also, can you push the output PNGs?

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
@prajeeta15

Contributor Author

New changes made:

  1. Updated the script ./src/hiero_analytics/run_contributor_churn_analysis.py to remove the mock data generation.
  2. The script now exits with an error if it is run without a GITHUB_TOKEN, so any PNGs generated from now on are based entirely on real data.
  3. To generate the "real data" versions of these charts, we just need to run the script once a GITHUB_TOKEN is provided in the .env file or the environment.
  4. 80/20 split: kept an 80/20 train/test split for the prediction model, which now focuses on predicting "Advanced" status from real-world characteristics such as PR frequency and contributor tenure.
  5. Replaced the custom matplotlib code with established project primitives.
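
For readers following along, a fail-fast token check of the kind described in item 2 could look roughly like the sketch below. This is an assumption about the shape of such a check, not the code in this PR: the helper `require_github_token` and the exception `MissingTokenError` are hypothetical names, and loading a `.env` file is left out.

```python
import os


class MissingTokenError(RuntimeError):
    """Raised when no GitHub token is available (hypothetical helper type)."""


def require_github_token(env=None):
    """Return the GitHub token from the environment, or raise if it is absent."""
    env = os.environ if env is None else env
    token = (env.get("GITHUB_TOKEN") or "").strip()
    if not token:
        raise MissingTokenError(
            "GITHUB_TOKEN is not set; refusing to run without real GitHub data."
        )
    return token
```

A script entrypoint could catch `MissingTokenError` and call `sys.exit(str(exc))` so that the run ends with a non-zero status instead of producing charts.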

@prajeeta15

Contributor Author

@exploreriii I don't have the dev access needed to generate a GITHUB_TOKEN for this repo. Could you create one and share it with me privately so I can check once? Alternatively, could you run the script yourself and let me know if it throws any errors?

@prajeeta15 prajeeta15 marked this pull request as ready for review April 13, 2026 04:54
@coderabbitai

coderabbitai Bot commented Apr 13, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Runs contributor churn/progression analysis strictly from real GitHub data (requires GITHUB_TOKEN). Fetches merged PRs via GraphQL, maps issue labels to difficulty levels, computes per-author progression and transition metrics, filters Good First Issue starters, and emits CSV plus funnel and retention charts.
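
The walkthrough mentions mapping issue labels to difficulty levels via `get_contributor_level(issue_labels)`. The thread does not show that function, so the following is only a guess at what such a mapping might look like. The label spellings, the function name `level_from_labels`, and the choice to keep the highest level when several labels match are all assumptions.

```python
LEVELS = ["Good First Issue", "Beginner", "Intermediate", "Advanced"]

# Hypothetical label spellings; the real repository labels may differ.
LABEL_TO_LEVEL = {
    "good first issue": "Good First Issue",
    "beginner": "Beginner",
    "intermediate": "Intermediate",
    "advanced": "Advanced",
}


def level_from_labels(issue_labels):
    """Return the highest difficulty level among the labels, or "Unknown"."""
    found = []
    for label in issue_labels:
        key = label.strip().lower()
        if key in LABEL_TO_LEVEL:
            found.append(LABEL_TO_LEVEL[key])
    if not found:
        return "Unknown"
    # Assumption: when several difficulty labels are present, keep the highest.
    return max(found, key=LEVELS.index)
```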

Changes

Contributor churn analysis + plotting

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data fetching & shapes | src/hiero_analytics/run_contributor_churn_analysis.py | Script now requires GITHUB_TOKEN, fetches merged-PR difficulty data via GraphQL with caching, converts results to a pandas DataFrame (prs_to_dataframe), and adds a per-PR level column computed by get_contributor_level(issue_labels). |
| Core analysis | src/hiero_analytics/analysis/contributor_churn.py | Implements compute_progression_stats(df) and compute_transition_metrics(df). Both handle empty input, rank difficulty levels (mapping "Unknown" → -1), deduplicate to one row per (author, pr_number) choosing highest-ranked entry, and compute per-author levels, first/last seen, pr_count, max_level, start_level, tenure_days, and aggregated transition counts. Removed earlier run_prediction_analysis. |
| Top-level orchestration | src/hiero_analytics/run_contributor_churn_analysis.py | Orchestrates flow: resolve repo dirs, enforce token, fetch and convert PRs, drop rows missing author/pr_merged_at, sort, compute progression stats, filter start_level == "Good First Issue", compute/print funnel and transition metrics, write contributor_progression.csv, and generate plots. Adds module entrypoint calling run(). |
| Plotting (visual constraints) | src/hiero_analytics/plotting/lines.py, .../bars.py, .../scatter.py | All plotting functions now clamp y-axis lower bound to 0 via ax.set_ylim(bottom=0) (applied in plot_line/plot_multiline, plot_bar/plot_stacked_bar, and scatter/regression plotting). |
| Outputs / Files written | .../data/.../contributor_progression.csv, .../plots/contributor_churn_funnel.png, .../plots/contributor_retention.png | Writes GFI-starters progression CSV and generates funnel and retention PNGs from computed funnel counts and retention-by-min-PR thresholds. |
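
The "Core analysis" row describes ranking levels (with "Unknown" mapped to -1) and deduplicating to one row per (author, pr_number) by keeping the highest-ranked entry. A plain-Python sketch of that idea follows. It is not the pandas implementation from the PR; the ranks for the four named levels and the row shape (dicts with `author`, `pr_number`, `level`) are assumptions.

```python
# "Unknown" -> -1 comes from the summary above; the other ranks are assumed.
LEVEL_RANK = {
    "Unknown": -1,
    "Good First Issue": 0,
    "Beginner": 1,
    "Intermediate": 2,
    "Advanced": 3,
}


def dedupe_prs(rows):
    """Keep one row per (author, pr_number), preferring the highest-ranked level."""
    best = {}
    for row in rows:
        key = (row["author"], row["pr_number"])
        rank = LEVEL_RANK.get(row["level"], -1)
        if key not in best or rank > LEVEL_RANK.get(best[key]["level"], -1):
            best[key] = row
    return list(best.values())


# Hypothetical rows: PR 1 is linked to two issues with different labels.
rows = [
    {"author": "alice", "pr_number": 1, "level": "Unknown"},
    {"author": "alice", "pr_number": 1, "level": "Beginner"},
    {"author": "bob", "pr_number": 2, "level": "Good First Issue"},
]
deduped = dedupe_prs(rows)
```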
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix: add contributor churn analysis script' is somewhat vague and mixes a 'fix' prefix with an additive action. However, it does refer to the primary changeset focus: adding a new contributor churn analysis script. The title is partially related to the main change but uses imprecise framing. |
| Description check | ✅ Passed | The description thoroughly documents the contributor churn analysis pipeline, including objectives, changes made (new script, progression tracking, visualizations, prediction framework), and impact. It is clearly related to the changeset and provides meaningful context about the implementation. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 317249d4-7bea-4225-8a3f-81a07a800a37

📥 Commits

Reviewing files that changed from the base of the PR and between a4e64e3 and 4a63842.

⛔ Files ignored due to path filters (2)
  • outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_churn_funnel.png is excluded by !**/*.png
  • outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_retention.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • src/hiero_analytics/run_contributor_churn_analysis.py

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

@MonaaEid MonaaEid left a comment

Contributor

Nice work overall, I have a concern...the summary counts are based on max_level thresholds, which works for a funnel, but there are no computed transition metrics across all levels.. is this the final approach you plan to go with?

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
@prajeeta15

Contributor Author

> Nice work overall, I have a concern...the summary counts are based on max_level thresholds, which works for a funnel, but there are no computed transition metrics across all levels.. is this the final approach you plan to go with?

I'm adding transition metrics to capture movement between levels across contributor journeys. This will show how contributors progress, not just how many reach each stage.
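
One way to picture such transition metrics is to count (from, to) pairs between consecutive known levels in each contributor's chronological PR history. This is a sketch under assumptions, not the PR's `compute_transition_metrics`: the function name `count_transitions`, the input shape, and the choice to skip "Unknown" entries and repeated levels are all illustrative.

```python
from collections import Counter


def count_transitions(levels_by_author):
    """Count (from_level, to_level) pairs between consecutive known levels."""
    counts = Counter()
    for levels in levels_by_author.values():
        # Assumption: PRs without a difficulty label do not break a journey.
        known = [level for level in levels if level != "Unknown"]
        for previous, current in zip(known, known[1:]):
            if previous != current:
                counts[(previous, current)] += 1
    return counts


# Hypothetical journeys, ordered by merge date.
journeys = {
    "alice": ["Good First Issue", "Unknown", "Beginner", "Beginner", "Intermediate"],
    "bob": ["Good First Issue", "Beginner"],
}
transitions = count_transitions(journeys)
```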

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: fede5a09-03a8-4428-9df8-55562f2ae243

📥 Commits

Reviewing files that changed from the base of the PR and between 4a63842 and 1adc053.

📒 Files selected for processing (2)
  • src/hiero_analytics/analysis/contributor_churn.py
  • src/hiero_analytics/run_contributor_churn_analysis.py

Comment thread src/hiero_analytics/analysis/contributor_churn.py Outdated

@MonaaEid MonaaEid left a comment

Contributor

lgtm, could you please do a commit with the updated PNGs? thanks

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
@prajeeta15 prajeeta15 requested a review from MonaaEid April 27, 2026 08:51
MonaaEid
MonaaEid previously approved these changes Apr 27, 2026

@MonaaEid MonaaEid left a comment

Contributor

lgtm. need a second opinion from @exploreriii @Adityarya11 @danielmarv

@exploreriii exploreriii marked this pull request as draft April 29, 2026 12:20
@prajeeta15 prajeeta15 marked this pull request as ready for review April 30, 2026 05:57

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: a7a7e511-9a44-4e19-8f6e-8276696cf7f6

📥 Commits

Reviewing files that changed from the base of the PR and between 5563a6c and bf78e40.

⛔ Files ignored due to path filters (2)
  • outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_churn_funnel.png is excluded by !**/*.png
  • outputs/charts/repo/hiero-ledger_hiero-sdk-python/contributor_retention.png is excluded by !**/*.png
📒 Files selected for processing (2)
  • src/hiero_analytics/plotting/scatter.py
  • src/hiero_analytics/run_contributor_churn_analysis.py

Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated

@exploreriii exploreriii left a comment

Contributor

Good start. Looking at this, I think what you have plotted is fairly accurate!
For this reason we can merge as is, then make some tweaks.

Alternatively:

I would strongly suggest simplifying this.

Right now your code captures:
1- progression (e.g. beginner->advanced)
2- regression (e.g. advanced->beginner)
3- prediction based on a hard-coded rule

I think we can delete the functionality for 2 & 3.
Advanced users with merged PRs have already demonstrated competence at that level. Even if they later move to Beginner or another level, it still shows they completed an Advanced PR and can work at that level.
The prediction: we can do this formally at a later date, by fitting a model rather than hard-coding one.

Recommended definition of progression that demonstrates the strength of the GFI pipeline:
A contributor progresses when they reach a higher level than any level they have previously reached. (This also includes people who never did a GFI.)

or
A GFI starter progresses when, after their first Good First Issue PR, they reach a level with a higher rank than any level they previously reached. (possibly my preference?)

Core metrics to report
GFI starters
Progressed to Beginner+
Progressed to Intermediate+
Progressed to Advanced
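
To make the second definition and the core metrics concrete, here is a small sketch. It assumes a "GFI starter" is a contributor whose first PR with a known level is a Good First Issue, and that each contributor's levels are listed in merge order; the function name `core_metrics` and the sample journeys are hypothetical.

```python
RANK = {"Good First Issue": 0, "Beginner": 1, "Intermediate": 2, "Advanced": 3}


def core_metrics(levels_by_author):
    """Compute the four funnel counts from each GFI starter's highest rank."""
    max_rank_by_starter = {}
    for author, levels in levels_by_author.items():
        known = [RANK[level] for level in levels if level in RANK]
        # Assumption: a GFI starter's first known-level PR is a Good First Issue.
        if known and known[0] == RANK["Good First Issue"]:
            max_rank_by_starter[author] = max(known)

    def reached(rank):
        return sum(1 for r in max_rank_by_starter.values() if r >= rank)

    return {
        "GFI starters": len(max_rank_by_starter),
        "Progressed to Beginner+": reached(RANK["Beginner"]),
        "Progressed to Intermediate+": reached(RANK["Intermediate"]),
        "Progressed to Advanced": reached(RANK["Advanced"]),
    }


# Hypothetical journeys, ordered by merge date.
sample = {
    "alice": ["Good First Issue", "Beginner", "Advanced"],
    "bob": ["Good First Issue"],
    "carol": ["Beginner", "Advanced"],  # not a GFI starter
    "dave": ["Unknown", "Good First Issue", "Intermediate", "Beginner"],
}
metrics = core_metrics(sample)
```

Because only the highest rank matters, a later drop back to a lower level (as in the hypothetical "dave" journey) does not reduce any count, which matches the suggestion to ignore regression.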

You do log output, but it is aggregated, so there is no way to verify that the names being recorded, etc., are as expected. I would suggest adding more logs, or writing to a CSV first and plotting from that.

What would you like to do?

Comment thread src/hiero_analytics/analysis/contributor_churn.py
Comment thread src/hiero_analytics/analysis/contributor_churn.py Outdated
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py Outdated
Comment thread src/hiero_analytics/run_contributor_churn_analysis.py
Comment thread src/hiero_analytics/analysis/contributor_churn.py Outdated
@exploreriii exploreriii marked this pull request as draft May 1, 2026 12:29
@prajeeta15

Contributor Author

> Good start, looking at this, I think it is fairly accurate what you have plotted! For this reason we can merge as is, then make some tweaks.
>
> Alternatively:
>
> I would like to suggest strongly simplifying this.
>
> Right now your code captures:
> 1- progression (e.g. beginner->advanced)
> 2- regression (e.g. advanced-->beginner)
> 3- prediction based on a hard coded rule
>
> I think we can delete functionality doing 2 & 3
> Advanced users with merged PRs have already demonstrated competence at that level, even if they change to beginner or etc, it still shows they completed an advanced PR and can work at that level
> The prediction -- we can do this formally at a later date, by fitting and not hard coding a model
>
> Recommended definition of progression that demonstrates the strength of the GFI pipeline:
> A contributor progresses when they reach a higher level than any level they have previously reached. (this will include people that did not ever do a GFI)
>
> or
> A GFI starter progresses when, after their first Good First Issue PR, they reach a level with a higher rank than any level they previously reached. (possibly my preference?)
>
> Core metrics to report
> GFI starters
> Progressed to Beginner+
> Progressed to Intermediate+
> Progressed to Advanced
>
> You do log output, but the output is aggregated - there is no way to verify the names being recorded/etc are as expected - would suggest adding more logs or writing to a CSV before plotting from that.
>
> What would you like to do?

I agree with simplifying the logic to focus exclusively on progression. Tracking the "highest level reached" is a more robust indicator of contributor growth than handling regressions or using hard-coded predictions.

My recommendation for the next steps:

  • Adopt the GFI-centric definition: "A GFI starter progresses when, after their first Good First Issue PR, they reach a level with a higher rank than any level they previously reached." This directly measures the pipeline's effectiveness.
  • Implement CSV Logging: Before aggregation, I would like to write the raw contributor-level transitions to a CSV. This allows us to audit specific usernames and verify that the "Progressed to X" counts are accurate.
  • Track Max Rank: I will simplify the internal state to only store the max_rank_achieved for each contributor to determine if a new PR constitutes progression.

I would like to proceed with these logic updates and the addition of CSV output for better data transparency.
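
A per-contributor audit file of the kind proposed above could be written with the standard `csv` module, roughly as sketched here. The column set, the function name `write_audit_csv`, and the record shape are assumptions for illustration; the actual CSV in this PR may use different columns.

```python
import csv
import io

# Hypothetical column set for the audit file.
AUDIT_FIELDS = ["author", "levels", "start_level", "max_level", "pr_count"]


def write_audit_csv(records, fileobj):
    """Write one row per contributor so aggregate counts can be checked by name."""
    writer = csv.DictWriter(fileobj, fieldnames=AUDIT_FIELDS)
    writer.writeheader()
    for rec in sorted(records, key=lambda r: r["author"]):
        levels = rec["levels"]
        writer.writerow({
            "author": rec["author"],
            "levels": " > ".join(levels),
            "start_level": levels[0] if levels else "Unknown",
            "max_level": rec["max_level"],
            "pr_count": len(levels),
        })


# Writing to an in-memory buffer here; a real run would pass an open file.
buffer = io.StringIO()
write_audit_csv(
    [{"author": "alice", "levels": ["Good First Issue", "Beginner"], "max_level": "Beginner"}],
    buffer,
)
```

Writing the rows before aggregating means every "Progressed to X" count can be traced back to specific usernames.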

@exploreriii

Contributor

Yep, sounds great!

@prajeeta15 prajeeta15 marked this pull request as ready for review May 4, 2026 05:00
@prajeeta15 prajeeta15 requested a review from exploreriii May 4, 2026 05:00
@coderabbitai

coderabbitai Bot commented May 4, 2026


CodeRabbit chat interactions are restricted to organization members for this repository. Ask an organization member to interact with CodeRabbit, or set chat.allow_non_org_members: true in your configuration.

@exploreriii exploreriii left a comment

Contributor

This logged output does not make sense given what you plotted:
--- Contributor Churn Analysis ---
GFI Starters: 132 (100.0%)
Progressed to Beginner+: 39 (29.5%)
Progressed to Intermediate+: 24 (18.2%)
Progressed to Advanced: 6 (4.5%)

--- Level Transition Metrics ---
from              to            count
Beginner          Advanced      1
Beginner          Intermediate  12
Good First Issue  Advanced      1
Good First Issue  Beginner      25
Good First Issue  Intermediate  14
Intermediate      Advanced      10

GFI to Advanced = 1, so why plot 4?

There are also a few Unknowns:
prajeeta15,"['Good First Issue', 'Unknown', 'Unknown', 'Intermediate', 'Intermediate']",2025-10-22 21:59:46+00:00,2025-12-29 10:26:56+00:00,5,Intermediate,Good First Issue,67

Has exploreriii never completed a GFI?

@exploreriii exploreriii marked this pull request as draft May 5, 2026 10:17
@prajeeta15

prajeeta15 commented May 13, 2026

Contributor Author

contributor_transitions.png: shows the specific upward paths contributors took.
avg_tenure_by_level.png: shows how many days contributors stayed active, grouped by their highest level.
You have to run the analysis script with your GITHUB_TOKEN to generate them.
@exploreriii could you generate the PNGs with the token and let me know if there is any error log or mismatch in the graphs, so I can understand what's going on and look into it?

@prajeeta15 prajeeta15 marked this pull request as ready for review May 19, 2026 14:45

@exploreriii exploreriii left a comment

Contributor

Thank you very much @prajeeta15
You have added some new functionality, and thank you for making the corrections.
I have some questions about how we are getting some users going from beginner->advanced, etc., and not gfi->advanced. I would like to learn more about the pipeline: as I understand it, it looks at a contributor's first issue-linked, difficulty-labeled, merged PR, and the start level is defined as the first non-Unknown level. Or perhaps I'm not fully understanding that chart, as it shows different data from the contributor churn funnel.
In this case, we can merge and then open a new issue to investigate.
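
As a toy illustration only: a funnel built from each contributor's highest level and a table of step-by-step transitions can legitimately show different numbers, because a contributor who reaches Advanced via Intermediate never produces a direct GFI → Advanced step. Whether this explains the difference seen in this PR is not confirmed by the thread; the journeys below are invented.

```python
from collections import Counter

RANK = {"Good First Issue": 0, "Beginner": 1, "Intermediate": 2, "Advanced": 3}

# Hypothetical journeys of three GFI starters, ordered by merge date.
journeys = {
    "alice": ["Good First Issue", "Intermediate", "Advanced"],
    "bob": ["Good First Issue", "Advanced"],
    "carol": ["Good First Issue", "Beginner"],
}

# Funnel view: how many starters have Advanced as their highest level.
reached_advanced = sum(
    1
    for levels in journeys.values()
    if max(RANK[level] for level in levels) == RANK["Advanced"]
)

# Transition view: consecutive steps between levels.
steps = Counter(
    (previous, current)
    for levels in journeys.values()
    for previous, current in zip(levels, levels[1:])
)

print(reached_advanced, steps[("Good First Issue", "Advanced")])
```

Here two starters reach Advanced, but only one of them does so in a single GFI → Advanced step; the other shows up as GFI → Intermediate followed by Intermediate → Advanced.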

@exploreriii exploreriii left a comment

Contributor

Could you please correct the DCO signing? Then this can be merged.

Follow-up issues to investigate, either in this PR or a later one:
Mounil2005, aceppaluni, emiliyank, exploreriii, Akshat8510 <-- I think these are recorded as starting from beginner/intermediate but actually started from Unknown (before we had the issue-requirement bots, it seems).
It looks like some issues are being labelled after closure, which bypasses the issue guard bots. Maybe we should skip those labels in this case, as it was probably a tidy-up to sort issues by difficulty.
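
If the follow-up goes ahead, skipping labels that were applied after an issue was closed could be sketched as below. The data shape is hypothetical (a list of label events with a `created_at` timestamp, plus the issue's `closed_at`), and the function name `labels_before_close` is made up for illustration; how the pipeline would actually obtain label timestamps is not covered in this thread.

```python
from datetime import datetime, timezone


def labels_before_close(label_events, closed_at):
    """Return labels applied no later than the issue's close time."""
    if closed_at is None:  # issue still open: keep every label
        return [event["label"] for event in label_events]
    return [
        event["label"]
        for event in label_events
        if event["created_at"] <= closed_at
    ]


# Hypothetical events: the "advanced" label was added a month after closure.
events = [
    {"label": "beginner", "created_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"label": "advanced", "created_at": datetime(2025, 3, 1, tzinfo=timezone.utc)},
]
closed = datetime(2025, 2, 1, tzinfo=timezone.utc)
kept = labels_before_close(events, closed)
```

With a filter like this, an issue labelled only after closure would fall back to "Unknown" rather than counting as a Beginner or Intermediate start.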


4 participants