Skip to content

otonashi-labs/chatgpt-wrapped

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChatGPT Wrapped: DS grade

Analyze your ChatGPT history locally with industrial-grade LLM metadata extraction and generate a feature rich, interactive dashboard. The pipeline leaves you with a rich metadata layer for every conversation, ready for your own experiments or search engines.

License Python Frontend

Overview

Distributions

Roko

And more!

🚀 Quick Start

Note: This implementation currently supports OpenRouter only for metadata extraction.

Note: The pipeline handles all your history since launch (May 2023), automatically generating multi-year heatmaps and timelines.

  1. Clone the Repository:
    git clone https://github.com/otonashi-labs/chatgpt-wrapped.git
    cd chatgpt-wrapped
  2. Export Data: Go to ChatGPT Settings → Data Controls → Export Data. You'll receive an email with a zip file. Locate conversations.json inside and place it into data/conversations/.
  3. Configure AI: Copy env.example to .env and add your OpenRouter API Key.
  4. Install Dependencies:
    # Install Python tools
    pip install -r unroller/requirements.txt -r metadater/requirements.txt
    
    # Install Dashboard generator (requires Bun)
    cd wrapped && bun install && cd ..
  5. Run Pipeline:
    python run.py --concurrency 10
    (Concurrency of 10 processes towards LLM calls ~100 chats in 1-2 minutes. Feel free to increase if your rate limits allow.)
  6. View Dashboard:
    • Open wrapped/wrapped.html directly in your browser.
    • Or run a local dev server for live viewing:
      cd wrapped && bun run dev
      Then open http://localhost:9876.

An obfuscated example dashboard is included in the repository. Note that GitHub does not render HTML files directly; for the full interactive experience, it is recommended to view it locally.


🤖 AI Coding Agent? Check out AI_README.md for a technical guide on how to navigate and customize this repository.


🫦 Motivation (hooman written)

So it's always been a struggle to find something in ChatGPT chats.

Imagine you need a formula from research you have done months ago. Or banger GTM idea you have written to chat at 2 am random Thursday. You know that it is there, but oh man it takes time and grind to find it. Especially if you have thousands of chats. That is why an idea of building a good search over the chats has been around with me; you know - proper SOTA agentic search.

For a good search you need to build the metadata layer over chats. I've decided to do it two fold:

  1. deterministic -unroll/ module
  2. LLM infused - metadater/prompt.md & Gemini 3 Flash

Once the metadata has been obtained - I've realized that it's a "Wrapped season" going right now. So here it goes - nice side quest.

Maybe in some near future - full agentic search thingy will be released here as well. I am currently tinkering on it. In the direction of a proper "Second Brain".

If you’re into personal knowledge tooling / retrieval / evaluation / agentic search: I’d love issues, PRs, and wild ideas.

ALSO: I will be very grateful for the feedback on metadata and indexing. How to make it better? How to make the important conversations to "surface" even more?


🏗️ What's under the hood

The pipeline is designed to handle thousands of conversations with high precision.

1. Unroll (unroller/)

Splits your monolithic conversations.json (often hundreds of MBs) into manageable, monthly-organized files. It also performs initial enrichment:

  • Command: python unroller/unroll.py data/conversations/conversations.json
  • Deterministic Metadata:
    {
      "total_messages": 12,
      "messages_by_role": {"user": 5, "assistant": 5, "system": 2},
      "total_tokens": 2500, // Estimated via char count
      "user_tokens": 800,
      "assistant_tokens": 1700,
      "models_used": ["gpt-4o"],
      "primary_model": "gpt-4o",
      "duration_seconds": 120.5,
      "duration_human": "2m 0s",
      "word_count": 450,
      "image_count": 0,
      "audio_count": 0,
      "is_voice_conversation": false
    }

2. Infuse Metadata (metadater/)

The "brain" of the project. It uses Gemini 3 Flash to analyze every conversation against a custom 10-domain taxonomy defined in metadater/config.py. Each conversation is enriched with metadata according to the instructions in metadater/prompt.md:

  • Classification: Domain, sub-domain, conversation type, and request types.
  • Context: User intent, specific keywords, and entity extraction.
  • Quality Metrics: 8+ numerical scores measuring engagement and response quality.
  • Dynamics: Tone, mood, and flow patterns.

For a full explanation of the extraction logic and available fields, see metadater/prompt.md and the taxonomy in metadater/config.py.

Note: Improving this metadata layer is a hot area for future work. I am actively looking for ways to make indexing better and to help important conversations surface more effectively. Feedback and wild ideas are very welcome!

Example LLM Metadata (llm_meta):

{
  "domain": "problem_solving",
  "sub_domain": "debugging",
  "conversation_type": "troubleshooting",
  "user_intent": "Fixing a race condition in a Python script using asyncio and threading locks",
  "request_types": ["task", "explanation"],
  "keywords": ["race condition", "threading", "lock", "asyncio", "deadlock"],
  "entities_people": [],
  "entities_companies": ["OpenAI", "GitHub"],
  "entities_products": ["Visual Studio Code"],
  "entities_places": [],
  "technologies": ["Python", "httpx", "asyncio"],
  "concepts": ["Concurrency Control", "Mutual Exclusion"],
  "inferred_future_relevance_score": 85,
  "urgency_score": 40,
  "complexity_score": 70,
  "information_density": 90,
  "depth_of_engagement": 75,
  "user_satisfaction_inferred": 95,
  "user_request_quality_inferred": 80,
  "ai_response_quality_score": 90,
  "serendipity_vs_general_public": 75,
  "serendipity_vs_power_users": 65,
  "conversation_flow": "iterative",
  "user_mood": "focused",
  "conversation_tone": "technical",
  "one_line_summary": "Debugging Python asyncio race condition with threading locks",
  "outcome_type": "task_completed",
  "information_direction": "collaborative",
  "topic_tags": ["python_concurrency", "debugging_session"]
}

3. Generate Wrapped (wrapped/)

Aggregates all metadata into a unified statistics engine and produces a feature-rich dashboard. There are two ways to use it:

  • Static Mode: Generates a standalone, interactive wrapped.html file that you can open anywhere.
    python wrapped/aggregate.py
    cd wrapped && bun run generate
  • Live Mode: Runs a local development server for a more dynamic experience.
    cd wrapped && bun run dev

Performance & Cost

  • Gemini 3 Flash: Chosen for its massive 1M token context window and low cost.
  • Concurrency: Optimized for speed with parallel async requests. A concurrency of 10 can process approximately 100 conversations every 1-2 minutes.
  • Cost Estimate: Processing ~1,500 conversations typically costs between $5-7 USD via OpenRouter.

🛡️ Privacy First

  • Local Processing: Your raw data never leaves your machine except for the metadata extraction request sent to the LLM.
  • No Tracking: This tool has no analytics or external reporting.
  • Protected: The .gitignore is pre-configured to ensure no JSON exports or .env files are ever committed.

📝 Notes & Discussion

  • Gemini 3 Flash seems to treat GPT-4o with slight arrogance. This is seen by weirdly lower costs for 4o. Or it's just the LLM models progress.
  • More to come...

📄 License

MIT

About

ChatGPT Wrapped creation: Data Science grade

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors