Analyze your ChatGPT history locally with industrial-grade LLM metadata extraction and generate a feature rich, interactive dashboard. The pipeline leaves you with a rich metadata layer for every conversation, ready for your own experiments or search engines.
And more!
Note: This implementation currently supports OpenRouter only for metadata extraction.
Note: The pipeline handles all your history since launch (May 2023), automatically generating multi-year heatmaps and timelines.
- Clone the Repository:
git clone https://github.com/otonashi-labs/chatgpt-wrapped.git cd chatgpt-wrapped - Export Data: Go to ChatGPT Settings → Data Controls → Export Data. You'll receive an email with a zip file. Locate
conversations.jsoninside and place it intodata/conversations/. - Configure AI: Copy
env.exampleto.envand add your OpenRouter API Key. - Install Dependencies:
# Install Python tools pip install -r unroller/requirements.txt -r metadater/requirements.txt # Install Dashboard generator (requires Bun) cd wrapped && bun install && cd ..
- Run Pipeline:
(Concurrency of 10 processes towards LLM calls ~100 chats in 1-2 minutes. Feel free to increase if your rate limits allow.)
python run.py --concurrency 10
- View Dashboard:
- Open
wrapped/wrapped.htmldirectly in your browser. - Or run a local dev server for live viewing:
Then open
cd wrapped && bun run dev
http://localhost:9876.
- Open
An obfuscated example dashboard is included in the repository. Note that GitHub does not render HTML files directly; for the full interactive experience, it is recommended to view it locally.
🤖 AI Coding Agent? Check out AI_README.md for a technical guide on how to navigate and customize this repository.
So it's always been a struggle to find something in ChatGPT chats.
Imagine you need a formula from research you have done months ago. Or banger GTM idea you have written to chat at 2 am random Thursday. You know that it is there, but oh man it takes time and grind to find it. Especially if you have thousands of chats. That is why an idea of building a good search over the chats has been around with me; you know - proper SOTA agentic search.
For a good search you need to build the metadata layer over chats. I've decided to do it two fold:
- deterministic -
unroll/module - LLM infused -
metadater/prompt.md& Gemini 3 Flash
Once the metadata has been obtained - I've realized that it's a "Wrapped season" going right now. So here it goes - nice side quest.
Maybe in some near future - full agentic search thingy will be released here as well. I am currently tinkering on it. In the direction of a proper "Second Brain".
If you’re into personal knowledge tooling / retrieval / evaluation / agentic search: I’d love issues, PRs, and wild ideas.
ALSO: I will be very grateful for the feedback on metadata and indexing. How to make it better? How to make the important conversations to "surface" even more?
The pipeline is designed to handle thousands of conversations with high precision.
Splits your monolithic conversations.json (often hundreds of MBs) into manageable, monthly-organized files. It also performs initial enrichment:
- Command:
python unroller/unroll.py data/conversations/conversations.json - Deterministic Metadata:
{ "total_messages": 12, "messages_by_role": {"user": 5, "assistant": 5, "system": 2}, "total_tokens": 2500, // Estimated via char count "user_tokens": 800, "assistant_tokens": 1700, "models_used": ["gpt-4o"], "primary_model": "gpt-4o", "duration_seconds": 120.5, "duration_human": "2m 0s", "word_count": 450, "image_count": 0, "audio_count": 0, "is_voice_conversation": false }
The "brain" of the project. It uses Gemini 3 Flash to analyze every conversation against a custom 10-domain taxonomy defined in metadater/config.py. Each conversation is enriched with metadata according to the instructions in metadater/prompt.md:
- Classification: Domain, sub-domain, conversation type, and request types.
- Context: User intent, specific keywords, and entity extraction.
- Quality Metrics: 8+ numerical scores measuring engagement and response quality.
- Dynamics: Tone, mood, and flow patterns.
For a full explanation of the extraction logic and available fields, see metadater/prompt.md and the taxonomy in metadater/config.py.
Note: Improving this metadata layer is a hot area for future work. I am actively looking for ways to make indexing better and to help important conversations surface more effectively. Feedback and wild ideas are very welcome!
Example LLM Metadata (llm_meta):
{
"domain": "problem_solving",
"sub_domain": "debugging",
"conversation_type": "troubleshooting",
"user_intent": "Fixing a race condition in a Python script using asyncio and threading locks",
"request_types": ["task", "explanation"],
"keywords": ["race condition", "threading", "lock", "asyncio", "deadlock"],
"entities_people": [],
"entities_companies": ["OpenAI", "GitHub"],
"entities_products": ["Visual Studio Code"],
"entities_places": [],
"technologies": ["Python", "httpx", "asyncio"],
"concepts": ["Concurrency Control", "Mutual Exclusion"],
"inferred_future_relevance_score": 85,
"urgency_score": 40,
"complexity_score": 70,
"information_density": 90,
"depth_of_engagement": 75,
"user_satisfaction_inferred": 95,
"user_request_quality_inferred": 80,
"ai_response_quality_score": 90,
"serendipity_vs_general_public": 75,
"serendipity_vs_power_users": 65,
"conversation_flow": "iterative",
"user_mood": "focused",
"conversation_tone": "technical",
"one_line_summary": "Debugging Python asyncio race condition with threading locks",
"outcome_type": "task_completed",
"information_direction": "collaborative",
"topic_tags": ["python_concurrency", "debugging_session"]
}Aggregates all metadata into a unified statistics engine and produces a feature-rich dashboard. There are two ways to use it:
- Static Mode: Generates a standalone, interactive
wrapped.htmlfile that you can open anywhere.python wrapped/aggregate.py cd wrapped && bun run generate
- Live Mode: Runs a local development server for a more dynamic experience.
cd wrapped && bun run dev
- Gemini 3 Flash: Chosen for its massive 1M token context window and low cost.
- Concurrency: Optimized for speed with parallel async requests. A concurrency of 10 can process approximately 100 conversations every 1-2 minutes.
- Cost Estimate: Processing ~1,500 conversations typically costs between $5-7 USD via OpenRouter.
- Local Processing: Your raw data never leaves your machine except for the metadata extraction request sent to the LLM.
- No Tracking: This tool has no analytics or external reporting.
- Protected: The
.gitignoreis pre-configured to ensure no JSON exports or.envfiles are ever committed.
- Gemini 3 Flash seems to treat GPT-4o with slight arrogance. This is seen by weirdly lower costs for 4o. Or it's just the LLM models progress.
- More to come...
MIT


