Making Hudi Github Project More Agentic #18324

soumilshah1995 · 2026-03-15T20:22:56Z

soumilshah1995
Mar 15, 2026

Hi everyone — I had a small proposal that could help make our open-source GitHub repos much more AI-agent friendly and easier to maintain.

Idea: introduce a few lightweight repo conventions so AI agents understand the project better and can contribute safely.

A. AGENTS.md
A file that explains the repository structure, coding conventions, and important modules so AI agents understand how the project is organized before making changes.

B. ARCHITECTURE.md
A high-level overview of the system architecture, key components, and data flow. This helps agents reason about design decisions instead of only reading code.

C. Documentation Skill (AI-based)
Introduce an automated documentation workflow where an AI agent periodically scans the repo and generates or updates documentation for APIs, modules, and examples. This could help keep open-source documentation consistently up to date.

Goal:
Make the repo easier for both humans and AI agents to understand, improve onboarding, and enable more reliable automated contributions.

Happy to experiment with a small prototype if folks think this could be useful.

codope · 2026-03-16T05:56:43Z

codope
Mar 16, 2026
Collaborator

Yes, I think this would be very useful for the community. How do we plan to keep the AGENTS/ARCHITECTURE.md files updated as major changes land? Also, if you could share some of your own experience of how you use agents with the Hudi repo, it would be very helpful for the community.

2 replies

geserdugarov Mar 24, 2026

I suppose that to keep AGENTS/ARCHITECTURE.md up to date, we have to accept that they will only be eventually consistent. For instance, on a weekly basis someone could spend their tokens on the latest model with the biggest context window, pass through all merged changes, and update those files.

From my experience, the goal when managing AI-based data should be to keep it tight. It is really easy to let it grow exponentially, and the more description you have, the higher the probability of conflicts within it. Removing content is hard these days.

cshuo Mar 25, 2026
Collaborator

That's a great point. For keeping AGENTS.md / ARCHITECTURE.md up to date, maybe we can use a lightweight process:

Add a PR checklist item: “Does this change require updates to AGENTS/ARCHITECTURE docs?”
Optionally add a non-blocking CI reminder when large structural changes are detected (new modules/package moves/public API changes).
Do a quick doc consistency pass before each release.
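The non-blocking CI reminder could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the `(status, path)` input shape (modelled on what `git diff --name-status` reports), and the two heuristics (a newly added `pom.xml` means a new module, a renamed source file means a package move) are hypothetical and not an agreed design. Public API change detection is left out because it needs more than file names.

```python
from pathlib import PurePosixPath

# Hypothetical heuristics: the thread names the categories, not the rules.
BUILD_FILES = {"pom.xml"}
DOC_FILES = {"AGENTS.md", "ARCHITECTURE.md"}
SOURCE_SUFFIXES = {".java", ".scala"}


def structural_changes(changes):
    """Return reasons why the agent docs may need an update.

    `changes` is a list of (status, path) pairs, where status is "A" (added),
    "D" (deleted), "R" (renamed) or "M" (modified).
    """
    reasons = []
    for status, path in changes:
        p = PurePosixPath(path)
        if status == "A" and p.name in BUILD_FILES:
            reasons.append(f"new module: {p.parent}")
        elif status == "R" and p.suffix in SOURCE_SUFFIXES:
            reasons.append(f"package move: {path}")
    return reasons


def doc_reminder(changes):
    """Return a reminder message, or None when no reminder is needed."""
    touched_docs = any(PurePosixPath(path).name in DOC_FILES for _, path in changes)
    reasons = structural_changes(changes)
    if reasons and not touched_docs:
        return ("Reminder: consider updating AGENTS/ARCHITECTURE docs ("
                + "; ".join(reasons) + ")")
    return None
```

Because the function only returns a message and never raises, a CI job could post it as a PR comment without failing the build, which keeps the check non-blocking.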

rahil-c · 2026-04-02T04:00:59Z

rahil-c
Apr 2, 2026
Collaborator

@soumilshah1995 One idea that @vinothchandar brought up in the Hudi community sync was revamping the current hudi-cli to embrace this new era of agents: ideally giving Claude, Codex, or another LLM context on the Hudi project (for example, how to run certain commands), or even adding new skills for operating a Hudi table, such as running table services or tuning certain configs. Happy to sync at some point and talk through some ideas with you as well!

2 replies

soumilshah1995 Apr 4, 2026
Author

That's a really nice idea

brucearctor Apr 5, 2026

sounds like agentskills.

An example:

https://temporal.io/blog/introducing-temporal-developer-skill

ex ->
https://github.com/temporalio/skill-temporal-developer/tree/main/references/java

vinothchandar · 2026-04-06T20:45:27Z

vinothchandar
Apr 6, 2026
Collaborator

I think indeed this is a very important step that the Hudi community needs to take. I feel there is a fundamental gap here that an AI-native experience can bridge for the project. For many years now, I think we've been having this divide where users want both advanced, powerful features and don't want to deal with the complexity of tuning and learning configs and things like that. Without AI, this unfortunately was not possible. For any other database, you get a system that can do many things powerfully; then you also need to spend the time learning the system.

As the data lake became a lake house, you need data engineers to also level up and become kind of data lake administrators. Unfortunately, this never happened. Engineers are sold completely closed solutions even today for the sake of ease of use.

But I think AI presents us an opportunity to bridge this gap once and for all for open source. I would like to reimagine the interface to the project, and the way any user builds a lake house, as a Claude Code- or Codex-like terminal experience.

With lots of will and a little bit of skill, I think we can get there much faster than we think. The vision is that you create, build, optimize, deploy -- everything from an agentic terminal CLI.

For starters

Fully porting the hudi-cli to this model, with the ability to connect to local fs, cloud storage locations, and catalogs, will help the entire community do more without breaking a sweat.
Building some version-aware (foundational models don't get this) RAG over hudi-docs can help optimize config tuning, along with an optimize skill.
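To make "version-aware" concrete, here is a minimal sketch of the retrieval step only. It assumes each docs chunk is tagged with the release it came from and filters on that tag before ranking. The `DocChunk` type, the `retrieve` function, and the word-overlap scoring are all hypothetical stand-ins; a real setup would more likely rank with embeddings, but the version filter would sit in the same place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocChunk:
    version: str  # docs release the chunk was taken from (hypothetical tag)
    text: str


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return set(text.lower().split())


def retrieve(chunks, query, version, top_k=2):
    """Return up to top_k chunks for `version`, ranked by word overlap with the query."""
    query_words = tokenize(query)
    # Filter by version first, so an answer never mixes advice from different releases.
    candidates = [c for c in chunks if c.version == version]
    scored = [(len(query_words & tokenize(c.text)), c) for c in candidates]
    scored = [(score, c) for score, c in scored if score > 0]
    # sorted() is stable, so chunks with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

The retrieved chunks would then be handed to the model together with the user's question and the table's actual version, which is the piece a foundation model cannot know on its own.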

I built a toy MCP server with some 27 tools, over the weekend, with auto discovery of tables (local fs for now, we can bring catalogs and cloud storage in there pretty easily). I think we can even make @hudi/table/<table_name> work as MCP resources.

[13:25:58] hudi-agent % claude
 ▐▛███▜▌   Claude Code v2.1.92
...
...
❯ what hudi tables do I have?

⏺ hudi - hudi_list_tables (MCP)
  ⎿  {
       "result": "Discovered 11 table(s):\n  hudi_demo_table                 COPY_ON_WRITE         /private/tmp/hudi-
     agent-demo/demo_table\n  customers                       COPY_ON_WRITE         /private/tmp/hudi-datalake/custom
     … +9 lines (ctrl+o to expand)

⏺ You have 11 Hudi tables across two storage locations:

  ┌─────────────────┬──────┬──────────────────────────────────────────┐
  │      Table      │ Type │                   Path                   │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ hudi_demo_table │ COW  │ /private/tmp/hudi-agent-demo/demo_table  │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ customers       │ COW  │ /private/tmp/hudi-datalake/customers     │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ shipments       │ COW  │ /private/tmp/hudi-datalake/shipments     │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ payments        │ COW  │ /private/tmp/hudi-datalake/payments      │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ products        │ COW  │ /private/tmp/hudi-datalake/products      │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ user_sessions   │ MOR  │ /private/tmp/hudi-datalake/user_sessions │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ audit_log       │ COW  │ /private/tmp/hudi-datalake/audit_log     │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ clickstream     │ MOR  │ /private/tmp/hudi-datalake/clickstream   │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ inventory       │ MOR  │ /private/tmp/hudi-datalake/inventory     │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ orders          │ COW  │ /private/tmp/hudi-datalake/orders        │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ reviews         │ MOR  │ /private/tmp/hudi-datalake/reviews       │
  └─────────────────┴──────┴──────────────────────────────────────────┘

  6 are Copy-on-Write and 4 are Merge-on-Read tables. Most live under /private/tmp/hudi-datalake/. Want to inspect any of
  them?

❯ Analyze layout of table clickstream

⏺ hudi - hudi_describe_table (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[\n  {\n    \"Property\": \"basePath\",\n    \"Value\": \"/private/tmp/hudi-datalake/clickstream\"\
     n  },\n  {\n    \"Property\": \"metaPath\",\n    \"Value\": \"/private/tmp/hudi-datalake/clickstream/.hoodie\"\n
     … +25 lines (ctrl+o to expand)

⏺ hudi - hudi_fetch_schema (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "{\n  \"type\" : \"record\",\n  \"name\" : \"clickstream_record\",\n  \"namespace\" : \"hoodie.click
     stream\",\n  \"fields\" : [ {\n    \"name\" : \"_hoodie_commit_time\",\n    \"type\" : [ \"null\", \"string\" ],
     … +10 lines (ctrl+o to expand)

⏺ hudi - hudi_timeline_active (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[\n  {\n    \"Instant\": \"20260404203259123\",\n    \"Action\": \"deltacommit\",\n    \"State\": \
     "COMPLETED\"\n  }\n]"
     }

⏺ hudi - hudi_metadata_partitions (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[\n  \"dt=2024-01-01\",\n  \"dt=2024-01-02\",\n  \"dt=2024-01-03\"\n]"
     }

⏺ hudi - hudi_metadata_stats (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[]"
     }

⏺ hudi - hudi_commits_show (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[\n  {\n    \"CommitTime\": \"20260404203259123\",\n    \"Action\": \"deltacommit\",\n    \"TotalBy
     tesWritten\": 1307319,\n    \"TotalFilesInsert\": 3,\n    \"TotalFilesUpdated\": 0,\n    \"Partitions\": 3,\n
     … +2 lines (ctrl+o to expand)

⏺ hudi - hudi_metadata_files (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[]"
     }

⏺ hudi - hudi_compactions_show (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[]"
     }

⏺ Here's the full layout analysis for clickstream:

  ---
  Table Identity

  ┌─────────────────┬────────────────────────────────────────┐
  │    Property     │                 Value                  │
  ├─────────────────┼────────────────────────────────────────┤
  │ Type            │ Merge-on-Read (MOR)                    │
  ├─────────────────┼────────────────────────────────────────┤
  │ Base path       │ /private/tmp/hudi-datalake/clickstream │
  ├─────────────────┼────────────────────────────────────────┤
  │ Record key      │ id                                     │
  ├─────────────────┼────────────────────────────────────────┤
  │ Partition field │ dt (hive-style)                        │
  ├─────────────────┼────────────────────────────────────────┤
  │ Ordering field  │ ts                                     │
  ├─────────────────┼────────────────────────────────────────┤
  │ Merge mode      │ EVENT_TIME_ORDERING                    │
  ├─────────────────┼────────────────────────────────────────┤
  │ Key generator   │ SIMPLE                                 │
  ├─────────────────┼────────────────────────────────────────┤
  │ Table version   │ 9                                      │
  ├─────────────────┼────────────────────────────────────────┤
  │ Timeline layout │ v2                                     │
  └─────────────────┴────────────────────────────────────────┘

  Schema (6 user fields + 5 meta fields)

  ┌─────────┬─────────────────────────────────────┐
  │  Field  │                Type                 │
  ├─────────┼─────────────────────────────────────┤
  │ id      │ int (nullable)                      │
  ├─────────┼─────────────────────────────────────┤
  │ user_id │ int (nullable)                      │
  ├─────────┼─────────────────────────────────────┤
  │ page    │ string (nullable)                   │
  ├─────────┼─────────────────────────────────────┤
  │ action  │ string (nullable)                   │
  ├─────────┼─────────────────────────────────────┤
  │ ts      │ long (nullable) — ordering field    │
  ├─────────┼─────────────────────────────────────┤
  │ dt      │ string (nullable) — partition field │
  └─────────┴─────────────────────────────────────┘

  Plus the standard _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path,
  _hoodie_file_name meta fields.

  Partitions (3)

  - dt=2024-01-01
  - dt=2024-01-02
  - dt=2024-01-03

  Timeline & Commits

  Single completed deltacommit at 20260404203259123:
  - 8 records inserted across 3 partitions
  - 3 log files written (one per partition — typical for MOR inserts)
  - ~1.3 MB total bytes written
  - 0 updates, 0 write errors

  Metadata Table

  Enabled indexes: files, column_stats, partition_stats

  Compactions

  None pending. Since there's only one deltacommit with log files and no base (parquet) files yet, the table is entirely
  in delta log form. A compaction would convert the logs into columnar base files — something to consider if you plan to
  run read-optimized queries.

  ---
  Observations

  1. Very young table — single commit, 8 records. Essentially a fresh table with seed data.
  2. All data in log files — no base files exist yet. Snapshot queries will read logs directly; read-optimized queries
  would return nothing until a compaction runs.
  3. Even distribution — ~2-3 records per partition across the 3 date partitions.
  4. Good metadata setup — column_stats and partition_stats are enabled, which will help with data skipping as the table
  grows.

  Would you like me to dig deeper into any specific aspect, or look at the file-level details within a partition?

✻ Cooked for 33s
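For readers wondering what local-filesystem auto-discovery might involve, here is a rough sketch. It is not the code of the toy server above. It relies on the fact that a Hudi table keeps its configuration in `.hoodie/hoodie.properties` under the base path, with `hoodie.table.name` and `hoodie.table.type` keys. The function names and the simplified properties parser are assumptions made for illustration.

```python
import os


def read_properties(path):
    """Parse a Java-style .properties file into a dict (plain key=value lines only)."""
    props = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def discover_tables(root):
    """Walk `root` and return sorted (name, type, base_path) tuples for each table found."""
    tables = []
    for dirpath, dirnames, _ in os.walk(root):
        props_file = os.path.join(dirpath, ".hoodie", "hoodie.properties")
        if os.path.isfile(props_file):
            props = read_properties(props_file)
            name = props.get("hoodie.table.name", os.path.basename(dirpath))
            table_type = props.get("hoodie.table.type", "UNKNOWN")
            tables.append((name, table_type, dirpath))
            dirnames[:] = []  # do not descend into the table's own partitions
    return sorted(tables)
```

A tool like `hudi_list_tables` could wrap a function of this shape and format the tuples as text. Catalogs and cloud storage would need their own listing logic in place of `os.walk`.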

1 reply

vinothchandar Apr 6, 2026
Collaborator

This can get way sharper if we lean in and write skills.

vinothchandar · 2026-04-06T20:50:58Z

vinothchandar
Apr 6, 2026
Collaborator

@soumilshah1995 +1 on the original AGENTS.md/ARCHITECTURE.md/CLAUDE.md idea. I think @yihua has a PR for a CLAUDE.md file; we should consider that as well. Happy to take contributions around this: let's get it landed and keep chipping away.

1 reply

soumilshah1995 Apr 8, 2026
Author

Hi @vinothchandar — your ideas are fantastic. Having the Hudi CLI controlled by Claude sounds like a great starting point.
Also, could you share if there’s an existing CLAUDE.md in any active PR? If not, I’d be happy to help add one. We could initialize it using /init, let Claude scan the repository, and generate a baseline CLAUDE.md. From there, we can gradually evolve it by adding rules, coding standards, and best practices.

This could help make the Hudi repo more agent-friendly, and for anyone using modern AI tools, it would also encourage better documentation and consistency across contributions.
