1. Issue
We currently have a SCIP textbook bot that answers student questions about textbook pages only. We would like to extend it to also answer questions about tutorials, recitations, past-year papers (PYPs), and lecture content. This document proposes a method to store this content and use retrieval-augmented generation (RAG) to respond.
2. Summary
We keep a map of every document we have (course, year, description). An initial prompt lets the AI pick which documents are relevant, the backend fetches those documents, and a second prompt sends them to GPT-5 together with the student's question.
Scope
In: past-year exams, lecture slides, tutorial sheets, and recitation sheets, served through the web chat interface.
3. How It Works
The whole flow is two LLM calls with a simple backend fetch in between.
Step 1 — The AI picks the documents
When a student asks a question, our Elixir backend prepends the document map to the prompt. This map is a (JSON) list of every document we have: its title, course, year, type (exam/lecture/etc.), a one-line description, and an S3 link. The AI returns a JSON array of links to the documents it thinks are relevant.
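The Step 1 flow can be sketched as follows. This is an illustrative Python sketch (the real backend is Elixir); the document-map entry and field names follow the schema in section 4, and the S3 URL is a made-up example.

```python
import json

# Hypothetical document-map entries; in production these come from Postgres.
DOC_MAP = [
    {
        "title": "CS1101S Final 2023",
        "doc_type": "exam",
        "year": 2023,
        "description": "Final exam covering streams and the metacircular evaluator.",
        "s3_original_url": "s3://scip-content/exams/cs1101s-final-2023.pdf",
    },
]

def build_routing_prompt(question: str, doc_map: list) -> str:
    """Prepend the full document map (as JSON) to the student's question."""
    return (
        "Here is a JSON list of every document we have:\n"
        f"{json.dumps(doc_map, indent=2)}\n\n"
        "Return ONLY a JSON array of s3_original_url values for the "
        "documents relevant to this question:\n"
        f"{question}"
    )

def parse_routing_reply(reply: str) -> list:
    """Parse the model's reply into a list of S3 links; fail closed on bad JSON."""
    try:
        links = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(links, list):
        return []
    return [link for link in links if isinstance(link, str)]
```

Failing closed on malformed JSON matters here: if the routing model returns prose instead of an array, the backend should fall back gracefully rather than crash mid-request.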
Step 2 — We fetch the documents
The backend takes that JSON array of S3 links and fetches the actual document content. We extract the text from the files and bundle it for the next step.
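The fetch-and-bundle step could look like the sketch below. The S3 call itself is injected as a function (the Elixir backend would use ex_aws for this), so only the bundling logic is shown; the `=== SOURCE: … ===` labelling is an assumption, chosen so Step 3 can tie citations back to documents.

```python
def bundle_documents(links, fetch):
    """Fetch each selected document's extracted text and concatenate it,
    labelling each chunk with its source link for citation in Step 3.

    `fetch` is a callable taking an S3 link and returning extracted text;
    it stands in for the real S3 client call.
    """
    parts = []
    for link in links:
        text = fetch(link)
        parts.append(f"=== SOURCE: {link} ===\n{text}")
    return "\n\n".join(parts)
```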
Step 3 — The AI answers the question
We send a second prompt to GPT-5 containing the original student question plus the selected documents. The prompt tells the model to answer using only the provided materials and to cite its sources (document name, year, page). If the documents don't contain enough information, it says so instead of making something up.
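A minimal sketch of the Step 3 prompt, with the grounding and citation instructions described above (exact wording is an assumption):

```python
def build_answer_prompt(question: str, bundled_docs: str) -> str:
    """Second prompt: the original question plus the selected documents,
    instructing the model to answer only from the provided materials."""
    return (
        "Answer the student's question using ONLY the materials below. "
        "Cite every claim with (document name, year, page). If the materials "
        "do not contain enough information to answer, say so explicitly "
        "instead of guessing.\n\n"
        f"MATERIALS:\n{bundled_docs}\n\n"
        f"QUESTION:\n{question}"
    )
```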
4. What We Store
S3 Buckets
We will add one new S3 bucket (or a new route in the existing bucket):
- Content – the uploaded files (PDF, PPTX, DOCX) for download/reference.
JSON/ Postgres (document map)
One table that serves as the "index" the AI reads in Step 1. This gives the AI enough context to return a list of documents we need to fetch.
The schema below is written as PostgreSQL columns, but may change to a JSON file stored in the backend.
| Column | Type | What it's for |
| --- | --- | --- |
| id | UUID | Primary key. |
| title | TEXT | Human-readable name (e.g. "CS1101S Final 2023"). |
| description | TEXT | One-line summary the AI reads to decide relevance. |
| doc_type | TEXT | 'exam', 'lecture', 'tutorial', or 'recitation'. |
| year | INT | Academic year. |
| week | INT | Week number (null for lectures/tutorials). |
| s3_original_url | TEXT | Link to the original file in S3. |
At query time, we load this table into the Step 1 prompt as a JSON array.
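One way to express the table above as Postgres DDL, should we stay with the database option. The table name and the NOT NULL/CHECK constraints are assumptions; the columns and types mirror the schema above.

```sql
CREATE TABLE documents (
    id              UUID PRIMARY KEY,
    title           TEXT NOT NULL,
    description     TEXT NOT NULL,
    doc_type        TEXT NOT NULL
                    CHECK (doc_type IN ('exam', 'lecture', 'tutorial', 'recitation')),
    year            INT  NOT NULL,
    week            INT,            -- nullable, per the table above
    s3_original_url TEXT NOT NULL
);
```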
5. Tech Stack Needed
| Layer | Choice | Notes |
| --- | --- | --- |
| Database | PostgreSQL | Stores the document map (may instead be a JSON file in the backend). |
| Storage | AWS S3 (3 buckets) | Originals, extracted text, and assets. Accessed via ex_aws. |
| Routing LLM | GPT-5 | Cheap and fast for reading the map and returning a JSON array of links. |
| Generation LLM | GPT-5 | Handles the actual answer generation with full document context. |