Institutional Chatbot (RAG)

Overview

Built IAN, an institutional assistant for INSA Toulouse that answers student questions by combining semantic search (FAISS + MiniLM) with a large generator (Mixtral‑8x7B) using a RAG pipeline. The system prioritizes up‑to‑date factual answers and clear failure messages.

Data Collection

We implemented a Java scraper targeting INSA public pages and Moodle:

6,215 pages harvested in ~20 minutes
≈4.45M words extracted, 55% of documents with identifiable update date
Limitations: scanned PDFs, images and complex tables required manual handling

Experimental Roadmap

We compared three approaches:

From‑scratch small language models (PyTorch) — computationally expensive, poor results without large infrastructure
Fine‑tuning GPT‑2 on local data — modest gains; human evaluation: ~3.8% fully satisfactory responses
RAG (final solution) — retrieval + generation gave best practical results and supports live updates

RAG Pipeline

Pipeline summary:

Chunk documents into 1000‑char segments
Embed with MiniLM, index vectors in FAISS
At query time, retrieve top segments and pass them to Mixtral‑8x7B alongside the prompt

Benefits: up‑to‑date answers without heavy re‑training.

Evaluation & Tools

Prototype interface (Streamlit) named IAN. Evaluation on a small question set: 7/8 relevant answers. Additional tooling:

Cosine similarity diagnostics for chunk usefulness
Cross‑encoder re‑rankers for precision (MiniLM)
Binary classifier to detect non‑answerable queries

Results & Impact

RAG significantly outperformed local training and fine‑tuning for the use case. The system provides reliable, updateable guidance for students and administrative queries.

Next Steps

Index sub‑specializations in FAISS for faster, more accurate retrieval
Integrate structured sources (schedules, databases) and multimodal inputs
Deploy monitoring for drift and automated re-indexing