RAG
Echo
AI-Powered Search & Reasoning on Your Data — On Your Terms
Echo is a modular Retrieval-Augmented Generation (RAG) platform that lets users upload documents, query them using AI (OpenAI or local LLM), and see results—all behind a secure authentication and logging system. It's a foundation for AI-powered enterprise search, document Q&A, and internal knowledge bases, designed to be privacy-conscious and extendable.
Process
Source code | Local host setup:
https://github.com/saadaziz/echo-private | C:\Users\saad0\Documents\source\echo
https://github.com/saadaziz/echo (public repo) | C:\Users\saad0\Documents\source\echo-public
https://github.com/saadaziz/identity-backend
Get started - guide
DevOps
Incident Management
Knowledge management
- Roadmap (Business-facing, executive, “what we are building, when, and why.”)
- Engineering journal - Everything else
- Remove secrets
- Production checklist
- Centralize logging on aurorahours, integrate with identity, and allow local to log there as well
- Centralize remaining services, and move out of mono repo (echo)
- CRUD, one-to-many tags with a file (MVP; other aspects later)
- These tags can be used to optimize the query
- Add issue: when the JWT_SECRET_KEY values in logging-backend, identity-backend, and the cPanel environment variables do not match, logs stop writing
- Poor man's message-queue
- Poor man's API rate limiting
- Poor man's Service-to-Service Authentication and Authorization
- Support a poor man's design: Have multiple “subscribers” pull from your message-queue, process, ack/fail, dead-letter. (You don’t need Kafka for <1000 msg/sec or MVP scale.)
- Remaining poor man's items
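The poor man's message-queue above (multiple subscribers pull, process, ack/fail, dead-letter) can be sketched in a few lines of SQLite. This is a hypothetical illustration, not Echo's actual schema: the `jobs` table, its columns, and `MAX_ATTEMPTS` are all assumptions.

```python
import sqlite3
import time

# A "poor man's message queue" on SQLite: queued -> running -> complete/dead,
# with retry and a dead-letter status. Table and column names are hypothetical.
MAX_ATTEMPTS = 3

def connect(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS jobs (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'queued',  -- queued|running|complete|dead
        attempts INTEGER NOT NULL DEFAULT 0,
        updated_at REAL NOT NULL)""")
    return db

def enqueue(db, payload):
    cur = db.execute("INSERT INTO jobs (payload, updated_at) VALUES (?, ?)",
                     (payload, time.time()))
    db.commit()
    return cur.lastrowid

def claim(db):
    # Claim the oldest queued job. The conditional UPDATE means that if two
    # subscribers race for the same row, only one of them wins it.
    row = db.execute(
        "SELECT id FROM jobs WHERE status='queued' ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    won = db.execute(
        "UPDATE jobs SET status='running', attempts=attempts+1, updated_at=? "
        "WHERE id=? AND status='queued'", (time.time(), row[0])).rowcount
    db.commit()
    return row[0] if won else None

def ack(db, job_id):
    db.execute("UPDATE jobs SET status='complete', updated_at=? WHERE id=?",
               (time.time(), job_id))
    db.commit()

def fail(db, job_id):
    # Re-queue on failure until MAX_ATTEMPTS, then dead-letter the job.
    (attempts,) = db.execute(
        "SELECT attempts FROM jobs WHERE id=?", (job_id,)).fetchone()
    status = "dead" if attempts >= MAX_ATTEMPTS else "queued"
    db.execute("UPDATE jobs SET status=?, updated_at=? WHERE id=?",
               (status, time.time(), job_id))
    db.commit()
```

At MVP scale (<1000 msg/sec, per the note above) this is plenty; the one real constraint is SQLite's single-writer model, which is fine for a handful of subscribers.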
Release 1.0 Goals
Core Goals
- Single Sign-On & Identity: Centralized login system using OAuth2/OIDC, acting as an "Auth0/Okta for microservices" for your ecosystem.
- Job/Document Ingestion: Upload files, queue them for processing, and parse their content.
- RAG Query Interface: Users can ask questions; responses are generated from their own document corpus (via OpenAI or a local LLM like Ollama).
- Audit & Observability: Every major action is logged to a central log service for transparency, debugging, and compliance.
- MVP Simplicity: Designed for clarity and fast learning, even at the cost of some performance.
Release 1.0 QA Runs
8/1/2025 - 11:45 AM | Notes
8/3/2025 - 1:30 AM | Prod ready checklist, Work log | 11:30 AM
8/3/2025 - 3:15 PM | git Labels doc
Issues
- 2025-08-02 - Secrets: app.secret_key, dev-secret
- 2025-08-02 - Issue: Investigate, is this zero trust, and is every request going through one login service?
- Lack of understanding: OIDC Flow: "Authorization Code Flow" (with PKCE optional for public clients)
- Not sure, but do I need PKCE?
- 2025-08-02 - Issue: [WARN] Failed to log to logging-backend: 401 {"error":"Invalid issuer"}
- 2025-08-02 - Issue: Decodes a JWT without verifying the signature (not for production, but okay for local dev/test).
- 2025-08-02 - Issue: Logs show session_id=no-session-id.
- 2025-08-02 - Issue: so slow!
- 2025-08-03 - Secrets: safeguards break the OIDC/OAuth2 flow
- 2025-08-03 - Session: Used for login/authorization state (with warnings about this in the comments). OK for MVP. In production, you'd want stronger session security, CSRF protection, and secure cookie settings. The session secret (FLASK_SECRET_KEY) is loaded from env.
- 2025-08-03 - Database: SQLite is used for storing authorization codes; the table is created if not present. Auth codes are one-time use and deleted after exchange (good!). No attempt at code expiry/cleanup, but not a dealbreaker for a demo.
- 2025-08-03 - OAuth Logic: Checks for valid client_id, client_secret, and redirect_uri. Returns a JWT with standard OpenID fields. The JWT secret is loaded from env; uses HS256. The demo supports only one client, one user; fine for resume/MVP.
- 2025-08-03 - Dev/Test Endpoints: /test-token and /ping are exposed unless DEV_MODE is false. Mitigated: you already have a DEV_MODE flag to disable dev/test endpoints. OK: just make sure not to push a production-facing repo with DEV_MODE=True.
- 2025-08-03 - Logging: All log calls go through unified_log (stderr + remote). The logger can leak sensitive info in DEBUG/WARN: the full client_secret is logged in the /token endpoint in DEBUG mode via unified_log("DEBUG", f"DEBUG: received client_secret: {client_secret!r}") and unified_log("DEBUG", f"DEBUG: expected client_secret: {CLIENT_SECRETS.get(client_id)!r}"). Recommendation: comment this out in prod. For demo/dev, leave it, but note it in the README as a security risk.
- 2025-08-03 - HTML Form Action: Your login.html form posts to /identity-backend/login, but your Flask code expects /login. Make sure these match or it won't work.
- 2025-08-03 - Using a stale load of the OneLogin page (for example, you load http://localhost:5000, walk away for an hour, and come back). If you try to log in, you will see this error: Missing or invalid state or code. Resolve by logging in again via http://localhost:5000/
- 2025-08-03 Latency
- 2025-08-04 - Latency improved by moving centralized logging calls out of logging_service.py (running on http://localhost:5050) and into logging-backend, hosted on aurorahours.com (cPanel at domainracer.com)
- 2025-08-04 - JWT signature issue? Verify that the secret key (default: "dev-client-secret") matches on both sides
- 2025-08-04 - Log endpoints /logs and /log have security disabled; good curl command examples are located there as well
- 2025-08-06 - Issue: Disk Usage Warning: The user “auroraho” (aurorahours.com) has near
Observability
- Extra Credit: You can generate and store a request_id (UUID) in the session or context and propagate it through the logs for E2E correlation.
- Token expired at 2025-08-03T06:32:34Z, issued at 2025-08-03T06:17:34Z, now=2025-08-03T06:32:36Z
- 2025-08-04 03:01:24 ERROR API Gateway - Failed to log to central: HTTPConnectionPool(host='localhost', port=5020): Max retries exceeded with url: /log (Caused by ConnectTimeoutError(<urllib3.connection.HTTPCon - This implies that logging_service.py has not been started
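The request_id "extra credit" above can be sketched with a ContextVar plus a logging filter. The names here (`request_id`, `RequestIdFilter`, the `echo` logger) are illustrative, not Echo's actual unified_log implementation:

```python
import logging
import uuid
from contextvars import ContextVar

# Set once at the edge (e.g. when the API Gateway receives a request), then
# every log record in that request's context carries the same correlation id.
request_id: ContextVar[str] = ContextVar("request_id", default="no-request-id")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        # Stamp the current request's id onto the record for the formatter.
        record.request_id = request_id.get()
        return True

def get_logger():
    logger = logging.getLogger("echo")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

def start_request():
    # Call at the start of each request; returns the id for propagation
    # (e.g. as an X-Request-ID header to downstream services).
    rid = str(uuid.uuid4())
    request_id.set(rid)
    return rid
```

Forwarding the same id in outbound calls to logging-backend would also fix the `session_id=no-session-id` gap noted in the issues list.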
Debt
- To expedite dev, I did all development locally inside a single folder, but the services need to be broken out into their own repositories
- Determining the future of our current echo/api-gateway.py "microservice"
Architecture
MVP - A narration on the baby steps that need to come together
High Level Design (HLD) Diagrams
Message Queue - High Level System Design Document
Flows
User Flow:
User’s browser is redirected to the Identity Service if not logged in.
After successful login, browser is redirected back to API Gateway with a code.
API Gateway exchanges the code for a JWT (id_token), stores it in session.
For each action, API Gateway checks the JWT for validity (signature, expiry, etc.).
If the token expires, user is prompted to log in again.
Service-to-Service Flow:
Worker service creates a signed JWT (with a shared secret).
Sends this JWT as a Bearer token in the Authorization header when calling Logging Service.
Logging Service validates the JWT (signature, issuer, audience, expiry).
If valid, processes the request; else, rejects it.
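The service-to-service flow above can be sketched with stdlib-only HS256 signing and verification. In the real services this is presumably done with a JWT library; the secret, issuer, and audience values here are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

SHARED_SECRET = b"dev-client-secret"  # illustrative; would be loaded from env

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_service_jwt(issuer, audience, ttl=300):
    # Worker side: mint a short-lived HS256 token.
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    now = int(time.time())
    claims = _b64url(json.dumps(
        {"iss": issuer, "aud": audience, "iat": now, "exp": now + ttl}).encode())
    signing_input = f"{header}.{claims}".encode()
    sig = _b64url(hmac.new(SHARED_SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{claims}.{sig}"

def verify_service_jwt(token, issuer, audience):
    # Logging-service side: check signature, issuer, audience, expiry.
    header, claims, sig = token.split(".")
    signing_input = f"{header}.{claims}".encode()
    expected = _b64url(hmac.new(SHARED_SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # bad signature -> reject with 401
    payload = json.loads(
        base64.urlsafe_b64decode(claims + "=" * (-len(claims) % 4)))
    if payload.get("iss") != issuer or payload.get("aud") != audience:
        return None  # e.g. the 401 {"error":"Invalid issuer"} from the issues list
    if payload.get("exp", 0) < time.time():
        return None  # expired; caller must re-mint
    return payload
```

The token goes out as `Authorization: Bearer <token>`; a mismatch in SHARED_SECRET between the two services is exactly the "JWT_SECRET_KEY do not match, logs stop writing" failure noted earlier.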
Sauce
The real magic and value in a RAG app lies in those pre-OpenAI steps:
- How you chunk your documents (size, overlap, semantic meaning)
- How you embed those chunks to capture their meaning accurately
- How you do the similarity search to retrieve the most relevant chunks for the user’s question
- How you construct the prompt to feed those chunks plus the user question into the model in a way that guides the LLM toward the best, most accurate answer
Getting those right makes your LLM calls precise, cost-efficient, and effective — otherwise, you might feed irrelevant or too much info, confusing the model or wasting tokens.
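As a concrete (hypothetical) starting point, the chunking step might look like the sketch below. Token counts are approximated by whitespace-split words here; a real pipeline would use a proper tokenizer, and the size/overlap numbers are tuning knobs, not recommendations:

```python
# Overlapping fixed-size chunker: consecutive chunks share `overlap` words so
# that a sentence split at a boundary still appears whole in one chunk.
def chunk_text(text, max_tokens=500, overlap=50):
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap  # how far the window advances each iteration
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covered the tail
    return chunks
```

Each returned chunk would then be embedded and stored alongside its vector, which is where the similarity search and prompt construction steps pick up.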
Microservices (Core)
Identity Service: Handles auth, JWT, user management.
Logging Service: Central log collection for all events/jobs.
Parser Service: Extracts text from files (PDF, DOCX, etc.), returns structured chunks.
Job Manager (new): Manages a job queue in SQLite. Each job = a row (status: queued, running, complete, failed).
Embedding Service: Picks up “parse complete” jobs, calls OpenAI (or other embedding), stores results in vector DB.
Query Service: Handles user queries, does vector search + OpenAI call for RAG.
Data Stores
SQLite per service (for MVP, switchable to Postgres later).
ChromaDB/FAISS for embeddings (optional, can store vectors in SQLite for MVP).
Business problems
First Principles
- Build boxes that take input and produce output, allowing testability of discrete portions of logic
- My goal is to blow past Okta/Auth0 in the next wave of IAM, similar to Stripe's disruption through focusing on developer experience. We can do the same for identity and authorization.
- Building microservices is akin to knowing when the hammer will fit the nail, and choosing the right tool for the job. Purposefully built to scale in a mega-app architecture, where apps built for the Business-Development product line will be able to rapidly build real scalable systems in a fraction of the time it would take with a monolith, or with many different monoliths solving the same problem repeatedly.
Features
- F1 - MVP - OneLogin/OIDC login form | In QA, 80% or so complete | Upload files, and query with logging
- F1.1 - MVP - Audit log with MS-SQL. Never mind, I think this is a bad idea, with no benefit.
- F2 - MVP - Observability - Logging microservice | https://aurorahours.com/logging-backend provides centralized logging for all microservices to report outwards
- F3 - MVP - AuthZ - microservice, priority #3
- F4 - MVP - User/system communication, SaaS readiness, p4
- F5 - MVP - Job/Task Queue -> Async, scale, reliability, background processing
- F6 - MVP - API GW - Routing, security, traffic control, service mesh
- MVP - Data abstraction layer, and migrate to MySQL or PostgreSQL if it's available on cPanel
- MVP - File tagging, query tagging
- MVP - Parsing & Chunking | https://saadazizai.blogspot.com/2025/08/the-sauce-chunking-embedding-similarity.html
Your current worker calls a parser service that returns plain text. To improve accuracy and enable RAG, you want to chunk the text into smaller pieces (e.g., 500 tokens max). Store each chunk and its embedding vector (from OpenAI or another embedding model) in a dedicated DB table (embeddings). This chunking + embedding is the bread-and-butter of RAG.
- MVP - Embedding & Vector Storage
Generate embeddings for each chunk. Store embeddings as vectors in a DB (the embeddings table already exists). This allows quick similarity searches (k-NN) for relevant chunks on query.
- MVP - Similarity Search for Query
When a user asks a question, embed the question. Search the DB for chunks with the most similar embeddings. Select the top N chunks (maybe 3-5) as context.
- MVP - Prompt Construction
Compose a prompt combining relevant chunks + the user query. Send the prompt to the LLM backend (OpenAI or Ollama).
- MVP - Query Result & Logging
Return the answer to the user. Log the query, chosen chunks, prompt, and answer.
- Non-MVP-Feature: previous conversational history in future queries
- Non-MVP-Feature: Add user session & history for conversation
- Non-MVP-Feature: Support file metadata (titles, dates, tags)
- Non-MVP-Feature: User accounts + access control
- Non-MVP-Feature: Rich file types (pdf, docx, etc)
- Non-MVP-Feature: Fine-tuning or prompt tuning
- Non-MVP-Feature: AuthZ service, and fine grained permissions
- Non-MVP-Feature: Optimize how auth(x) requests are handled
Auth Workflow
- User loads http://localhost:5000
- API Gateway checks:
- Is there a session with a valid JWT?
- If no: Redirects to /login.
- User lands on login page, or gets redirected to OneLogin (OIDC provider).
- User enters their username and password (OneLogin/OIDC login form). What happens next?
- Credentials are posted to the identity-backend.
- Identity-backend validates:
- If correct, issues a one-time “authorization code.”
- Redirects user (browser) back to API Gateway with ?code=....
- API Gateway exchanges the code for a JWT (token) by POSTing to the identity-backend /token endpoint.
- Receives JWT:
- Verifies claims, signature, and stores it in the session cookie.
- User is now authenticated; page loads.
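One piece of this flow worth sketching is the one-time state check, since a stale state is exactly what produces the "Missing or invalid state or code" error noted in the issues list. The dict store and TTL here are illustrative; a real gateway would presumably keep this in the server-side session:

```python
import secrets
import time

STATE_TTL = 600  # seconds; illustrative, not Echo's actual setting
_pending_states = {}  # state -> issued_at

def new_state():
    # Generated before redirecting the browser to the identity service;
    # sent along as the OAuth2 `state` parameter.
    state = secrets.token_urlsafe(16)
    _pending_states[state] = time.time()
    return state

def consume_state(state):
    # Called when the browser comes back with ?code=...&state=...
    # One-time use: pop() means a replayed or hour-old page fails here,
    # which is the "Missing or invalid state or code" error.
    issued = _pending_states.pop(state, None)
    if issued is None or time.time() - issued > STATE_TTL:
        return False
    return True
```

Only after the state check passes does the gateway exchange the code for the JWT at the /token endpoint.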
End to End Workflow
- User uploads file via API.
- API creates job: sets status queued, saves file, returns job ID.
- Worker polls for queued jobs: picks one, updates status to running, processes, then updates status to complete and saves output.
- User polls API for job status/results.
- All steps log events (received upload, job picked, processing started, finished, error, etc.) to Logging service.
- Query processed data via a REST endpoint that interacts with OpenAI (or another LLM), feeding in the retrieved chunks/embeddings.
Query Workflow
1. User Query Input - the user types a question
2. Retrieve Relevant Context - The system searches your own data (documents, notes, etc.) for the most relevant pieces.
Usually this is done by:
- Splitting your documents into smaller chunks.
- Creating embeddings (vector representations) of these chunks.
- Searching those vectors using similarity search (e.g., cosine similarity) based on the user’s query embedding.
This retrieval step outputs a handful of text chunks most related to the query.
3. Build the LLM Prompt
The system combines the retrieved chunks into a single context string.
It then prepends this context to a prompt template, like:
You are a helpful assistant. Use the following documents to answer the question.
DOCUMENTS:
<retrieved chunks here>
QUESTION:
<user query>
ANSWER:
4. Send to LLM API
This full prompt string (context + question) is sent as the input to the OpenAI API (e.g., in a chat.completions.create() call).
The LLM generates an answer based on the context you provided rather than just its internal knowledge.
Query-Workflow summary - What the system passes to OpenAI:
A single prompt string that includes:
- The most relevant document chunks retrieved from your own data (via vector search)
- The user’s question at the end
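Steps 2 through 4 above can be sketched end-to-end with toy vectors. This is a hypothetical illustration: real embeddings come from an embedding model, and the search from ChromaDB/FAISS or a SQL table; the prompt template matches the one shown above:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query_vec, chunk_vecs, texts, n=3):
    # Rank stored chunks by similarity to the query embedding, keep top n.
    ranked = sorted(zip(chunk_vecs, texts),
                    key=lambda cv: cosine(query_vec, cv[0]), reverse=True)
    return [text for _, text in ranked[:n]]

def build_prompt(chunks, question):
    # Combine retrieved context and the user question into one prompt string.
    docs = "\n".join(chunks)
    return ("You are a helpful assistant. Use the following documents to "
            f"answer the question.\n\nDOCUMENTS:\n{docs}\n\n"
            f"QUESTION:\n{question}\n\nANSWER:")
```

The string returned by `build_prompt` is what gets sent as the LLM input in step 4 (e.g. as the user message in a chat completion call).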
Todo-projects
Outcome: Design a solid chunking and embedding workflow | Read more
Set up efficient similarity search & retrieval
Write prompt templates that get the best out of OpenAI or other LLMs
Curl commands
To test from a command prompt with curl (defaults to the OpenAI API):
curl -X POST -H "Content-Type: application/json" -d "{\"question\": \"List all todo items\"}" http://localhost:5000/query
To test with Ollama:
curl -X POST -H "Content-Type: application/json" -d "{\"question\": \"List all todo items\", \"model\": \"ollama\"}" http://localhost:5000/query
Competitors
rlama was not inspiring and left a lot to be desired. I had a previous version of Echo, but I like this new version better: it was able to outperform rlama after just a few hours of development effort.
- Sadly, that is about how long the rlama install took to get working.
How to know more than the 95%
If you focus on the following topics, you will know more than 95% of devs who say “I know OAuth2”!
1. Web Authentication & Authorization (AuthN & AuthZ)
- How modern web apps keep users logged in (sessions vs. tokens)
- What JWTs are, and how to use them safely
- OAuth2 and OpenID Connect flows (esp. Authorization Code flow)
- Common vulnerabilities (token forgery, replay, open redirect, etc.)
2. Microservices & API Gateways
- Why and how to break apps into services
- How to route requests, authenticate users, and enforce security across services
- Service-to-service authentication (using JWTs, mTLS, API keys, etc.)
- Logging, monitoring, and observability in distributed systems
3. Secure Web App & API Development
- Managing secrets and environment variables
- Cookie/session security flags
- CSRF and XSS prevention
- Error handling and what not to expose
- Rate limiting and brute-force protection
4. Flask & Python Web Stack Mastery
- Flask application structure for prod
- Configuring Flask securely (secrets, cookies, env, error handling)
- Async vs. sync in Python web servers
- Gunicorn/uWSGI and reverse proxy deployment basics
5. OAuth2/OIDC: Deep Dives
- All OAuth2 grant types (Auth Code, Client Credentials, etc.)
- Refresh tokens and session management
- PKCE for public clients (mobile, SPA)
- Role-based access control (RBAC) with JWT claims
6. Modern Logging & Observability
- Centralized log collection (ELK stack, Loki, etc.)
- Audit trails and why logs are so important in security
- How to avoid logging secrets
7. Deployment & Cloud Considerations
- Serving Flask apps in production (Gunicorn, nginx, HTTPS)
- Running SQLite vs. Postgres/MySQL in prod
- Dockerization and container best practices
How to Deep Dive Next
- Google/YouTube: For each topic above, look for modern blog posts or video courses (there's a ton, especially from Auth0, Okta, Microsoft, and the Flask documentation).
- Practice:
  - Build and break small example apps for each auth flow.
  - Try changing JWT settings and see what breaks.
  - Add (and attack) your own endpoints to learn about vulnerabilities!
- Books:
  - “OAuth 2 in Action” (Manning)
  - “Web Security for Developers” (No Starch Press)
TL;DR – Your Learning Path
- Master web authentication (sessions vs. tokens)
- Get comfortable with OAuth2/OIDC and JWT
- Level up on microservices, API security, and Flask deployment
- Dive into modern web security: cookies, CSRF, logging, error handling
- Explore deployment and scaling for real-world production