RAG
Echo
AI-Powered Search & Reasoning on Your Data — On Your Terms
Echo is a modular Retrieval-Augmented Generation (RAG) platform that lets users upload documents, query them using AI (OpenAI or local LLM), and see results—all behind a secure authentication and logging system. It's a foundation for AI-powered enterprise search, document Q&A, and internal knowledge bases, designed to be privacy-conscious and extendable.
Process
Source code | Local host setup:
https://github.com/saadaziz/echo-private | C:\Users\saad0\Documents\source\echo
https://github.com/saadaziz/echo (public repo) | C:\Users\saad0\Documents\source\echo-public
https://github.com/saadaziz/identity-backend
Get started - guide
DevOps
Incident Management
Knowledge management
- Roadmap (Business-facing, executive, “what we are building, when, and why.”)
- Engineering journal - Everything else
- Remove secrets
- Production checklist
- Centralize logging on aurorahours, integrate with identity, and allow local to log there as well
- Centralize remaining services, and move out of mono repo (echo)
- CRUD, one-to-many tags with a file (MVP; other aspects later)
- These tags can be used to optimize the query
- Add issue: when the JWT_SECRET_KEY values in logging-backend, identity-backend, and the cPanel environment variables do not match, logs stop writing
- Poor man's message-queue
- Poor man's API rate limiting
- Poor man's Service-to-Service Authentication and Authorization
- Support a poor man's design: Have multiple “subscribers” pull from your message-queue, process, ack/fail, dead-letter. (You don’t need Kafka for <1000 msg/sec or MVP scale.)
- Remaining poor man's items
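The poor man's message-queue above (multiple subscribers pull, process, ack/fail, dead-letter) can be sketched in a few lines of SQLite. This is a hypothetical illustration, not Echo's actual schema: the `jobs` table, its columns, and `MAX_ATTEMPTS` are all assumptions.

```python
import sqlite3
import time

# A "poor man's message queue" on SQLite: queued -> running -> complete/dead,
# with retry and a dead-letter status. Table and column names are hypothetical.
MAX_ATTEMPTS = 3

def connect(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS jobs (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'queued',  -- queued|running|complete|dead
        attempts INTEGER NOT NULL DEFAULT 0,
        updated_at REAL NOT NULL)""")
    return db

def enqueue(db, payload):
    cur = db.execute("INSERT INTO jobs (payload, updated_at) VALUES (?, ?)",
                     (payload, time.time()))
    db.commit()
    return cur.lastrowid

def claim(db):
    # Claim the oldest queued job. The conditional UPDATE means that if two
    # subscribers race for the same row, only one of them wins it.
    row = db.execute(
        "SELECT id FROM jobs WHERE status='queued' ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    won = db.execute(
        "UPDATE jobs SET status='running', attempts=attempts+1, updated_at=? "
        "WHERE id=? AND status='queued'", (time.time(), row[0])).rowcount
    db.commit()
    return row[0] if won else None

def ack(db, job_id):
    db.execute("UPDATE jobs SET status='complete', updated_at=? WHERE id=?",
               (time.time(), job_id))
    db.commit()

def fail(db, job_id):
    # Re-queue on failure until MAX_ATTEMPTS, then dead-letter the job.
    (attempts,) = db.execute(
        "SELECT attempts FROM jobs WHERE id=?", (job_id,)).fetchone()
    status = "dead" if attempts >= MAX_ATTEMPTS else "queued"
    db.execute("UPDATE jobs SET status=?, updated_at=? WHERE id=?",
               (status, time.time(), job_id))
    db.commit()
```

At MVP scale (<1000 msg/sec, per the note above) this is plenty; the one real constraint is SQLite's single-writer model, which is fine for a handful of subscribers.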
Release 1.0 Goals
Core Goals
- Single Sign-On & Identity: Centralized login system using OAuth2/OIDC, acting as an "Auth0/Okta for microservices" for your ecosystem.
- Job/Document Ingestion: Upload files, queue them for processing, and parse their content.
- RAG Query Interface: Users can ask questions; responses are generated from their own document corpus (via OpenAI or a local LLM like Ollama).
- Audit & Observability: Every major action is logged to a central log service for transparency, debugging, and compliance.
- MVP Simplicity: Designed for clarity and fast learning, even at the cost of some performance.
Release 1.0 QA Runs
8/1/2025 - 11:45 AM | Notes
8/3/2025 - 1:30 AM | Prod ready checklist, Work log | 11:30 AM
8/3/2025 - 3:15 PM | git Labels doc
Issues
- 2025-08-02 - Secrets: app.secret_key, dev-secret
- 2025-08-02 - Issue: Investigate, is this zero trust, and is every request going through one login service?
- Lack of understanding: OIDC Flow: "Authorization Code Flow" (with PKCE optional for public clients)
- Not sure, but do I need PKCE?
- 2025-08-02 - Issue: [WARN] Failed to log to logging-backend: 401 {"error":"Invalid issuer"}
- 2025-08-02 - Issue: Decodes a JWT without verifying the signature (not for production, but okay for local dev/test).
- 2025-08-02 - Issue: Logs show session_id=no-session-id.
- 2025-08-02 - Issue: so slow!
- 2025-08-03 - Secrets: safeguards break the OIDC/OAuth2 flow
- 2025-08-03 - Session: Used for login/authorization state (with warnings about this in the comments). OK for MVP. In production, you'd want stronger session security, CSRF protection, and secure cookie settings. The session secret (FLASK_SECRET_KEY) is loaded from env.
- 2025-08-03 - Database: SQLite is used for storing authorization codes; the table is created if not present. Auth codes are one-time use and deleted after exchange (good!). No attempt at code expiry/cleanup, but not a dealbreaker for a demo.
- 2025-08-03 - OAuth Logic: Checks for valid client_id, client_secret, and redirect_uri. Returns a JWT with standard OpenID fields. The JWT secret is loaded from env; uses HS256. The demo supports only one client, one user; fine for resume/MVP.
- 2025-08-03 - Dev/Test Endpoints: /test-token and /ping are exposed unless DEV_MODE is false. Mitigated: you already have a DEV_MODE flag to disable dev/test endpoints. OK: just make sure not to push a production-facing repo with DEV_MODE=True.
- 2025-08-03 - Logging: All log calls go through unified_log (stderr + remote). The logger can leak sensitive info in DEBUG/WARN: the full client_secret is logged in the /token endpoint in DEBUG mode via unified_log("DEBUG", f"DEBUG: received client_secret: {client_secret!r}") and unified_log("DEBUG", f"DEBUG: expected client_secret: {CLIENT_SECRETS.get(client_id)!r}"). Recommendation: comment this out in prod. For demo/dev, leave it, but note it in the README as a security risk.
- 2025-08-03 - HTML Form Action: Your login.html form posts to /identity-backend/login, but your Flask code expects /login. Make sure these match or it won't work.
- 2025-08-03 - Using a stale load of the OneLogin page (for example, you load http://localhost:5000, walk away for an hour, and come back). If you try to log in, you will see this error: Missing or invalid state or code. Resolve by logging in again via http://localhost:5000/
- 2025-08-03 Latency
- 2025-08-04 - Latency improved by moving centralized logging calls out of logging_service.py (running on http://localhost:5050) and into logging-backend, hosted on aurorahours.com (cPanel at domainracer.com)
- 2025-08-04 - JWT signature issue? Verify that the secret key (default: "dev-client-secret") matches on both sides
- 2025-08-04 - Log endpoints /logs and /log have security disabled; good curl command examples are located there as well
- 2025-08-06 - Issue: Disk Usage Warning: The user “auroraho” (aurorahours.com) has near
Observability
- Extra Credit: You can generate and store a request_id (UUID) in the session or context and propagate it through the logs for E2E correlation.
- Token expired at 2025-08-03T06:32:34Z, issued at 2025-08-03T06:17:34Z, now=2025-08-03T06:32:36Z
- 2025-08-04 03:01:24 ERROR API Gateway - Failed to log to central: HTTPConnectionPool(host='localhost', port=5020): Max retries exceeded with url: /log (Caused by ConnectTimeoutError(<urllib3.connection.HTTPCon - This implies that logging_service.py has not been started
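The request_id "extra credit" above can be sketched with a ContextVar plus a logging filter. The names here (`request_id`, `RequestIdFilter`, the `echo` logger) are illustrative, not Echo's actual unified_log implementation:

```python
import logging
import uuid
from contextvars import ContextVar

# Set once at the edge (e.g. when the API Gateway receives a request), then
# every log record in that request's context carries the same correlation id.
request_id: ContextVar[str] = ContextVar("request_id", default="no-request-id")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        # Stamp the current request's id onto the record for the formatter.
        record.request_id = request_id.get()
        return True

def get_logger():
    logger = logging.getLogger("echo")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

def start_request():
    # Call at the start of each request; returns the id for propagation
    # (e.g. as an X-Request-ID header to downstream services).
    rid = str(uuid.uuid4())
    request_id.set(rid)
    return rid
```

Forwarding the same id in outbound calls to logging-backend would also fix the `session_id=no-session-id` gap noted in the issues list.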
Debt
- To expedite dev, I did all development locally inside a single folder, but the services need to be broken out into their own repositories
- Determining the future of our current echo/api-gateway.py "microservice"
Architecture
MVP - A narration on the baby steps that need to come together
High Level Design (HLD) Diagrams
Message Queue - High Level System Design Document
Flows
User Flow:
User’s browser is redirected to the Identity Service if not logged in.
After successful login, browser is redirected back to API Gateway with a code.
API Gateway exchanges the code for a JWT (id_token), stores it in session.
For each action, API Gateway checks the JWT for validity (signature, expiry, etc.).
If the token expires, user is prompted to log in again.
Service-to-Service Flow:
Worker service creates a signed JWT (with a shared secret).
Sends this JWT as a Bearer token in the Authorization header when calling Logging Service.
Logging Service validates the JWT (signature, issuer, audience, expiry).
If valid, processes the request; else, rejects it.
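The service-to-service flow above can be sketched with stdlib-only HS256 signing and verification. In the real services this is presumably done with a JWT library; the secret, issuer, and audience values here are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

SHARED_SECRET = b"dev-client-secret"  # illustrative; would be loaded from env

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_service_jwt(issuer, audience, ttl=300):
    # Worker side: mint a short-lived HS256 token.
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    now = int(time.time())
    claims = _b64url(json.dumps(
        {"iss": issuer, "aud": audience, "iat": now, "exp": now + ttl}).encode())
    signing_input = f"{header}.{claims}".encode()
    sig = _b64url(hmac.new(SHARED_SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{claims}.{sig}"

def verify_service_jwt(token, issuer, audience):
    # Logging-service side: check signature, issuer, audience, expiry.
    header, claims, sig = token.split(".")
    signing_input = f"{header}.{claims}".encode()
    expected = _b64url(hmac.new(SHARED_SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # bad signature -> reject with 401
    payload = json.loads(
        base64.urlsafe_b64decode(claims + "=" * (-len(claims) % 4)))
    if payload.get("iss") != issuer or payload.get("aud") != audience:
        return None  # e.g. the 401 {"error":"Invalid issuer"} from the issues list
    if payload.get("exp", 0) < time.time():
        return None  # expired; caller must re-mint
    return payload
```

The token goes out as `Authorization: Bearer <token>`; a mismatch in SHARED_SECRET between the two services is exactly the "JWT_SECRET_KEY do not match, logs stop writing" failure noted earlier.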
Sauce
The real magic and value in a RAG app lies in those pre-OpenAI steps:
- How you chunk your documents (size, overlap, semantic meaning)
- How you embed those chunks to capture their meaning accurately
- How you do the similarity search to retrieve the most relevant chunks for the user’s question
- How you construct the prompt to feed those chunks plus the user question into the model in a way that guides the LLM toward the best, most accurate answer
Getting those right makes your LLM calls precise, cost-efficient, and effective — otherwise, you might feed irrelevant or too much info, confusing the model or wasting tokens.
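As a concrete (hypothetical) starting point, the chunking step might look like the sketch below. Token counts are approximated by whitespace-split words here; a real pipeline would use a proper tokenizer, and the size/overlap numbers are tuning knobs, not recommendations:

```python
# Overlapping fixed-size chunker: consecutive chunks share `overlap` words so
# that a sentence split at a boundary still appears whole in one chunk.
def chunk_text(text, max_tokens=500, overlap=50):
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap  # how far the window advances each iteration
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already covered the tail
    return chunks
```

Each returned chunk would then be embedded and stored alongside its vector, which is where the similarity search and prompt construction steps pick up.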
Microservices (Core)
Identity Service: Handles auth, JWT, user management.
Logging Service: Central log collection for all events/jobs.
Parser Service: Extracts text from files (PDF, DOCX, etc.), returns structured chunks.
Job Manager (new): Manages a job queue in SQLite. Each job = a row (status: queued, running, complete, failed).
Embedding Service: Picks up “parse complete” jobs, calls OpenAI (or other embedding), stores results in vector DB.
Query Service: Handles user queries, does vector search + OpenAI call for RAG.
Data Stores
SQLite per service (for MVP, switchable to Postgres later).
ChromaDB/FAISS for embeddings (optional, can store vectors in SQLite for MVP).
Business problems
First Principles
- Build boxes that take input and produce output, allowing testability of discrete portions of logic
- My goal is to blow past Okta/Auth0 in the next wave of IAM, similar to Stripe's disruption through focusing on developer experience. We can do the same for identity and authorization.
- Building microservices is akin to knowing when the hammer will fit the nail, and choosing the right tool for the job. Purposefully built to scale in a mega-app architecture, where apps built for the Business-Development product line will be able to rapidly build real scalable systems in a fraction of the time it would take with a monolith, or with many different monoliths solving the same problem repeatedly.
Features
- F1 - MVP - OneLogin/OIDC login form | In QA, 80% or so complete | Upload files, and query with logging
- F1.1 - MVP - Audit log with MS-SQL. Never mind, I think this is a bad idea, with no benefit.
- F2 - MVP - Observability - Logging microservice | https://aurorahours.com/logging-backend provides centralized logging for all microservices to report outwards
- F3 - MVP - AuthZ - microservice, priority #3
- F4 - MVP - User/system communication, SaaS readiness, p4
- F5 - MVP - Job/Task Queue -> Async, scale, reliability, background processing
- F6 - MVP - API GW - Routing, security, traffic control, service mesh
- MVP - Data abstraction layer, and migrate to MySQL or PostgreSQL if it's available on cPanel
- MVP - File tagging, query tagging
- MVP - Parsing & Chunking | https://saadazizai.blogspot.com/2025/08/the-sauce-chunking-embedding-similarity.html
Your current worker calls a parser service that returns plain text. To improve accuracy and enable RAG, you want to chunk the text into smaller pieces (e.g., 500 tokens max). Store each chunk and its embedding vector (from OpenAI or another embedding model) in a dedicated DB table (embeddings). This chunking + embedding is the bread-and-butter of RAG.
- MVP - Embedding & Vector Storage
Generate embeddings for each chunk. Store embeddings as vectors in a DB (the embeddings table already exists). This allows quick similarity searches (k-NN) for relevant chunks on query.
- MVP - Similarity Search for Query
When a user asks a question, embed the question. Search the DB for chunks with the most similar embeddings. Select the top N chunks (maybe 3-5) as context.
- MVP - Prompt Construction
Compose a prompt combining relevant chunks + the user query. Send the prompt to the LLM backend (OpenAI or Ollama).
- MVP - Query Result & Logging
Return the answer to the user. Log the query, chosen chunks, prompt, and answer.
- Non-MVP-Feature: previous conversational history in future queries
- Non-MVP-Feature: Add user session & history for conversation
- Non-MVP-Feature: Support file metadata (titles, dates, tags)
- Non-MVP-Feature: User accounts + access control
- Non-MVP-Feature: Rich file types (pdf, docx, etc)
- Non-MVP-Feature: Fine-tuning or prompt tuning
- Non-MVP-Feature: AuthZ service, and fine grained permissions
- Non-MVP-Feature: Optimize how auth(x) requests are handled
Auth Workflow
- User loads http://localhost:5000
- API Gateway checks:
- Is there a session with a valid JWT?
- If no: Redirects to /login.
- User lands on login page, or gets redirected to OneLogin (OIDC provider).
- User enters their username and password (OneLogin/OIDC login form). What happens next?
- Credentials are posted to the identity-backend.
- Identity-backend validates:
- If correct, issues a one-time “authorization code.”
- Redirects user (browser) back to API Gateway with ?code=....
- API Gateway exchanges the code for a JWT (token) by POSTing to the identity-backend /token endpoint.
- Receives JWT:
- Verifies claims, signature, and stores it in the session cookie.
- User is now authenticated; page loads.
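One piece of this flow worth sketching is the one-time state check, since a stale state is exactly what produces the "Missing or invalid state or code" error noted in the issues list. The dict store and TTL here are illustrative; a real gateway would presumably keep this in the server-side session:

```python
import secrets
import time

STATE_TTL = 600  # seconds; illustrative, not Echo's actual setting
_pending_states = {}  # state -> issued_at

def new_state():
    # Generated before redirecting the browser to the identity service;
    # sent along as the OAuth2 `state` parameter.
    state = secrets.token_urlsafe(16)
    _pending_states[state] = time.time()
    return state

def consume_state(state):
    # Called when the browser comes back with ?code=...&state=...
    # One-time use: pop() means a replayed or hour-old page fails here,
    # which is the "Missing or invalid state or code" error.
    issued = _pending_states.pop(state, None)
    if issued is None or time.time() - issued > STATE_TTL:
        return False
    return True
```

Only after the state check passes does the gateway exchange the code for the JWT at the /token endpoint.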
End to End Workflow
- User uploads file via API.
- API creates job: sets status queued, saves file, returns job ID.
- Worker polls for queued jobs: picks one, updates status to running, processes, then updates status to complete and saves output.
- User polls API for job status/results.
- All steps log events (received upload, job picked, processing started, finished, error, etc.) to Logging service.
- Query processed data via a REST endpoint that interacts with OpenAI (or another LLM), feeding in the retrieved chunks/embeddings.
Query Workflow
1. User Query Input - the user types a question
2. Retrieve Relevant Context - The system searches your own data (documents, notes, etc.) for the most relevant pieces.
Usually this is done by:
- Splitting your documents into smaller chunks.
- Creating embeddings (vector representations) of these chunks.
- Searching those vectors using similarity search (e.g., cosine similarity) based on the user’s query embedding.
This retrieval step outputs a handful of text chunks most related to the query.
3. Build the LLM Prompt
The system combines the retrieved chunks into a single context string.
It then prepends this context to a prompt template, like:
You are a helpful assistant. Use the following documents to answer the question.
DOCUMENTS:
<retrieved chunks here>
QUESTION:
<user query>
ANSWER:
4. Send to LLM API
This full prompt string (context + question) is sent as the input to the OpenAI API (e.g., in a chat.completions.create() call).
The LLM generates an answer based on the context you provided rather than just its internal knowledge.
Query-Workflow summary - What the system passes to OpenAI:
A single prompt string that includes:
- The most relevant document chunks retrieved from your own data (via vector search)
- The user’s question at the end
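Steps 2 through 4 above can be sketched end-to-end with toy vectors. This is a hypothetical illustration: real embeddings come from an embedding model, and the search from ChromaDB/FAISS or a SQL table; the prompt template matches the one shown above:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query_vec, chunk_vecs, texts, n=3):
    # Rank stored chunks by similarity to the query embedding, keep top n.
    ranked = sorted(zip(chunk_vecs, texts),
                    key=lambda cv: cosine(query_vec, cv[0]), reverse=True)
    return [text for _, text in ranked[:n]]

def build_prompt(chunks, question):
    # Combine retrieved context and the user question into one prompt string.
    docs = "\n".join(chunks)
    return ("You are a helpful assistant. Use the following documents to "
            f"answer the question.\n\nDOCUMENTS:\n{docs}\n\n"
            f"QUESTION:\n{question}\n\nANSWER:")
```

The string returned by `build_prompt` is what gets sent as the LLM input in step 4 (e.g. as the user message in a chat completion call).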
Todo-projects
Outcome: Design a solid chunking and embedding workflow | Read more
Set up efficient similarity search & retrieval
Write prompt templates that get the best out of OpenAI or other LLMs
Curl commands
To test from a command prompt with curl (defaults to the OpenAI API):
curl -X POST -H "Content-Type: application/json" -d "{\"question\": \"List all todo items\"}" http://localhost:5000/query
To test with Ollama:
curl -X POST -H "Content-Type: application/json" -d "{\"question\": \"List all todo items\", \"model\": \"ollama\"}" http://localhost:5000/query
Competitors
rlama was not inspiring and left a lot to be desired. I had a previous version of Echo, but I like this new version better: it was able to outperform rlama after just a few hours of development effort.
- Sadly, that is about how long the rlama install took to get working.
How to know more than the 95%
If you focus on the following topics, you will know more than 95% of devs who say “I know OAuth2”!
1. Web Authentication & Authorization (AuthN & AuthZ)
- How modern web apps keep users logged in (sessions vs. tokens)
- What JWTs are, and how to use them safely
- OAuth2 and OpenID Connect flows (esp. Authorization Code flow)
- Common vulnerabilities (token forgery, replay, open redirect, etc.)
2. Microservices & API Gateways
- Why and how to break apps into services
- How to route requests, authenticate users, and enforce security across services
- Service-to-service authentication (using JWTs, mTLS, API keys, etc.)
- Logging, monitoring, and observability in distributed systems
3. Secure Web App & API Development
- Managing secrets and environment variables
- Cookie/session security flags
- CSRF and XSS prevention
- Error handling and what not to expose
- Rate limiting and brute-force protection
4. Flask & Python Web Stack Mastery
- Flask application structure for prod
- Configuring Flask securely (secrets, cookies, env, error handling)
- Async vs. sync in Python web servers
- Gunicorn/uWSGI and reverse proxy deployment basics
5. OAuth2/OIDC: Deep Dives
- All OAuth2 grant types (Auth Code, Client Credentials, etc.)
- Refresh tokens and session management
- PKCE for public clients (mobile, SPA)
- Role-based access control (RBAC) with JWT claims
6. Modern Logging & Observability
- Centralized log collection (ELK stack, Loki, etc.)
- Audit trails and why logs are so important in security
- How to avoid logging secrets
7. Deployment & Cloud Considerations
- Serving Flask apps in production (Gunicorn, nginx, HTTPS)
- Running SQLite vs. Postgres/MySQL in prod
- Dockerization and container best practices
How to Deep Dive Next
- Google/YouTube: For each topic above, look for modern blog posts or video courses (there's a ton, especially from Auth0, Okta, Microsoft, and the Flask documentation).
- Practice:
  - Build and break small example apps for each auth flow.
  - Try changing JWT settings and see what breaks.
  - Add (and attack) your own endpoints to learn about vulnerabilities!
- Books:
  - “OAuth 2 in Action” (Manning)
  - “Web Security for Developers” (No Starch Press)
TL;DR – Your Learning Path
- Master web authentication (sessions vs. tokens)
- Get comfortable with OAuth2/OIDC and JWT
- Level up on microservices, API security, and Flask deployment
- Dive into modern web security: cookies, CSRF, logging, error handling
- Explore deployment and scaling for real-world production