Omar Baher
Software & Data Engineer · Systems Builder
I build cloud-native data platforms at enterprise scale and ship production software end to end. From medallion-architecture pipelines on Databricks to multi-tenant SaaS and AI agent systems - I engineer across the full stack because great systems don't respect layer boundaries.
About
Data platforms and production systems - end to end
I build enterprise analytics pipelines by day and ship production software by night. The common thread: systems thinking, clean architecture, and code that works at scale.
I'm a Senior Data Solutions Engineer at Manulife Investment Management, where I own the end-to-end modernization of a legacy on-prem analytics platform into a cloud-native architecture on Databricks, Azure Data Factory, ADLS with Delta Lake, and Azure SQL MI - processing tens of millions of records through medallion-architecture pipelines using PySpark and Pandas.
But I'm also a software engineer who builds real products. COI Vault is a multi-tenant SaaS platform with Stripe billing, audit trails, and server-side plan enforcement. History Tales is an AI agent pipeline with 18 LangGraph nodes and 41 deterministic validation gates that no hallucination can bypass. Both are deployed, tested, and running in production.
I think in systems - whether that means designing a dimensional model for an analytics warehouse or architecting a Server Actions layer with Zod validation and auth boundaries. The same engineering rigor applies: clean abstractions, strict type safety, exhaustive testing, and deployment pipelines from day one.
I hold a Bachelor of Mathematics in Computational Mathematics with a Statistics Minor from the University of Waterloo, and I also run ScrubHouse Inc., a services business I founded in 2024 alongside my full-time engineering role.
10M+
Records Modeled
Medallion-architecture pipelines at Manulife
~$30K
Annual Cost Reduction
Re-architected enterprise data quality platform
2
Production Systems
SaaS platform & AI agent pipeline shipped end to end
41
Validation Gates
Deterministic tests no LLM hallucination can bypass
Data & Cloud Platforms
Languages & Frameworks
Analytics & Modeling
Software Engineering
Experience
Where I've built
Enterprise data platforms, government ML systems, production software, and a business of my own.
Senior Data Solutions Engineer
Manulife Investment Management · Toronto, ON
- Own end-to-end modernization of a legacy on-prem analytics platform to a cloud-native architecture using Databricks, Azure Data Factory, ADLS with Delta Lake, Azure SQL MI, PySpark, and Pandas
- Built high-performance operational analytics product for senior leadership - modeling tens of millions of records across medallion architecture layers with partitioned semantic models and optimized refresh strategies
- Re-architected enterprise data quality platform, reducing infrastructure and operational costs by approximately $30K per year
Associate Data Scientist
Innovation, Science and Economic Development Canada · Remote
- Developed ML-based patent landscape maps that reduced operational costs by 36% through automated classification and clustering
- Applied NLP techniques including word embeddings and text vectorization for patent document analysis at scale
Data Science Developer Intern
Government of Ontario · Remote
- Built Power BI dashboard prototypes for internal stakeholder reporting and decision support
- Designed Python text-processing pipeline for document classification using logistic regression models
Bachelor of Mathematics - Computational Mathematics
Statistics Minor
University of Waterloo · 2022
President - ScrubHouse Inc.
Toronto, ON · May 2024 - Present
Founded and operate a services business alongside my full-time engineering role - managing operations, client acquisition, and growth strategy.
Featured Work
Systems I've shipped
Full-stack production software with real architecture decisions, real tradeoffs, and real users. I don't just design data platforms - I build products.
A multi-tenant SaaS platform that tracks vendor certificates of insurance, expirations, and compliance - with automated reminders and full audit trails. Built for property managers, condo boards, and general contractors.
Problem
Property managers and contractors manually track vendor COIs in spreadsheets. They miss expirations, exposing themselves to liability gaps. There's no purpose-built tool for this - just generic document managers that don't understand compliance workflows.
Architecture
Server-first Next.js application using the App Router pattern. Dashboard pages are Server Components that fetch data directly from Prisma - no client-side fetching, no loading spinners, no waterfall requests. All mutations go through Server Actions with Zod validation and auth checks at every boundary. Multi-tenant isolation is enforced at the data layer: every query filters by organization ID, and every server action verifies org membership before proceeding.
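The tenant-isolation rule - every query filters by organization ID, every action verifies membership first - is language-neutral. A minimal Python sketch of the pattern (the store, names, and IDs here are hypothetical stand-ins for the Prisma-backed database, not COI Vault's actual code):

```python
from dataclasses import dataclass

# Hypothetical in-memory store standing in for the real database.
VENDORS = [
    {"id": 1, "org_id": "org_a", "name": "Acme Roofing"},
    {"id": 2, "org_id": "org_b", "name": "Beta Plumbing"},
]
MEMBERSHIPS = {("user_1", "org_a"), ("user_2", "org_b")}

@dataclass
class Session:
    user_id: str
    org_id: str

def require_membership(session: Session) -> None:
    """Every server-side entry point verifies org membership first."""
    if (session.user_id, session.org_id) not in MEMBERSHIPS:
        raise PermissionError("user is not a member of this organization")

def list_vendors(session: Session) -> list[dict]:
    """Every query filters by the org ID taken from the trusted session,
    never by an ID supplied in the request body."""
    require_membership(session)
    return [v for v in VENDORS if v["org_id"] == session.org_id]
```

The key design choice is that the org ID comes from the authenticated session, so a client can never widen its own scope by editing a request payload.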
Stack
Key Technical Decisions
Server Components for dashboard pages
Dashboard data doesn't need interactivity on initial render. By keeping pages as Server Components, we eliminate client-side fetching entirely - the HTML arrives with data already rendered. This removes loading states, reduces JavaScript bundle size, and makes the app feel instant.
Server Actions for all mutations
Every write operation (create vendor, upload document, change plan) goes through a Server Action with Zod validation. This creates a single enforcement layer - auth checks, plan limits, input validation, and audit logging all happen in one place. No API routes to maintain separately.
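The "single enforcement layer" idea can be sketched in a few lines - one function that performs the auth check, input validation, plan-limit check, and audit log entry before the write. This is an illustrative Python sketch with made-up limits and field names, not the actual Server Action:

```python
AUDIT_LOG: list[str] = []
PLAN_LIMITS = {"free": 3, "pro": 100}  # hypothetical max vendors per plan

def create_vendor(org: dict, user_role: str, payload: dict) -> dict:
    """Single enforcement layer: auth, validation, plan limits, and
    audit logging all happen here, in order, before the write."""
    if user_role not in ("owner", "admin"):            # auth check
        raise PermissionError("insufficient role")
    name = payload.get("name", "").strip()             # input validation
    if not (1 <= len(name) <= 200):
        raise ValueError("vendor name must be 1-200 characters")
    if org["vendor_count"] >= PLAN_LIMITS[org["plan"]]:  # plan enforcement
        raise RuntimeError("plan limit reached")
    org["vendor_count"] += 1                           # the write itself
    AUDIT_LOG.append(f"{org['id']}: created vendor {name!r}")
    return {"org_id": org["id"], "name": name}
```

Because every mutation funnels through one function, there is no second code path where a check could be forgotten.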
JWT sessions over database sessions
For a B2B SaaS with moderate user counts, JWT sessions eliminate a database query on every request. The tradeoff is that session revocation requires token expiry rather than instant invalidation - acceptable for this use case.
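The tradeoff is easiest to see in code: verification is a pure HMAC check plus an expiry comparison, with no database round trip. A minimal HS256 sketch using only the standard library (the secret and TTL are illustrative; real code loads the secret from the environment and uses a vetted JWT library):

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # illustration only; never hard-code in real code

def _b64(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign_session(user_id: str, ttl_seconds: int = 3600) -> str:
    """Issue a signed, self-contained session token."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "exp": int(time.time()) + ttl_seconds}
    payload = _b64(json.dumps(claims).encode())
    signing_input = header + b"." + payload
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + sig).decode()

def verify_session(token: str) -> dict:
    """Stateless verification: an HMAC check and an expiry comparison,
    no database lookup. Revocation therefore waits for `exp`."""
    header, payload, sig = token.encode().split(b".")
    expected = _b64(hmac.new(SECRET, header + b"." + payload,
                             hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    padded = payload + b"=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

The cost of that stateless verification is exactly the tradeoff named above: a compromised token stays valid until `exp`, since there is no server-side record to delete.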
A production-ready LangGraph agent that autonomously generates high-retention, evidence-led history documentary scripts. 18-node pipeline with dual-model architecture, deterministic validation gates, cross-run learning, and a full web interface.
Problem
Creating a high-quality documentary script requires weeks of research, fact-checking, narrative structuring, and multiple editing passes. Content creators need to cross-reference primary sources, maintain factual accuracy, engineer viewer retention, and hit precise timing targets - all while writing in a compelling cinematic style.
Architecture
An 18-node LangGraph workflow that separates concerns into distinct processing stages: research, analysis, writing, and quality assurance. Uses a dual-model architecture - creative tier (GPT-5) for writing nodes, fast tier (GPT-5.2) for analytical nodes. Deterministic validation gates between stages enforce structural constraints that no LLM hallucination can bypass. A feedback memory system learns from past runs and injects lessons into future prompts.
Stack
Key Technical Decisions
Dual-model architecture
Writing quality and analytical speed have different requirements. Creative nodes (Outline, ScriptGeneration, RetentionPass) use a high-quality model for nuanced prose. Analytical nodes (scoring, extraction, QC) use a faster model for structured JSON output. This cuts cost and latency by 60% without sacrificing script quality.
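The routing itself is deliberately boring - a static node-to-tier table, so model selection is auditable rather than decided at runtime by a model. A toy sketch (tier names are placeholders, not the pipeline's actual identifiers):

```python
# Static routing table: creative nodes named in the text above go to the
# high-quality tier; everything else defaults to the fast tier.
CREATIVE_NODES = {"Outline", "ScriptGeneration", "RetentionPass"}

def model_for(node: str) -> str:
    """Route a pipeline node to its model tier deterministically."""
    return "creative-tier" if node in CREATIVE_NODES else "fast-tier"
```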
Deterministic validation gates
LLMs can hallucinate structural compliance. The HardGuardrailsNode and FactTightenNode use deterministic Python validators - word count, entity provenance, tension escalation, rehook cadence - that cannot be bypassed by model output. If validation fails, the pipeline loops back with specific feedback.
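What makes a gate "deterministic" is that it is plain string and set arithmetic - no model in the loop. A simplified sketch of the shape such a validator takes (the checks and claim-ID format here are illustrative, not the pipeline's actual rules):

```python
import re

def validate_section(text: str, min_words: int, max_words: int,
                     known_claims: set[str]) -> list[str]:
    """Deterministic gate: pure string checks the model cannot talk its
    way past. Returns a list of failures; an empty list means pass."""
    failures = []
    n_words = len(text.split())
    if not (min_words <= n_words <= max_words):
        failures.append(
            f"word count {n_words} outside [{min_words}, {max_words}]")
    # Entity provenance: every cited claim ID must exist in the evidence set.
    for claim in re.findall(r"\bC\d{3}\b", text):
        if claim not in known_claims:
            failures.append(f"claim {claim} has no matching evidence")
    return failures
```

Returning the concrete failure list, rather than a boolean, is what enables the loop-back described above: the pipeline feeds those messages straight into the retry prompt.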
Two-stage script generation (Draft → Fact-Tighten)
Stage A writes the creative draft. Stage B rewrites with per-paragraph trace tags ([Beat B03 | Claims C001,C012]) that create an auditable link between every statement and its source evidence. Tags are stripped from the final script but available for verification.
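The tag format shown above lends itself to two trivial deterministic operations: building the beat-to-claims audit map, and stripping tags for the final script. A sketch, assuming tags follow exactly the `[Beat B03 | Claims C001,C012]` shape:

```python
import re

# Matches tags like "[Beat B03 | Claims C001,C012]" plus trailing space.
TAG_RE = re.compile(r"\[Beat (B\d+) \| Claims ([C\d,]+)\]\s*")

def extract_provenance(draft: str) -> dict[str, list[str]]:
    """Build the beat -> claim-IDs audit map from the trace tags."""
    return {beat: claims.split(",") for beat, claims in TAG_RE.findall(draft)}

def strip_tags(draft: str) -> str:
    """Remove trace tags to produce the final, reader-facing script."""
    return TAG_RE.sub("", draft)
```

Because both operations run over the same tag grammar, the audit map and the published script can never drift apart.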
Engineering Philosophy
How I build
Principles forged from enterprise data platforms, production SaaS, and AI agent systems - the same engineering rigor, regardless of the layer.
Medallion Architecture
Bronze, silver, gold - every pipeline I build follows a clear layered progression. Raw ingestion lands untouched, transformations are explicit and auditable, and the gold layer serves a single purpose: trusted data ready for consumption. No shortcuts between layers.
Server-Side Enforcement
The UI suggests, the server enforces. Plan limits, auth boundaries, input validation, and business rules all live in Server Actions or API handlers - never in client code. A beautiful button means nothing if the mutation behind it doesn't check permissions.
Clean Architecture
Clear boundaries between layers - whether it's bronze → silver → gold in Delta Lake, staging → intermediate → marts in dbt, or data fetching → business logic → presentation in application code. Separation makes systems debuggable at 2 AM.
Type Safety End to End
From database schema to API response to UI component - types should flow without breaks. Prisma generates them from the schema, Zod validates at boundaries, and TypeScript catches the rest at compile time. Runtime surprises are a design failure.
Deterministic Over Probabilistic
AI pipelines need guardrails that LLMs can't hallucinate past. Structural validation, schema conformance checks, and rule-based gates ensure that every output is verified before it moves downstream - 41 validators in History Tales exist for exactly this reason.
Production-First Mentality
Error handling, audit logging, data freshness checks, input validation, and CI/CD ship with v1 - not as tech debt. Whether it's a data pipeline at 6 AM or a SaaS endpoint under load, production doesn't wait for your refactor sprint.
Writing & Thinking
Technical writing
Long-form thinking on architecture, systems design, and engineering decisions. Coming soon.
Modernizing Legacy Analytics: On-Prem to Databricks + Delta Lake
Lessons from leading a full platform migration - replacing brittle on-prem pipelines with Databricks, ADLS Gen2, Delta Lake, and Azure Data Factory, while keeping the business running.
Server Actions as the Enforcement Layer: Why the UI Suggests but the Server Decides
On building multi-tenant SaaS where plan limits, auth checks, and input validation all live in Server Actions - and why moving enforcement to the server simplified everything.
Medallion Architecture in Practice: Modeling Tens of Millions of Records
How I structured bronze, silver, and gold layers with PySpark and Delta Lake for an operational analytics product - partitioned semantic models, incremental loads, and optimized refresh strategies.
Deterministic Validation Gates for AI Agent Pipelines
LLMs can produce structurally invalid output that looks correct. Here's how I built 41 validation gates with pydantic, Zod, and pytest that no hallucination can bypass.
Building a Multi-Tenant SaaS with Next.js, Prisma, and Stripe
A technical walkthrough of COI Vault - org-scoped data isolation, JWT sessions, Stripe webhook lifecycle, and soft deletes with audit trails for compliance-grade B2B software.
Contact
Let's connect
I'm always open to discussing systems architecture, data platform design, software engineering, and interesting technical challenges.