Senior Data Solutions Engineer at Manulife

Omar Baher

Software & Data Engineer · Systems Builder

I build cloud-native data platforms at enterprise scale and ship production software end to end. From medallion-architecture pipelines on Databricks to multi-tenant SaaS and AI agent systems, I engineer across the full stack because great systems don't respect layer boundaries.

About

Data platforms and production systems - end to end

I build enterprise analytics pipelines by day and ship production software by night. The common thread: systems thinking, clean architecture, and code that works at scale.

I'm a Senior Data Solutions Engineer at Manulife Investment Management, where I own the end-to-end modernization of a legacy on-prem analytics platform into a cloud-native architecture on Databricks, Azure Data Factory, ADLS with Delta Lake, and Azure SQL MI - processing tens of millions of records through medallion-architecture pipelines using PySpark and Pandas.

But I'm also a software engineer who builds real products. COI Vault is a multi-tenant SaaS platform with Stripe billing, audit trails, and server-side plan enforcement. History Tales is an AI agent pipeline with 18 LangGraph nodes and 41 deterministic validation gates that no hallucination can bypass. Both are deployed, tested, and running in production.

I think in systems - whether that means designing a dimensional model for an analytics warehouse or architecting a Server Actions layer with Zod validation and auth boundaries. The same engineering rigor applies: clean abstractions, strict type safety, exhaustive testing, and deployment pipelines from day one.

I hold a Bachelor of Mathematics in Computational Mathematics with a Statistics Minor from the University of Waterloo, and I also run ScrubHouse Inc., a services business I founded in 2024 alongside my full-time engineering role.

10M+

Records Modeled

Medallion-architecture pipelines at Manulife

~$30K

Annual Cost Reduction

Re-architected enterprise data quality platform

2

Production Systems

SaaS platform & AI agent pipeline shipped end to end

41

Validation Gates

Deterministic tests no LLM hallucination can bypass

Data & Cloud Platforms

Azure Databricks · Azure Data Factory · Azure Synapse Analytics · ADLS Gen2 · Azure SQL MI · Delta Lake · Unity Catalog · Apache Spark

Languages & Frameworks

Python · PySpark · TypeScript · SQL (T-SQL · Spark SQL) · Scala · Next.js (App Router) · FastAPI · Node.js

Analytics & Modeling

dbt (Core & Cloud) · Power BI · Tabular Editor · Medallion Architecture · Dimensional Modeling · Semantic Models · DAX

Software Engineering

System Design · Multi-Tenant SaaS · Server Actions / API Design · Prisma / PostgreSQL · Stripe Billing · AI Agent Pipelines (LangGraph) · CI/CD · Docker · Git · pytest · pydantic · Vitest

Experience

Where I've built

Enterprise data platforms, government ML systems, production software, and a business of my own.

Senior Data Solutions Engineer

Manulife Investment Management · Toronto, ON

Dec 2022 - Present
  • Own end-to-end modernization of a legacy on-prem analytics platform to a cloud-native architecture using Databricks, Azure Data Factory, ADLS with Delta Lake, Azure SQL MI, PySpark, and Pandas
  • Built high-performance operational analytics product for senior leadership - modeling tens of millions of records across medallion architecture layers with partitioned semantic models and optimized refresh strategies
  • Re-architected enterprise data quality platform, reducing infrastructure and operational costs by approximately $30K per year
Databricks · PySpark · Azure Data Factory · Delta Lake · Azure SQL MI · ADLS Gen2 · Power BI · dbt

Associate Data Scientist

Innovation, Science and Economic Development Canada · Remote

Sep 2021 - Apr 2022
  • Developed ML-based patent landscape maps that reduced operational costs by 36% through automated classification and clustering
  • Applied NLP techniques including word embeddings and text vectorization for patent document analysis at scale
Python · NLP · Machine Learning · Word Embeddings · Text Vectorization

Data Science Developer Intern

Government of Ontario · Remote

Jan 2021 - Apr 2021
  • Built Power BI dashboard prototypes for internal stakeholder reporting and decision support
  • Designed Python text-processing pipeline for document classification using logistic regression models
Python · Power BI · Logistic Regression · NLP

Education

Bachelor of Mathematics - Computational Mathematics

Statistics Minor

University of Waterloo · 2022

Leadership

President - ScrubHouse Inc.

Toronto, ON · May 2024 - Present

Founded and operate a services business alongside my full-time engineering role - managing operations, client acquisition, and growth strategy.

Featured Work

Systems I've shipped

Full-stack production software with real architecture decisions, real tradeoffs, and real users. I don't just design data platforms - I build products.

B2B SaaS - Vendor Compliance & Document Tracking

COI Vault

A multi-tenant SaaS platform that tracks vendor certificates of insurance, expirations, and compliance - with automated reminders and full audit trails. Built for property managers, condo boards, and general contractors.

[COI Vault - system architecture diagram: Browser client → Vercel Next.js App Router (Server Components for dashboard and SSG landing; Server Actions with Zod, auth, and plan limits; NextAuth JWT sessions; middleware route guards; cron-driven reminders) → enforcement layer (Zod validation, org isolation by orgId, plan limits, soft deletes, audit logging) → data layer (Prisma ORM with type-safe, orgId-filtered queries over serverless PostgreSQL on Neon with SSL and connection pooling; models: User, Membership, Organization, Vendor, Doc). External services: Stripe checkout and billing portal with webhooks for plan sync (Free · Pro $29 · Team $79) and Resend for 7-day-advance expiry reminder emails. Vitest unit and integration tests; GitHub Actions CI/CD deploying to Vercel.]

Problem

Property managers and contractors manually track vendor COIs in spreadsheets. They miss expirations, exposing themselves to liability gaps. There's no purpose-built tool for this - just generic document managers that don't understand compliance workflows.

Architecture

Server-first Next.js application using the App Router pattern. Dashboard pages are Server Components that fetch data directly from Prisma - no client-side fetching, no loading spinners, no waterfall requests. All mutations go through Server Actions with Zod validation and auth checks at every boundary. Multi-tenant isolation is enforced at the data layer: every query filters by organization ID, and every server action verifies org membership before proceeding.
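The tenant-isolation rule above is language-agnostic: verify membership first, then filter every query by organization ID. A minimal sketch in Python (the production code is TypeScript Server Actions over Prisma; all names and data here are hypothetical):

```python
# Sketch of org-scoped data access: membership is verified before any query
# runs, and every read filters by org_id. In-memory stand-ins replace the
# real Prisma/PostgreSQL layer.

VENDORS = [
    {"id": 1, "org_id": "org_a", "name": "Acme Roofing"},
    {"id": 2, "org_id": "org_b", "name": "Beta Plumbing"},
]
MEMBERSHIPS = {("user_1", "org_a"), ("user_2", "org_b")}

def list_vendors(user_id: str, org_id: str) -> list[dict]:
    # Auth boundary: reject callers who are not members of the org.
    if (user_id, org_id) not in MEMBERSHIPS:
        raise PermissionError("not a member of this organization")
    # Isolation boundary: every query filters by org_id, never by trust.
    return [v for v in VENDORS if v["org_id"] == org_id]
```

Calling `list_vendors("user_1", "org_a")` returns only org_a's vendors; asking for another org's data fails at the membership check, before any query runs.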

Stack

Next.js 16 (App Router) · TypeScript · Prisma + PostgreSQL · NextAuth (JWT) · Stripe · Resend · Vercel + Cron · Vitest · Tailwind CSS v4

Key Technical Decisions

Server Components for dashboard pages

Dashboard data doesn't need interactivity on initial render. By keeping pages as Server Components, we eliminate client-side fetching entirely - the HTML arrives with data already rendered. This removes loading states, reduces JavaScript bundle size, and makes the app feel instant.

Server Actions for all mutations

Every write operation (create vendor, upload document, change plan) goes through a Server Action with Zod validation. This creates a single enforcement layer - auth checks, plan limits, input validation, and audit logging all happen in one place. No API routes to maintain separately.
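That enforcement order can be sketched minimally. A Python illustration of the single-layer pattern (the real mutation path is a TypeScript Server Action with Zod; the names and limits here are hypothetical):

```python
# Sketch of one enforcement layer for a mutation: auth check, plan limit,
# input validation, then audit logging, all in one function, in order.

AUDIT_LOG: list[str] = []

def create_vendor(org: dict, payload: dict) -> dict:
    if not org.get("authenticated"):
        raise PermissionError("not signed in")
    if org["vendor_count"] >= org["plan_limit"]:
        raise ValueError("plan limit reached - upgrade to add more vendors")
    name = payload.get("name", "").strip()
    if not (1 <= len(name) <= 200):        # validate input at the boundary
        raise ValueError("vendor name must be 1-200 characters")
    vendor = {"name": name, "org_id": org["id"]}
    AUDIT_LOG.append(f"vendor.created org={org['id']} name={name}")
    return vendor
```

Because every write funnels through one function, there is exactly one place where a check can be missing, and exactly one place to fix it.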

JWT sessions over database sessions

For a B2B SaaS with moderate user counts, JWT sessions eliminate a database query on every request. The tradeoff is that session revocation requires token expiry rather than instant invalidation - acceptable for this use case.
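The tradeoff is visible in the verification path itself: a signed token is checked locally, so no database is consulted, and "revocation" is simply the expiry claim running out. A toy HS256-style sketch in Python (illustrative only; production uses NextAuth):

```python
import base64
import hashlib
import hmac
import json

# Sketch of stateless session verification: signature and expiry are checked
# locally, with no database round trip. The secret and claims are demo values.
SECRET = b"demo-secret"

def sign(payload: dict) -> str:
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def verify(token: str, now: float) -> dict:
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload["exp"] < now:               # revocation = waiting out expiry
        raise ValueError("session expired")
    return payload
```

The `exp` check is the whole revocation story: a compromised token stays valid until it expires, which is the tradeoff accepted above.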

AI Agent Pipeline - Documentary Script Generation

History Tales Script Generator

A production-ready LangGraph agent that autonomously generates high-retention, evidence-led history documentary scripts. 18-node pipeline with dual-model architecture, deterministic validation gates, cross-run learning, and a full web interface.

[History Tales - 18-node LangGraph pipeline diagram, grouped into research, analysis, writing, and QA stages: Topic Discovery (fast tier) → Format Guard (no LLM) → Topic Scoring (fast tier) → Research Fetch (no LLM) → Source Credibility (no LLM) → Claims Extraction, Cross-Check, Timeline Builder, Emotional Extraction (fast tier) → Outline (creative tier) → Hard Guardrails validation gate → Script Generation, Fact-Tighten, Retention Pass (creative tier) → Emotional Intensity, Sensory Density, Quality Check (fast tier) → Finalize (no LLM). QC fail → retry (max 2). Legend: creative tier = GPT-5, fast tier = GPT-5.2, deterministic gates use no LLM.]

Problem

Creating a high-quality documentary script requires weeks of research, fact-checking, narrative structuring, and multiple editing passes. Content creators need to cross-reference primary sources, maintain factual accuracy, engineer viewer retention, and hit precise timing targets - all while writing in a compelling cinematic style.

Architecture

An 18-node LangGraph workflow that separates concerns into distinct processing stages: research, analysis, writing, and quality assurance. Uses a dual-model architecture - creative tier (GPT-5) for writing nodes, fast tier (GPT-5.2) for analytical nodes. Deterministic validation gates between stages enforce structural constraints that no LLM hallucination can bypass. A feedback memory system learns from past runs and injects lessons into future prompts.
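The gate-and-retry shape of the graph can be sketched without LangGraph at all. A minimal Python illustration (simplified; the real pipeline threads typed state through LangGraph nodes and conditional edges):

```python
# Sketch of the QC retry loop: a stage's output passes through a
# deterministic check; on failure the pipeline loops back with feedback,
# at most twice, mirroring the graph's "QC fail -> retry (max 2)" edge.

def run_with_qc(generate, qc_check, max_retries: int = 2):
    feedback = None
    for attempt in range(max_retries + 1):
        draft = generate(feedback)          # feedback steers the next attempt
        ok, feedback = qc_check(draft)      # deterministic pass/fail + reason
        if ok:
            return draft, attempt
    raise RuntimeError("quality check failed after retries")
```

The key property: `qc_check` is ordinary code, so a retry is triggered by a measured failure, not by the model's own claim that the output is fine.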

Stack

LangGraph · Python · OpenAI API (Dual-Model) · FastAPI · Next.js 14 + shadcn/ui · Pydantic · Wikipedia + Archives API · Vitest + Pytest

Key Technical Decisions

Dual-model architecture

Writing quality and analytical speed have different requirements. Creative nodes (Outline, ScriptGeneration, RetentionPass) use a high-quality model for nuanced prose. Analytical nodes (scoring, extraction, QC) use a faster model for structured JSON output. This cuts cost and latency by 60% without sacrificing script quality.
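In practice this routing is a small lookup: each node declares a tier, and one table maps tiers to models, so the cost/latency tradeoff lives in a single place. A sketch (node names taken from the pipeline; the routing mechanism itself is illustrative):

```python
# Sketch of dual-model tier routing: changing a model for a whole tier is a
# one-line edit to MODEL_BY_TIER, not a hunt through 18 nodes.

MODEL_BY_TIER = {"creative": "gpt-5", "fast": "gpt-5.2"}

NODE_TIERS = {
    "Outline": "creative",
    "ScriptGeneration": "creative",
    "RetentionPass": "creative",
    "TopicScoring": "fast",
    "ClaimsExtraction": "fast",
    "QualityCheck": "fast",
}

def model_for(node: str) -> str:
    return MODEL_BY_TIER[NODE_TIERS[node]]
```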

Deterministic validation gates

LLMs can hallucinate structural compliance. The HardGuardrailsNode and FactTightenNode use deterministic Python validators - word count, entity provenance, tension escalation, rehook cadence - that cannot be bypassed by model output. If validation fails, the pipeline loops back with specific feedback.
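A gate like this is just ordinary code over the draft. A minimal sketch of two of the checks named above, word count and claim provenance (thresholds and the claim-ID format are illustrative):

```python
import re

# Sketch of a deterministic validation gate: pure-Python checks that an LLM
# cannot talk its way past. Returns a list of errors; empty = gate passes.

def validate_script(script: str, claims: set[str],
                    min_words: int = 50, max_words: int = 5000) -> list[str]:
    errors: list[str] = []
    words = script.split()
    if not (min_words <= len(words) <= max_words):
        errors.append(f"word count {len(words)} outside [{min_words}, {max_words}]")
    # Provenance: every claim ID cited in the script must exist as evidence.
    for cid in re.findall(r"C\d{3}", script):
        if cid not in claims:
            errors.append(f"claim {cid} has no source evidence")
    return errors
```

When the list is non-empty, the pipeline loops back with those exact messages as feedback, which is what makes the retry targeted rather than a blind regenerate.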

Two-stage script generation (Draft → Fact-Tighten)

Stage A writes the creative draft. Stage B rewrites with per-paragraph trace tags ([Beat B03 | Claims C001,C012]) that create an auditable link between every statement and its source evidence. Tags are stripped from the final script but available for verification.
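The strip step can keep the audit link while removing the tags. A sketch, with the tag grammar inferred from the example above:

```python
import re

# Sketch of trace-tag stripping: tags like "[Beat B03 | Claims C001,C012]"
# are removed from the final script, but the beat -> claims map is kept
# for verification.

TAG = re.compile(r"\[Beat (B\d+) \| Claims ([C\d,]+)\]\s*")

def strip_tags(script: str) -> tuple[str, dict[str, list[str]]]:
    audit: dict[str, list[str]] = {}
    def record(match: re.Match) -> str:
        audit[match.group(1)] = match.group(2).split(",")
        return ""                           # drop the tag from the prose
    return TAG.sub(record, script).strip(), audit
```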

Engineering Philosophy

How I build

Principles forged from enterprise data platforms, production SaaS, and AI agent systems - the same engineering rigor, regardless of the layer.

Medallion Architecture

Bronze, silver, gold - every pipeline I build follows a clear layered progression. Raw ingestion lands untouched, transformations are explicit and auditable, and the gold layer serves a single purpose: trusted data ready for consumption. No shortcuts between layers.
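The layering itself is independent of the engine. A dependency-free Python sketch of the progression (production runs PySpark over Delta Lake; the record shape is hypothetical):

```python
# Sketch of bronze -> silver -> gold: raw rows land untouched, cleaning is
# explicit and auditable, and gold is a small consumption-ready aggregate.

def bronze(raw_rows: list[dict]) -> list[dict]:
    return list(raw_rows)                  # land as-is, never mutated

def silver(bronze_rows: list[dict]) -> list[dict]:
    # Explicit transforms: drop malformed rows, normalize keys and types.
    return [
        {"account": r["account"].strip().upper(), "amount": float(r["amount"])}
        for r in bronze_rows
        if r.get("account", "").strip() and r.get("amount") not in (None, "")
    ]

def gold(silver_rows: list[dict]) -> dict[str, float]:
    # Trusted, purpose-built output: total amount per account.
    totals: dict[str, float] = {}
    for r in silver_rows:
        totals[r["account"]] = totals.get(r["account"], 0.0) + r["amount"]
    return totals
```

Because each layer only reads the one before it, a bad gold number can be traced back through silver to the untouched bronze row that produced it.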

Server-Side Enforcement

The UI suggests, the server enforces. Plan limits, auth boundaries, input validation, and business rules all live in Server Actions or API handlers - never in client code. A beautiful button means nothing if the mutation behind it doesn't check permissions.

Clean Architecture

Clear boundaries between layers - whether it's bronze → silver → gold in Delta Lake, staging → intermediate → marts in dbt, or data fetching → business logic → presentation in application code. Separation makes systems debuggable at 2 AM.

Type Safety End to End

From database schema to API response to UI component - types should flow without breaks. Prisma generates them from the schema, Zod validates at boundaries, and TypeScript catches the rest at compile time. Runtime surprises are a design failure.

Deterministic Over Probabilistic

AI pipelines need guardrails that LLMs can't hallucinate past. Structural validation, schema conformance checks, and rule-based gates ensure that every output is verified before it moves downstream - 41 validators in History Tales exist for exactly this reason.

Production-First Mentality

Error handling, audit logging, data freshness checks, input validation, and CI/CD ship with v1 - not as tech debt. Whether it's a data pipeline at 6 AM or a SaaS endpoint under load, production doesn't wait for your refactor sprint.

Writing & Thinking

Technical writing

Long-form thinking on architecture, systems design, and engineering decisions. Coming soon.

Platform Engineering · Coming soon

Modernizing Legacy Analytics: On-Prem to Databricks + Delta Lake

Lessons from leading a full platform migration - replacing brittle on-prem pipelines with Databricks, ADLS Gen2, Delta Lake, and Azure Data Factory, while keeping the business running.

Software Architecture · Coming soon

Server Actions as the Enforcement Layer: Why the UI Suggests but the Server Decides

On building multi-tenant SaaS where plan limits, auth checks, and input validation all live in Server Actions - and why moving enforcement to the server simplified everything.

Data Engineering · Coming soon

Medallion Architecture in Practice: Modeling Tens of Millions of Records

How I structured bronze, silver, and gold layers with PySpark and Delta Lake for an operational analytics product - partitioned semantic models, incremental loads, and optimized refresh strategies.

AI Engineering · Coming soon

Deterministic Validation Gates for AI Agent Pipelines

LLMs can produce structurally invalid output that looks correct. Here's how I built 41 validation gates with pydantic, Zod, and pytest that no hallucination can bypass.

Full-Stack Engineering · Coming soon

Building a Multi-Tenant SaaS with Next.js, Prisma, and Stripe

A technical walkthrough of COI Vault - org-scoped data isolation, JWT sessions, Stripe webhook lifecycle, and soft deletes with audit trails for compliance-grade B2B software.

Contact

Let's connect

I'm always open to discussing systems architecture, data platform design, software engineering, and interesting technical challenges.