
Hungry Wizard: Unlimited Data, Unbounded Scale

Leveraging modern cloud systems and parallel processing to tame healthcare data in near real time

Impact at a Glance

Business Outcomes

  • Enabled $420M in earnings in 2022
  • #1 MSSP ACO in 2023, earning $167M
  • Increased capacity to take on higher-risk contracts with bigger rewards

Product Outcomes

  • Code run time: 99× faster
  • Concurrency: 1 → 50
  • Contract onboarding: 1 month → 3 days

Traction Metrics

  • 150+ monthly VBC contracts
  • 1.5M lives supported
  • 7-day SLA with only 2 data-ops engineers

Origin Story

I was initially focused on optimizing our analytics platform—a cutting-edge system capable of deep insights and financial projections for Value-Based Care. Yet no matter how sophisticated our algorithms became, we hit a massive bottleneck: the data. Incomplete or delayed data meant we couldn’t unlock meaningful results, and the pipeline itself was a tangle of messy CSVs, secure emails, and manual approvals.
At that time, eight full-time data engineers were juggling eight monthly contracts, each wrestling with last-minute requests. Expenses were mounting, our company was exploring offshore engineering in India, and leadership wanted more. Then, one day, my supervisor sat me down and said: “We need to scale from 8 contracts a month to 150+, all within a five-day SLA. You’ll form and lead a distributed team—some in India, some in the US—to solve this. Propose a solution and make it happen.”
It felt impossible—over a 10x jump in capacity, tighter budgets, a new offshore model, and a ticking clock on data timeliness. But I knew that if we pulled it off, we’d transform how our entire organization handled data. Our analytics engine would finally get the robust, near real-time feeds it deserved, and we’d pave the way for true data-driven decisions in population health and Value-Based Care.

Context and Problem

As our organization expanded into value-based care, we had to unify data from multiple payers—each delivering CSV, Excel, or HL7 feeds—so we could provide robust population health analytics. While we initially handled just a handful of monthly data feeds, leadership aimed to scale to 150+ without blowing up engineering costs. Meeting an aggressive 7-day SLA demanded a scalable, automated platform that would unify these disparate formats and reduce our heavy reliance on manual processes.

Our legacy pipeline relied on one-off scripts, ad-hoc file handling, and no single source of truth for contract metadata. Each new contract integration took weeks of developer time to wrestle with messy data formats—slowing analytics and threatening SLAs. We needed an end-to-end solution capable of ingesting and normalizing diverse payer data at scale, ensuring we could power value-based care insights without drowning in constant rework.

My Role

As Director of Product and Data Engineering, I orchestrated the overall vision, roadmap, and cross-functional execution. This involved bridging executive stakeholders—like the VP of Population Health and COO of Pop Health—with product and engineering teams in both the US and India. Tapping into my strengths in strategic thinking and relationship-building, I ensured everyone—from data ops to regional executives—stayed laser-focused on our 5-day SLA for data availability.
To keep teams aligned across time zones, I coordinated daily stand-ups and weekly demos, fostering a cohesive, agile culture. Early on, I collaborated closely with principal engineers to architect our initial solution; as we scaled, I mentored BAs and project managers to adopt a product mindset, forming smaller, focused sub-product teams. I introduced continuous discovery and delivery practices, lean product thinking, and story mapping. This blend of technical architecture and customer-centric iteration ensured each step delivered real user value while meeting executive priorities.

Our Strategy and Process

Strategic Alignment and Discovery

Before execution began, I led a series of strategy sessions with my supervisor and colleagues to validate our internal product-market fit. This is where we brought the Business Model Canvas (BMC) and Value Proposition Design (VPD) into our workflow:
Business Model Canvas (BMC): We mapped our cost structures, potential revenue impact, and key payer partnerships, ensuring our data platform aligned with broader business goals in value-based care. This clarified our “where to play” and “how to win,” recognizing the need for automated ingestion to serve 150+ VBC contracts without exploding costs.

Value Proposition Design (VPD): We pinpointed user segments (data ops, regional executives, analytics teams), identifying their jobs, pains, and gains. This allowed us to shape a scalable, cloud-based engine that tackled messy data ingestion—transforming CSV/Excel complexities into a frictionless experience for stakeholders.
Armed with these strategic insights, we set a high-level vision for an automated, end-to-end data ingestion pipeline. Everyone from data engineers to executive leadership embraced the platform’s why: it wasn’t just about cleaning data—it was about enabling real-time decisions in population health and value-based care.

For data ops teams and healthcare leaders struggling to unify messy CSV/Excel feeds from multiple payers, Hungry Wizard is a scalable data ingestion platform that automates, validates, and consolidates files, meeting strict SLAs and empowering real-time analytics for 150+ value-based care contracts. Unlike ad-hoc manual workflows or fragmented ETL scripts, we provide an end-to-end pipeline that slashes overhead, accelerates insights, and seamlessly scales with your VBC growth.

Agile and Lean Delivery

With clear strategic goals defined, we adopted an agile, lean product mindset to deliver iteratively. We used user-story mapping sessions to spotlight high-impact features, surface high-risk areas and gaps in our thinking, and establish clear workflows.

Our lean approach ensured we quickly validated each sub-product's value. Weekly demos and daily stand-ups bridged the India-US time zones, maintaining alignment with the strategic vision established in our BMC/Value Prop work. This cycle of feedback and iteration helped us refine each sub-product continuously, hitting our 5-day SLA goals without losing sight of the bigger picture: enabling scalable, parallel data ingestion for 150+ VBC contracts.

Products and Implementation

A suite of products to handle the end-to-end process

Risk Arrangement Data

Risk Arrangement Data (RAD) is a centralized repository for all value-based care (VBC) contract metadata. It consolidates key contract details (e.g., payer information and contract terms) that were previously scattered across ad hoc spreadsheets. By serving as a single source of truth, RAD ensures downstream processes always have consistent, up-to-date contract parameters in real time.

The Who and Why

  • Primary Users: Contract managers and data managers responsible for maintaining VBC contract details.
  • Update Frequency: Infrequent updates (only when new contracts are signed or existing contracts are modified).
  • Regular Access: Contract data is accessed regularly (e.g., monthly when processing incoming payer files).
  • Value: RAD ensures every team and system works from the same updated contract data, eliminating confusion caused by multiple Excel versions.

Technical & Functional Details

RAD stores a structured set of metadata fields for each contract. Key fields include:

  • Contract scope and identifiers: contract name, participating regions, and contract duration
  • Payer and model information: payer organization and risk model category (e.g. shared savings vs. capitation)
  • Performance metrics: quality measures and targets associated with the contract
  • Provider references: provider identifiers (Tax ID or NPI) linked to the contract’s roster

Under the hood, RAD runs on a relational database with a simple user interface for data entry. Other pipeline components query RAD on the fly to retrieve contract configurations, ensuring uniform rules and definitions across all processes.
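
To make the idea concrete, here is a minimal sketch of what a RAD contract record and an on-the-fly lookup could look like, assuming a SQLAlchemy model; the table, field names, and connection string are illustrative, not the production schema.

```python
# Hypothetical sketch of a RAD contract record and lookup (not the real schema).
from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class RiskContract(Base):
    """One VBC contract's metadata: the single source of truth for downstream jobs."""
    __tablename__ = "risk_contracts"

    id = Column(Integer, primary_key=True)
    contract_name = Column(String, nullable=False)  # contract scope / identifier
    regions = Column(String)                        # participating regions
    start_date = Column(Date)                       # contract duration
    end_date = Column(Date)
    payer = Column(String)                          # payer organization
    risk_model = Column(String)                     # e.g. "shared_savings" or "capitation"
    quality_measures = Column(String)               # quality measures and targets (simplified)
    provider_ids = Column(String)                   # Tax IDs / NPIs on the contract roster

# Placeholder engine; the production RAD sits on a relational database.
engine = create_engine("sqlite:///rad_sketch.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def contract_config(name):
    """Fetch current parameters for a contract, as other pipeline components do on the fly."""
    with Session() as session:
        return session.query(RiskContract).filter_by(contract_name=name).one_or_none()
```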

Impact & Outcomes

  • 5× faster contract onboarding
  • Supports 150+ value-based contracts smoothly
  • Frees teams to focus on strategy instead of manual fixes
  • Near-zero documentation effort

Contract Toolkit

Managing payer file transformations used to be cumbersome, requiring spreadsheets of rules and custom-coded logic for each new contract. This approach led to delays (onboarding could stretch to a month) and frequent errors (misinterpretations between business analysts and developers).

The Contract Toolkit solves this by providing a low-code, UI-driven transformation engine that replaces manual documentation and developer hardcoding. It empowers analysts to define and refine data transformation logic themselves—no code required. This self-service model cuts turnaround time dramatically and improves the accuracy of ingested data.

The Who and Why

Business analysts and data managers—those who truly understand incoming payer file formats—are the Toolkit’s key users. Previously, they documented every detail in Excel, waited for developers to implement the logic, and endured repeated clarifications. By transferring that responsibility to a web-based platform:

  • Ownership: BAs now set up transformations without relying on developers.
  • Easy Adoption: A short, in-tool training guide helps new users learn the UI in a day.
  • Accountability: Every change is logged, allowing easy audits or rollbacks.
  • Faster Iterations: Changes or corrections happen immediately, avoiding miscommunication and missed deadlines.

Functional and Technical Details

A simple, step-by-step UI guides BAs through defining file logic, drawing on relevant contract data from RAD. They select the incoming file, map source columns to standardized fields, and configure transformations or filtering rules—no coding required. All such rules compile into PySpark jobs for large-scale processing (a simplified sketch follows the list below):

  • Mapping & Cleaning: Source columns tie directly to predefined “concept” fields with built-in cleaning.
  • Joining Files: Multiple inputs can be merged on specified keys (left/right/inner/outer).
  • Key Assignment: Conditions in the interface determine record keys, replacing custom-coded logic.
  • Filtering & Overrides: Invalid rows are filtered, and edge cases can be handled with quick overrides.
  • Live Preview: BAs see real-time transformation results before final ingestion, reducing rework.
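
For illustration only, the sketch below shows how a Toolkit rule might compile into a PySpark job; the rule structure, file name, and column names are hypothetical, not the Toolkit's actual internals.

```python
# Hypothetical sketch: compiling a UI-defined rule into a PySpark transformation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("contract-toolkit-sketch").getOrCreate()

# A rule a BA might configure in the UI (names are illustrative).
rule = {
    "source_file": "payer_claims.csv",
    "mappings": {"MBR_ID": "member_id", "SVC_DT": "service_date", "PAID_AMT": "paid_amount"},
    "filters": [F.col("PAID_AMT") >= 0],  # drop invalid rows
}

def compile_rule(rule):
    """Turn the declarative rule into an executable transformation."""
    df = spark.read.option("header", True).csv(rule["source_file"])
    for predicate in rule["filters"]:                 # filtering & overrides
        df = df.filter(predicate)
    mapped = [F.col(src).alias(dst) for src, dst in rule["mappings"].items()]
    return df.select(*mapped)                         # map source columns to concept fields

standardized = compile_rule(rule)
standardized.show(5)  # a rough stand-in for the "live preview" step
```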

Impact and Outcomes

Shifting transformation logic to the BAs who know the data best has drastically improved efficiency and quality:

  • Faster Onboarding: Monthly new contracts rose from 5–6 to 18–20 without developer bottlenecks.
  • Reduced Errors: Miscommunication effectively dropped to zero, and overall data interpretation mistakes decreased by 90% thanks to standardized mapping steps.
  • Quick Turnaround: Onboarding a new payer file takes 3–5 days instead of several weeks.
  • Scalable & Future-Proof: Teams can adapt transformations on the fly as payer requirements evolve, all while keeping a transparent log of changes for compliance and governance.

Data Acquisition and Prechecks

Together, these components handle file ingestion from payer SFTP servers and validate each dataset before it enters the core engine. Although simpler than other products, they form the first crucial steps of our monthly pipeline.

Data Acquisition and Distribution

Prechecks
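
As a rough illustration of this stage, the sketch below assumes prechecks verify that a delivered file exists, carries the expected headers, and contains data rows before hand-off to the core engine; the required columns are hypothetical.

```python
# Hypothetical precheck sketch: file-level validation before the core engine runs.
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"member_id", "service_date", "paid_amount"}  # assumed concept fields

def precheck(path: Path) -> list:
    """Return a list of problems; an empty list means the file may proceed."""
    problems = []
    if not path.exists() or path.stat().st_size == 0:
        problems.append(f"{path.name}: file missing or empty")
        return problems
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            problems.append(f"{path.name}: missing columns {sorted(missing)}")
        if sum(1 for _ in reader) == 0:
            problems.append(f"{path.name}: no data rows")
    return problems
```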

Core Data Processing Engine - A Paradigm Shift from SQL to PySpark

Our Core Processing Engine is the heart of our data pipeline, replacing sequential SQL stored procedures with a parallelized PySpark architecture. This shift enabled us to tackle massive volumes of payer data in near-real time, aligning with modern big-data practices seen at leading tech firms.

Key Technical Highlights

  • Object-Oriented Modularity: The engine breaks down processing into separate classes (e.g., DataLoader, DataCleaner) for ingestion, transformations, and output. New or unusual contract types only require small “child classes,” preserving a scalable, maintainable base (a simplified sketch follows this list).
  • Concurrency & Parallelism: Processing runs on multiple contracts simultaneously, leveraging Spark’s partitioning to reduce the largest datasets from days to minutes. The engine can handle 20–30 incoming contracts at once—a feat previously impossible with sequential SQL logic.
  • Seamless Ecosystem Integration: Pulls contract metadata from RAD, transformation rules from the Contract Toolkit, and sends final outputs to our Data Validation Suite.
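
A heavily simplified sketch of this pattern follows; beyond the DataLoader and DataCleaner names, the classes, paths, and cleaning rules are illustrative, and concurrency is shown with a driver-side thread pool rather than the engine's actual orchestration.

```python
# Simplified sketch of the engine's modular, parallel design (illustrative only).
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("core-engine-sketch").getOrCreate()

class DataLoader:
    """Ingestion stage: read a contract's standardized files."""
    def load(self, contract_id: str) -> DataFrame:
        return spark.read.parquet(f"/staging/{contract_id}/")  # placeholder path

class DataCleaner:
    """Transformation stage: shared cleaning rules."""
    def clean(self, df: DataFrame) -> DataFrame:
        return df.dropDuplicates().na.drop(subset=["member_id"])  # illustrative rules

class CapitationCleaner(DataCleaner):
    """Small 'child class' override for an unusual contract type."""
    def clean(self, df: DataFrame) -> DataFrame:
        return super().clean(df).filter("paid_amount >= 0")

def process_contract(contract_id: str, cleaner: DataCleaner) -> None:
    df = cleaner.clean(DataLoader().load(contract_id))
    df.write.mode("overwrite").parquet(f"/processed/{contract_id}/")  # output stage

# Spark jobs for many contracts can be submitted concurrently from one driver.
contracts = [("contract_001", DataCleaner()), ("contract_002", CapitationCleaner())]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda args: process_contract(*args), contracts))
```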

Challenges and Lessons

Transitioning from stored procedures to a distributed Spark engine demanded a steep learning curve. We faced memory constraints and data skew issues early on, prompting us to:

  • Tune Partitions & Caching: Minimizing shuffle overhead made large file processing 100× faster (see the sketch after this list).
  • Adopt Best Practices: Lazy evaluation concepts, ephemeral clusters, and streamlined code all contributed to robust performance.
  • Educate Teams: Mentoring data engineers on Spark internals fostered a culture of performance awareness and continuous improvement.
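
The sketch below illustrates the partitioning and caching idea under stated assumptions: a join keyed on member_id, placeholder paths, and an arbitrary partition count.

```python
# Illustrative tuning sketch: repartition on the join key and cache reused data
# to reduce shuffle overhead (paths, keys, and sizes are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

claims = spark.read.parquet("/processed/contract_001/")
eligibility = spark.read.parquet("/reference/eligibility/")

# Align partitioning with the join key so the expensive shuffle happens once.
claims = claims.repartition(200, "member_id")

# Eligibility is reused by several downstream steps; keep it in memory after first use.
eligibility = eligibility.cache()

matched = claims.join(eligibility, on="member_id", how="left")
matched.write.mode("overwrite").parquet("/matched/contract_001/")
```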

Performance and Impact

  • Largest contract: from 5 days to 20 minutes
  • Average contract: from 5 hours to 3 minutes
  • Patient matching: from 2 days to 15 minutes
  • Operational Efficiency: Drastically reduced manual intervention. The pipeline now comfortably processes 1,500 files monthly, even under heavy load.
  • Future-Proof Scalability: Flexible class design lets us onboard new payer feeds without overhauling core code, ensuring we keep pace as data volumes and contract complexity grow.

Why it Matters

By adopting PySpark and a truly parallel design, we revolutionized our data processing. Teams gain near-instant visibility into new contract data, supporting real-time analytics and broader value-based care initiatives. The Core Processing Engine turned an outdated, slow ingestion workflow into a streamlined, modern data pipeline—empowering us to scale far beyond our previous limits without sacrificing accuracy or control.

Data Validation and CDW Loading

Data Validation

Purpose: Ensures data accuracy by comparing final engine outputs to raw files and previous-month baselines, catching anomalies (e.g., missing columns, unexpected record counts) before data is marked “valid.”

Prevents corrupted or incomplete data from being warehoused—reducing reprocessing overhead and building trust in final reports.
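
A minimal sketch of two such checks, assuming PySpark DataFrames for the engine output and last month's baseline; the required columns and the 25% count threshold are made-up examples, not the actual validation rules.

```python
# Illustrative validation sketch: column and record-count checks before data is
# marked "valid" (paths, columns, and thresholds are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("validation-sketch").getOrCreate()

output = spark.read.parquet("/processed/contract_001/")
baseline = spark.read.parquet("/processed_prev_month/contract_001/")

issues = []

# 1. Required standardized columns must be present in the final output.
REQUIRED = {"member_id", "service_date", "paid_amount"}
missing = REQUIRED - set(output.columns)
if missing:
    issues.append(f"missing columns: {sorted(missing)}")

# 2. Record counts should not swing wildly against the previous month's baseline.
out_count, base_count = output.count(), baseline.count()
if base_count and abs(out_count - base_count) / base_count > 0.25:
    issues.append(f"unexpected record count: {out_count} vs baseline {base_count}")

print("VALID" if not issues else f"NEEDS REVIEW: {issues}")
```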

CDW Loading

Purpose: Moves validated data to a Cloud Data Warehouse, making it accessible to analytics teams and BI tools.

How It Works: After passing all validation checks, data is inserted or updated in the CDW, typically via a structured SQL or managed ingestion process.
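
For illustration, here is what the load step could look like with Spark's JDBC writer; the warehouse URL, table, and credentials are placeholders, and a real deployment might instead use the warehouse's managed ingestion or a merge/upsert strategy.

```python
# Illustrative CDW load: append validated data via Spark's JDBC writer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdw-load-sketch").getOrCreate()
validated = spark.read.parquet("/validated/contract_001/")

(validated.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://cdw-host:5432/warehouse")  # placeholder
    .option("dbtable", "vbc.claims_monthly")                     # placeholder
    .option("user", "loader")
    .option("password", "***")
    .mode("append")  # upserts/merges would be handled warehouse-side
    .save())
```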

Impact: Business stakeholders can confidently query near real-time data, knowing it has passed rigorous checks. Eliminates the manual steps that once plagued monthly ingestion.

SLA Management Dashboard: From Reactive to Real-Time Pipeline Oversight

Challenge: DataOps relied on emails and manual checks to track pipeline progress. Issues were often discovered only after SLA breaches or significant delays. This reactive approach made it hard to meet strict 5-day SLAs consistently.

Solution: A centralized dashboard to provide live visibility into each pipeline stage. It transformed monitoring from reactive to proactive by introducing instant alerts on delays or errors. The dashboard also includes integrated controls to trigger pipeline re-runs or tasks directly from the UI, enabling immediate intervention without switching tools.

Technical Highlights

API-Driven Integration: A custom API layer connects the dashboard with the data pipelines. It logs each pipeline stage’s start/end times and outcomes to a central store, updating SLA status in real time. The same API accepts commands from the UI to initiate or re-run pipeline tasks on demand, eliminating the need to manually run scripts or use separate tools.

Publish-Subscribe Architecture: The system leverages a pub/sub model (WebSockets) to push live status updates to all users. Multiple DataOps engineers can have the dashboard open and see changes simultaneously (for example, a stage completion or an alert) without any page refresh. This real-time broadcast ensures the whole team shares up-to-the-second information and can act in concert.
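
A minimal sketch of this pattern, assuming FastAPI for the status API and its WebSocket support for the broadcast; endpoint paths, payload fields, and the in-memory stores are illustrative, not the production interface.

```python
# Illustrative API + pub/sub sketch for the SLA dashboard (FastAPI assumed;
# endpoints and payload fields are hypothetical).
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
subscribers = []   # connected dashboard sessions
stages = {}        # latest status per pipeline stage

@app.post("/stages/{stage_id}/status")
async def update_stage(stage_id: str, payload: dict):
    """Pipelines report stage start/end times and outcomes here."""
    stages[stage_id] = payload
    # Publish the change to every open dashboard without a page refresh.
    for ws in list(subscribers):
        try:
            await ws.send_json({"stage": stage_id, **payload})
        except Exception:
            subscribers.remove(ws)
    return {"ok": True}

@app.websocket("/ws/status")
async def status_feed(ws: WebSocket):
    """Dashboards subscribe here for live updates."""
    await ws.accept()
    subscribers.append(ws)
    try:
        while True:
            await ws.receive_text()  # keep the connection open
    except WebSocketDisconnect:
        subscribers.remove(ws)
```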

Impact & Outcomes

  • SLA Adherence: Virtually eliminated SLA breaches by catching and addressing delays before the 5-day deadline, with 100% of monthly contract ingestions now completing within the SLA window.
  • Operational Efficiency: Freed the team from constant manual monitoring. Two engineers now manage 150+ contract pipelines with ease, focusing on flagged issues instead of checking all pipelines manually.
  • Faster Recovery: Real-time alerts and one-click re-runs cut average troubleshooting and recovery time by over 50%. Pipeline issues that once took hours to find and fix are now resolved within minutes, minimizing data downtime.
  • Trust & Transparency: Executives now have a single, transparent source of truth for pipeline status, while DataOps addresses bottlenecks in real time. This shared visibility builds confidence across teams and eliminates the guesswork and last-minute scrambles of the past.