
Hungry Wizard: Unlimited Data, Unbounded Scale

Leveraging modern cloud systems and parallel processing to tame healthcare data in near real time

Impact at a Glance

Business Outcomes

  • Enabled $420M in earnings in 2022
  • #1 MSSP ACO in 2023, earning $167M
  • Increased capacity to take on higher-risk contracts with bigger rewards

Product Outcomes

  • Code run time: 99× faster
  • Concurrency: 1 → 50
  • Contract onboarding: 1 month → 3 days

Traction Metrics

  • 150+ monthly VBC contracts
  • 1.5M lives supported
  • 7-day SLA with only 2 data-ops engineers

Origin Story

I was initially focused on optimizing our analytics platform—a cutting-edge system capable of deep insights and financial projections for Value-Based Care. Yet no matter how sophisticated our algorithms became, we hit a massive bottleneck: the data. Incomplete or delayed data meant we couldn’t unlock meaningful results, and the pipeline itself was a tangle of messy CSVs, secure emails, and manual approvals.
At that time, eight full-time data engineers were juggling eight monthly contracts, each wrestling with last-minute requests. Expenses were mounting, our company was exploring offshore engineering in India, and leadership wanted more. Then, one day, my supervisor sat me down and said: “We need to scale from 8 contracts a month to 150+, all within a five-day SLA. You’ll form and lead a distributed team—some in India, some in the US—to solve this. Propose a solution and make it happen.”
It felt impossible—over a 10x jump in capacity, tighter budgets, a new offshore model, and a ticking clock on data timeliness. But I knew that if we pulled it off, we’d transform how our entire organization handled data. Our analytics engine would finally get the robust, near real-time feeds it deserved, and we’d pave the way for true data-driven decisions in population health and Value-Based Care.

Context and Problem

As our organization expanded into value-based care, we had to unify data from multiple payers—each delivering CSV, Excel, or HL7 feeds—so we could provide robust population health analytics. While we initially handled just a handful of monthly data feeds, leadership aimed to scale to 150+ without blowing up engineering costs. Meeting an aggressive 7-day SLA demanded a scalable, automated platform that would unify these disparate formats and reduce our heavy reliance on manual processes.

Our legacy pipeline relied on one-off scripts, ad-hoc file handling, and no single source of truth for contract metadata. Each new contract integration took weeks of developer time to wrestle with messy data formats—slowing analytics and threatening SLAs. We needed an end-to-end solution capable of ingesting and normalizing diverse payer data at scale, ensuring we could power value-based care insights without drowning in constant rework.

My Role

As Director of Product and Data Engineering, I orchestrated the overall vision, roadmap, and cross-functional execution. This involved bridging executive stakeholders—like the VP of Population Health and COO of Pop Health—with product and engineering teams in both the US and India. Tapping into my strengths in strategic thinking and relationship-building, I ensured everyone—from data ops to regional executives—stayed laser-focused on our 5-day SLA for data availability.
To keep teams aligned across time zones, I coordinated daily stand-ups and weekly demos, fostering a cohesive, agile culture. Early on, I collaborated closely with principal engineers to architect our initial solution; as we scaled, I mentored BAs and project managers to adopt a product mindset, forming smaller, focused sub-product teams. I introduced continuous discovery and delivery practices, lean product thinking, and story mapping. This blend of technical architecture and customer-centric iteration ensured each step delivered real user value while meeting executive priorities.

Our Strategy and Process

Strategic Alignment and Discovery

Before execution began, I led a series of strategy sessions with my supervisor and colleagues to validate our internal product-market fit. This is where we brought the Business Model Canvas (BMC) and Value Proposition Design (VPD) into our workflow:
Business Model Canvas (BMC): We mapped our cost structures, potential revenue impact, and key payer partnerships, ensuring our data platform aligned with broader business goals in value-based care. This clarified our “where to play” and “how to win,” recognizing the need for automated ingestion to serve 150+ VBC contracts without exploding costs.

Value Proposition Design (VPD): We pinpointed user segments (data ops, regional executives, analytics teams), identifying their jobs, pains, and gains. This allowed us to shape a scalable, cloud-based engine that tackled messy data ingestion—transforming CSV/Excel complexities into a frictionless experience for stakeholders.
Armed with these strategic insights, we set a high-level vision for an automated, end-to-end data ingestion pipeline. Everyone from data engineers to executive leadership embraced the platform’s why: it wasn’t just about cleaning data—it was about enabling real-time decisions in population health and value-based care.

For data ops teams and healthcare leaders struggling to unify messy CSV/Excel feeds from multiple payers, Hungry Wizard is a scalable data ingestion platform that automates, validates, and consolidates files, meeting strict SLAs and empowering real-time analytics for 150+ value-based care contracts. Unlike ad-hoc manual workflows or fragmented ETL scripts, we provide an end-to-end pipeline that slashes overhead, accelerates insights, and seamlessly scales with your VBC growth.

Agile and Lean Delivery

With clear strategic goals defined, we adopted an agile, lean product mindset to deliver iteratively. We used user-story mapping sessions to spotlight high-impact features, surface high-risk areas and gaps in our thinking, and establish clear workflows.

Our lean approach ensured we quickly validated each sub-product's value. Weekly demos and daily stand-ups bridged the India-US time zones, maintaining alignment with the strategic vision established in our BMC/Value Prop work. This cycle of feedback and iteration helped us refine each sub-product continuously, hitting our 5-day SLA goals without losing sight of the bigger picture: enabling scalable, parallel data ingestion for 150+ VBC contracts.

Products and Implementation

A suite of products to handle the end-to-end process

Risk Arrangement Data

Risk Arrangement Data (RAD) is a centralized repository for all value-based care (VBC) contract metadata. It consolidates key contract details (e.g., payer information and contract terms) that were previously scattered across ad hoc spreadsheets. By serving as a single source of truth, RAD ensures downstream processes always have consistent, up-to-date contract parameters in real time.

The Who and Why

  • Primary Users: Contract managers and data managers responsible for maintaining VBC contract details.
  • Update Frequency: Infrequent updates (only when new contracts are signed or existing contracts are modified).
  • Regular Access: Contract data is accessed regularly (e.g., monthly when processing incoming payer files).
  • Value: RAD ensures every team and system works from the same updated contract data, eliminating confusion caused by multiple Excel versions.

Technical & Functional Details

RAD stores a structured set of metadata fields for each contract. Key fields include:

  • Contract scope and identifiers: contract name, participating regions, and contract duration
  • Payer and model information: payer organization and risk model category (e.g. shared savings vs. capitation)
  • Performance metrics: quality measures and targets associated with the contract
  • Provider references: provider identifiers (Tax ID or NPI) linked to the contract’s roster

Under the hood, RAD runs on a relational database with a simple user interface for data entry. Other pipeline components query RAD on the fly to retrieve contract configurations, ensuring uniform rules and definitions across all processes.
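
To make the idea concrete, here is a minimal sketch of what a RAD contract record and an on-the-fly lookup could look like, assuming a SQLAlchemy model; the table, field names, and connection string are illustrative, not the production schema.

```python
# Hypothetical sketch of a RAD contract record and lookup (not the real schema).
from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class RiskContract(Base):
    """One VBC contract's metadata: the single source of truth for downstream jobs."""
    __tablename__ = "risk_contracts"

    id = Column(Integer, primary_key=True)
    contract_name = Column(String, nullable=False)  # contract scope / identifier
    regions = Column(String)                        # participating regions
    start_date = Column(Date)                       # contract duration
    end_date = Column(Date)
    payer = Column(String)                          # payer organization
    risk_model = Column(String)                     # e.g. "shared_savings" or "capitation"
    quality_measures = Column(String)               # quality measures and targets (simplified)
    provider_ids = Column(String)                   # Tax IDs / NPIs on the contract roster

# Placeholder engine; the production RAD sits on a relational database.
engine = create_engine("sqlite:///rad_sketch.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def contract_config(name):
    """Fetch current parameters for a contract, as other pipeline components do on the fly."""
    with Session() as session:
        return session.query(RiskContract).filter_by(contract_name=name).one_or_none()
```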

Impact & Outcomes

  • 5× faster contract onboarding
  • Supports 150+ value-based contracts smoothly
  • Frees teams to focus on strategy instead of manual fixes
  • Near-zero documentation effort

Contract Toolkit

Managing payer file transformations used to be cumbersome, requiring spreadsheets of rules and custom-coded logic for each new contract. This approach led to delays (onboarding could stretch to a month) and frequent errors (misinterpretations between business analysts and developers).

The Contract Toolkit solves this by providing a low-code, UI-driven transformation engine that replaces manual documentation and developer hardcoding. It empowers analysts to define and refine data transformation logic themselves—no code required. This self-service model cuts turnaround time dramatically and improves the accuracy of ingested data.

The Who and Why

Business analysts and data managers—those who truly understand incoming payer file formats—are the Toolkit’s key users. Previously, they documented every detail in Excel, waited for developers to implement the logic, and endured repeated clarifications. By transferring that responsibility to a web-based platform:

  • Ownership: BAs now set up transformations without relying on developers.
  • Easy Adoption: A short, in-tool training guide helps new users learn the UI in a day.
  • Accountability: Every change is logged, allowing easy audits or rollbacks.
  • Faster Iterations: Changes or corrections happen immediately, avoiding miscommunication and missed deadlines.

Functional and Technical Details

A simple, step-by-step UI guides BAs through defining file logic, drawing on relevant contract data from RAD. They select the incoming file, map source columns to standardized fields, and configure transformations or filtering rules—no coding required. All such rules compile into PySpark jobs for large-scale processing (a simplified sketch follows the list below):

  • Mapping & Cleaning: Source columns tie directly to predefined “concept” fields with built-in cleaning.
  • Joining Files: Multiple inputs can be merged on specified keys (left/right/inner/outer).
  • Key Assignment: Conditions in the interface determine record keys, replacing custom-coded logic.
  • Filtering & Overrides: Invalid rows are filtered, and edge cases can be handled with quick overrides.
  • Live Preview: BAs see real-time transformation results before final ingestion, reducing rework.
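
For illustration only, the sketch below shows how a Toolkit rule might compile into a PySpark job; the rule structure, file name, and column names are hypothetical, not the Toolkit's actual internals.

```python
# Hypothetical sketch: compiling a UI-defined rule into a PySpark transformation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("contract-toolkit-sketch").getOrCreate()

# A rule a BA might configure in the UI (names are illustrative).
rule = {
    "source_file": "payer_claims.csv",
    "mappings": {"MBR_ID": "member_id", "SVC_DT": "service_date", "PAID_AMT": "paid_amount"},
    "filters": [F.col("PAID_AMT") >= 0],  # drop invalid rows
}

def compile_rule(rule):
    """Turn the declarative rule into an executable transformation."""
    df = spark.read.option("header", True).csv(rule["source_file"])
    for predicate in rule["filters"]:                 # filtering & overrides
        df = df.filter(predicate)
    mapped = [F.col(src).alias(dst) for src, dst in rule["mappings"].items()]
    return df.select(*mapped)                         # map source columns to concept fields

standardized = compile_rule(rule)
standardized.show(5)  # a rough stand-in for the "live preview" step
```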

Impact and Outcomes

Shifting transformation logic to the BAs who know the data best has drastically improved efficiency and quality:

  • Faster Onboarding: Monthly new contracts rose from 5–6 to 18–20 without developer bottlenecks.
  • Reduced Errors: Miscommunication effectively dropped to zero, and overall data interpretation mistakes decreased by 90% thanks to standardized mapping steps.
  • Quick Turnaround: Onboarding a new payer file takes 3–5 days instead of several weeks.
  • Scalable & Future-Proof: Teams can adapt transformations on the fly as payer requirements evolve, all while keeping a transparent log of changes for compliance and governance.

Data Acquisition and Prechecks

Together, these components handle file ingestion from payer SFTP servers and validate each dataset before it enters the core engine. Although simpler than other products, they form the first crucial steps of our monthly pipeline.

Data Acquisition and Distribution

Prechecks
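
As a rough illustration of this stage, the sketch below assumes prechecks verify that a delivered file exists, carries the expected headers, and contains data rows before hand-off to the core engine; the required columns are hypothetical.

```python
# Hypothetical precheck sketch: file-level validation before the core engine runs.
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"member_id", "service_date", "paid_amount"}  # assumed concept fields

def precheck(path: Path) -> list:
    """Return a list of problems; an empty list means the file may proceed."""
    problems = []
    if not path.exists() or path.stat().st_size == 0:
        problems.append(f"{path.name}: file missing or empty")
        return problems
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            problems.append(f"{path.name}: missing columns {sorted(missing)}")
        if sum(1 for _ in reader) == 0:
            problems.append(f"{path.name}: no data rows")
    return problems
```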

Core Data Processing Engine - A Paradigm Shift from SQL to PySpark

Our Core Processing Engine is the heart of our data pipeline, replacing sequential SQL stored procedures with a parallelized PySpark architecture. This shift enabled us to tackle massive volumes of payer data in near-real time, aligning with modern big-data practices seen at leading tech firms.

Key Technical Highlights

  • Object-Oriented Modularity: The engine breaks down processing into separate classes (e.g., DataLoader, DataCleaner) for ingestion, transformations, and output. New or unusual contract types only require small “child classes,” preserving a scalable, maintainable base (a simplified sketch follows this list).
  • Concurrency & Parallelism: Processing runs on multiple contracts simultaneously, leveraging Spark’s partitioning to reduce the largest datasets from days to minutes. The engine can handle 20–30 incoming contracts at once—a feat previously impossible with sequential SQL logic.
  • Seamless Ecosystem Integration: Pulls contract metadata from RAD, transformation rules from the Contract Toolkit, and sends final outputs to our Data Validation Suite.
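
A heavily simplified sketch of this pattern follows; beyond the DataLoader and DataCleaner names, the classes, paths, and cleaning rules are illustrative, and concurrency is shown with a driver-side thread pool rather than the engine's actual orchestration.

```python
# Simplified sketch of the engine's modular, parallel design (illustrative only).
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("core-engine-sketch").getOrCreate()

class DataLoader:
    """Ingestion stage: read a contract's standardized files."""
    def load(self, contract_id: str) -> DataFrame:
        return spark.read.parquet(f"/staging/{contract_id}/")  # placeholder path

class DataCleaner:
    """Transformation stage: shared cleaning rules."""
    def clean(self, df: DataFrame) -> DataFrame:
        return df.dropDuplicates().na.drop(subset=["member_id"])  # illustrative rules

class CapitationCleaner(DataCleaner):
    """Small 'child class' override for an unusual contract type."""
    def clean(self, df: DataFrame) -> DataFrame:
        return super().clean(df).filter("paid_amount >= 0")

def process_contract(contract_id: str, cleaner: DataCleaner) -> None:
    df = cleaner.clean(DataLoader().load(contract_id))
    df.write.mode("overwrite").parquet(f"/processed/{contract_id}/")  # output stage

# Spark jobs for many contracts can be submitted concurrently from one driver.
contracts = [("contract_001", DataCleaner()), ("contract_002", CapitationCleaner())]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda args: process_contract(*args), contracts))
```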

Challenges and Lessons

Transitioning from stored procedures to a distributed Spark engine demanded a steep learning curve. We faced memory constraints and data skew issues early on, prompting us to:

  • Tune Partitions & Caching: Minimizing shuffle overhead made large file processing 100× faster (see the sketch after this list).
  • Adopt Best Practices: Lazy evaluation concepts, ephemeral clusters, and streamlined code all contributed to robust performance.
  • Educate Teams: Mentoring data engineers on Spark internals fostered a culture of performance awareness and continuous improvement.
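
The sketch below illustrates the partitioning and caching idea under stated assumptions: a join keyed on member_id, placeholder paths, and an arbitrary partition count.

```python
# Illustrative tuning sketch: repartition on the join key and cache reused data
# to reduce shuffle overhead (paths, keys, and sizes are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

claims = spark.read.parquet("/processed/contract_001/")
eligibility = spark.read.parquet("/reference/eligibility/")

# Align partitioning with the join key so the expensive shuffle happens once.
claims = claims.repartition(200, "member_id")

# Eligibility is reused by several downstream steps; keep it in memory after first use.
eligibility = eligibility.cache()

matched = claims.join(eligibility, on="member_id", how="left")
matched.write.mode("overwrite").parquet("/matched/contract_001/")
```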

Performance and Impact

  • Largest contract: from 5 days to 20 minutes
  • Average contract: from 5 hours to 3 minutes
  • Patient matching: from 2 days to 15 minutes
  • Operational Efficiency: Drastically reduced manual intervention. The pipeline now comfortably processes 1,500 files monthly, even under heavy load.
  • Future-Proof Scalability: Flexible class design lets us onboard new payer feeds without overhauling core code, ensuring we keep pace as data volumes and contract complexity grow.

Why it Matters

By adopting PySpark and a truly parallel design, we revolutionized our data processing. Teams gain near-instant visibility into new contract data, supporting real-time analytics and broader value-based care initiatives. The Core Processing Engine turned an outdated, slow ingestion workflow into a streamlined, modern data pipeline—empowering us to scale far beyond our previous limits without sacrificing accuracy or control.

Data Validation and CDW Loading

Data Validation

Purpose: Ensures data accuracy by comparing final engine outputs to raw files and previous-month baselines, catching anomalies (e.g., missing columns, unexpected record counts) before data is marked “valid.”

Prevents corrupted or incomplete data from being warehoused—reducing reprocessing overhead and building trust in final reports.
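
A minimal sketch of two such checks, assuming PySpark DataFrames for the engine output and last month's baseline; the required columns and the 25% count threshold are made-up examples, not the actual validation rules.

```python
# Illustrative validation sketch: column and record-count checks before data is
# marked "valid" (paths, columns, and thresholds are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("validation-sketch").getOrCreate()

output = spark.read.parquet("/processed/contract_001/")
baseline = spark.read.parquet("/processed_prev_month/contract_001/")

issues = []

# 1. Required standardized columns must be present in the final output.
REQUIRED = {"member_id", "service_date", "paid_amount"}
missing = REQUIRED - set(output.columns)
if missing:
    issues.append(f"missing columns: {sorted(missing)}")

# 2. Record counts should not swing wildly against the previous month's baseline.
out_count, base_count = output.count(), baseline.count()
if base_count and abs(out_count - base_count) / base_count > 0.25:
    issues.append(f"unexpected record count: {out_count} vs baseline {base_count}")

print("VALID" if not issues else f"NEEDS REVIEW: {issues}")
```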

CDW Loading

Purpose: Moves validated data to a Cloud Data Warehouse, making it accessible to analytics teams and BI tools.

How It Works: After passing all validation checks, data is inserted or updated in the CDW, typically via a structured SQL or managed ingestion process.
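
For illustration, here is what the load step could look like with Spark's JDBC writer; the warehouse URL, table, and credentials are placeholders, and a real deployment might instead use the warehouse's managed ingestion or a merge/upsert strategy.

```python
# Illustrative CDW load: append validated data via Spark's JDBC writer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdw-load-sketch").getOrCreate()
validated = spark.read.parquet("/validated/contract_001/")

(validated.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://cdw-host:5432/warehouse")  # placeholder
    .option("dbtable", "vbc.claims_monthly")                     # placeholder
    .option("user", "loader")
    .option("password", "***")
    .mode("append")  # upserts/merges would be handled warehouse-side
    .save())
```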

Impact: Business stakeholders can confidently query near real-time data, knowing it has passed rigorous checks. Eliminates the manual steps that once plagued monthly ingestion.

SLA Management Dashboard: From Reactive to Real-Time Pipeline Oversight

Challenge: DataOps relied on emails and manual checks to track pipeline progress. Issues were often discovered only after SLA breaches or significant delays. This reactive approach made it hard to meet strict 5-day SLAs consistently.

Solution: A centralized dashboard to provide live visibility into each pipeline stage. It transformed monitoring from reactive to proactive by introducing instant alerts on delays or errors. The dashboard also includes integrated controls to trigger pipeline re-runs or tasks directly from the UI, enabling immediate intervention without switching tools.

Technical Highlights

API-Driven Integration: A custom API layer connects the dashboard with the data pipelines. It logs each pipeline stage’s start/end times and outcomes to a central store, updating SLA status in real time. The same API accepts commands from the UI to initiate or re-run pipeline tasks on demand, eliminating the need to manually run scripts or use separate tools.

Publish-Subscribe Architecture: The system leverages a pub/sub model (WebSockets) to push live status updates to all users. Multiple DataOps engineers can have the dashboard open and see changes simultaneously (for example, a stage completion or an alert) without any page refresh. This real-time broadcast ensures the whole team shares up-to-the-second information and can act in concert.
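
A minimal sketch of this pattern, assuming FastAPI for the status API and its WebSocket support for the broadcast; endpoint paths, payload fields, and the in-memory stores are illustrative, not the production interface.

```python
# Illustrative API + pub/sub sketch for the SLA dashboard (FastAPI assumed;
# endpoints and payload fields are hypothetical).
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
subscribers = []   # connected dashboard sessions
stages = {}        # latest status per pipeline stage

@app.post("/stages/{stage_id}/status")
async def update_stage(stage_id: str, payload: dict):
    """Pipelines report stage start/end times and outcomes here."""
    stages[stage_id] = payload
    # Publish the change to every open dashboard without a page refresh.
    for ws in list(subscribers):
        try:
            await ws.send_json({"stage": stage_id, **payload})
        except Exception:
            subscribers.remove(ws)
    return {"ok": True}

@app.websocket("/ws/status")
async def status_feed(ws: WebSocket):
    """Dashboards subscribe here for live updates."""
    await ws.accept()
    subscribers.append(ws)
    try:
        while True:
            await ws.receive_text()  # keep the connection open
    except WebSocketDisconnect:
        subscribers.remove(ws)
```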

Impact & Outcomes

  • SLA Adherence: Virtually eliminated SLA breaches by catching and addressing delays before the 5-day deadline, with 100% of monthly contract ingestions now completing within the SLA window.
  • Operational Efficiency: Freed the team from constant manual monitoring. Two engineers now manage 150+ contract pipelines with ease, focusing on flagged issues instead of checking all pipelines manually.
  • Faster Recovery: Real-time alerts and one-click re-runs cut average troubleshooting and recovery time by over 50%. Pipeline issues that once took hours to find and fix are now resolved within minutes, minimizing data downtime.
  • Trust & Transparency: Executives now have a single, transparent source of truth for pipeline status, while DataOps addresses bottlenecks in real time. This shared visibility builds confidence across teams and eliminates the guesswork and last-minute scrambles of the past.