Data Processing System for BI Tool Integration with DWH for Healthcare Analytics

NIX empowered a healthcare organization with a robust data management system that leverages enhanced Airflow pipelines to streamline curated content uploads and generate actionable healthcare insights.

Business Domain

Healthcare

Service

Data Engineering, ETL/ELT, Business Intelligence

Technologies

Oracle, Apache Spark, Apache Airflow

Business Overview

The project is a Big Data processing ecosystem that orchestrates ETL processes, starting with the ingestion of initial data from different data sources. It augments medical and prescription drug claims, encounters, qualification, and health assessment data for further business analysis using a Data Mart and BI tools with curated content.

The system’s target audience is insurance companies and businesses that buy insurance plans for employees and want to analyze cost patterns and trends for further optimization. Insurance companies can examine changes in their expenses, such as cash flow changes, financial indicators by year, and costs by region or disease. They can also gain insights from their own data by running competitor analysis on indicators such as region or industry.

Curated interactive dashboards provide analytics to understand the cost, use, and quality of the chosen data layer. Moreover, curated datasets allow users to build custom dashboards that provide insights from a business perspective, so they can focus on the analysis rather than digging into schematics. Through dashboards, users can query the data and drill down to the level of detail they need to uncover practical insights.

The system allows users to:

Access the essential data

Through intuitive dashboards, users can access all the necessary information on trends and treatment patterns, as well as benchmark comparisons.

Identify hidden patterns

Machine learning algorithms can identify trends by analyzing thousands of healthcare data elements, highlighting emerging cost drivers to justify early intervention.

Create custom dashboards

Users can create custom dashboards, with drag and drop simplifying the choice of visualization options. A user can enter data, run it through the ETL process, and build custom dashboards that focus on the relevant information, using the data engineering tools for validation.

The NIX team provides a configured ecosystem, sets up ETL processes, and manages curated data based on the client’s data structure and needs to bring business value to the end-users.

The system focuses on batch data processing and on enhancing the data with different analytics methodologies and models. This optimizes the data for analysis with BI tools and supports the development of curated content for the BI tool, which improves the user experience by simplifying the data interface.

Challenge

We needed to build a system that efficiently processes data and content updates for the BI tool while meeting strict security requirements, including HIPAA compliance. Moreover, the system needed to meet complex multi-tenant requirements that include two access control levels for end-users: data set and type, and row level.

Support complex workflows across various technologies and cluster types (Kubernetes and Spark on Apache YARN), including validating, filtering, and processing data through ML models using Spark and Docker-based deployment.
Deliver prepared content so end-users can spend less time preparing data and more time analyzing it.
Implement an efficient data model so the Data Mart performs well in the BI tool despite a massive number of data records.
Handle multiple types of data sources and data formats, including data layout specifics per customer.
Make it possible to customize and optimize the Data Mart data models to fit each customer’s needs while keeping performance high.
Set up a curated content development process, including synchronization of assets with the legacy system and the possibility to add new components, both general and customized, for a separate tenant.
Test the system, taking into account the complexity of the flow and the data model customization, as well as the use of both general curated content and content specific to each tenant.
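The two access control levels mentioned at the start of this section (data set and type first, then row level) can be illustrated with a small sketch. It is an assumption-laden illustration only, not the project's actual security layer: `TenantPolicy`, `read_rows`, and the sample column names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TenantPolicy:
    """Hypothetical per-tenant policy; all names here are illustrative."""
    tenant_id: str
    # Level 1: data sets (or data types) the tenant is allowed to open.
    allowed_datasets: set
    # Level 2: column name -> set of values the tenant is allowed to see.
    row_filter: dict = field(default_factory=dict)


def read_rows(policy, dataset_name, rows):
    """Apply the data set check first, then filter individual rows."""
    if dataset_name not in policy.allowed_datasets:
        raise PermissionError(
            f"tenant {policy.tenant_id!r} may not read {dataset_name!r}"
        )
    return [
        row
        for row in rows
        if all(row.get(col) in allowed for col, allowed in policy.row_filter.items())
    ]
```

In this sketch a tenant that is allowed to read a data set still sees only the rows whose values match its row filter, which is the general idea behind combining the two levels.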

Solution

For data pipelines, the team used Airflow to orchestrate the ETL steps. These included reading data from various sources (SQL queries against the DWH, CSV files from SFTP, and Parquet and Avro files from S3) and building a complex pipeline for data augmentation.

Airflow is deployed to the Kubernetes cluster, which hosts the orchestration infrastructure and runs scoring with dockerized ML models. Airflow handles dependencies, including monitoring sources for data, submitting Spark jobs, and executing tasks against Kubernetes, as well as routines such as notifications, monitoring, and managing the ETL flows for the data itself.
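The dependency handling described above boils down to running steps in an order that respects a directed acyclic graph. The sketch below shows that idea with Python's standard `graphlib` module rather than Airflow itself, so it runs without a scheduler. The step names and the `run_pipeline` helper are hypothetical and are not taken from the project.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the steps it depends on.
PIPELINE = {
    "wait_for_source_data": set(),
    "validate": {"wait_for_source_data"},
    "augment": {"validate"},
    "score_models": {"augment"},
    "load_data_mart": {"score_models"},
    "notify": {"load_data_mart"},
}


def run_pipeline(steps, handlers):
    """Run each step's handler only after all of its dependencies have run."""
    executed = []
    for name in TopologicalSorter(steps).static_order():
        handlers[name]()  # in a real orchestrator this would be a task run
        executed.append(name)
    return executed
```

An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this ordering, which the sketch deliberately leaves out.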

As the main ETL engine, we used Spark and PySpark, running Spark applications on an Apache YARN cluster that was created with Ansible scripts and deployed across multiple VM servers. The YARN cluster can be scaled horizontally in the future or replaced with a cloud-based, elastic, auto-scalable Spark service. Spark on Kubernetes, with its flexible cluster, is another option. The dedicated Apache YARN cluster processes the Spark applications that read, validate, and augment data and score it against models.
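Submitting a PySpark application to a YARN cluster is typically done through `spark-submit`. The following sketch builds such a command line as a list of arguments. The application file name, its arguments, and the resource values are assumptions made for illustration. Only the `spark-submit` options themselves (`--master`, `--deploy-mode`, `--num-executors`, `--executor-memory`) are standard.

```python
def build_spark_submit(app_path, app_args=(), num_executors=4,
                       executor_memory="4g", deploy_mode="cluster"):
    """Build a spark-submit argument list that targets a YARN cluster."""
    cmd = [
        "spark-submit",
        "--master", "yarn",               # run on the YARN cluster
        "--deploy-mode", deploy_mode,     # "cluster" runs the driver on YARN
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        app_path,                         # hypothetical PySpark application
    ]
    cmd.extend(app_args)                  # arguments passed to the application
    return cmd
```

A list of arguments like this can be handed to `subprocess.run` or to whatever component the orchestrator uses to launch Spark jobs. Moving to another cluster manager would mainly change the `--master` value.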

Airflow not only orchestrates and chains Spark application submissions but also handles cases where scoring uses a pre-trained model that is packaged in a Docker container and either submitted to the Kubernetes cluster or deployed as a REST service.

For curated content delivery, we enhanced the Airflow pipelines to handle curated content uploads and to automatically trigger the data pipelines that reload data, along with the Liquibase scripts that update the Data Mart database schema.

The team implemented a converter from legacy curated content formats (XML, CSV, and YAML) to JSON, together with a process for updating and adding content that takes into account the legacy system’s specifics and covers both new general assets and assets specific to each tenant.
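A format converter of this kind can be sketched as a small dispatch table that maps each legacy format to a parser and serializes the result as JSON. The example below covers only XML and CSV, because YAML parsing is not part of the Python standard library. The flat `<assets><asset>...</asset></assets>` layout and all function names are assumptions, not the project's real content schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def csv_to_records(text):
    """Parse CSV text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def xml_to_records(text):
    """Parse a flat XML document: each child of the root becomes one record."""
    root = ET.fromstring(text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root
    ]


# Hypothetical registry; a YAML parser could be added here in the same way.
CONVERTERS = {"csv": csv_to_records, "xml": xml_to_records}


def legacy_to_json(text, fmt):
    """Convert legacy content in the given format to a JSON string."""
    if fmt not in CONVERTERS:
        raise ValueError(f"unsupported legacy format: {fmt}")
    return json.dumps(CONVERTERS[fmt](text), indent=2, sort_keys=True)
```

Keeping one parser per format behind a common function means a new legacy format only needs one more entry in the registry.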

A notable challenge solved by pipeline orchestration was handling common logic and curated content alongside customer-specific content in a way that supports optimization and easier maintenance.
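One simple way to picture the combination of general and customer-specific content is a layered lookup in which tenant assets override or extend the general set. The sketch below assumes that content can be represented as a mapping from asset name to asset location. The `resolve_content` function and the sample paths are hypothetical.

```python
def resolve_content(general, tenant_overrides, tenant_id):
    """Return the general assets with the tenant's own assets layered on top."""
    merged = dict(general)  # copy so the shared general set is never mutated
    merged.update(tenant_overrides.get(tenant_id, {}))
    return merged
```

With this layout the general content is maintained once, and each tenant stores only the assets that differ from it.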