Talend, Oracle, Spark
The client is a multinational company that serves the combined industries of health information technologies and clinical research.
The company needed to migrate data from various clients to a new multi-tenant SaaS platform and implement ETL procedures from the SaaS OLTP database to data warehouse (DWH) storage. A DWH is a database system kept separate from the operational data-handling systems, in which data from different, sometimes highly heterogeneous, sources is consolidated and archived for the long term.
All existing client applications were located in different countries and used different data structures and schemas.
The main challenge was the heterogeneity of the data coming from the old systems, and we needed to build a data platform capable of handling large, recurring data streams.
The client asked us to speed up data output and improve user interaction with the system, since this would help attract more investors and clients.
We started with the first project, which involved migrating data to the new DWH and solving the problem of storing all data sources – 500 GB per country – in one place. The idea was to develop an ETL solution that loads data using a unified data model.
The data includes a basic set of about 20 main entities – doctor, patient, prescription, diagnosis, and more.
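To give an idea of what a unified model for such entities can look like, here is a minimal Scala sketch; the field names and tenant-ID scheme are illustrative assumptions, not the project's actual schema:

```scala
// Minimal sketch of a unified, multi-tenant entity model.
// The real model covers ~20 entities; fields shown here are assumptions.
case class Doctor(id: Long, tenantId: String, specialty: String)
case class Patient(id: Long, tenantId: String, birthYear: Int)
case class Diagnosis(code: String, description: String)
case class Prescription(id: Long, doctorId: Long, patientId: Long,
                        diagnosis: Diagnosis, drug: String)
```

A `tenantId` field on each record is one common way to keep per-client data separated inside a shared multi-tenant store.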
The NIX team suggested developing and validating data load mappings with a two-step data load:
Load from the source MS SQL database into the Oracle staging area using Talend – the first stage of data cleaning. We later moved from Talend to PL/SQL (including the DML Error Logging database feature), since Talend offered too few options for customizing different features, and the PL/SQL procedures were easier to support and update.
Load from the staging area into the main data warehouse using Oracle SQL/PL/SQL. Oracle DB has a built-in ability to save rows rejected by business-rule checks into separate tables without creating additional processing logic.
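In production this clean/reject split is handled by Oracle's DML Error Logging; the following plain-Scala sketch only illustrates the idea, and all names (`SourceRow`, `isValid`) are hypothetical:

```scala
// Illustrative sketch of the staging-area validation step.
// Mirrors the LOG ERRORS behaviour: valid rows go to the target,
// invalid ones to a separate "error table" instead of failing the load.
case class SourceRow(id: Int, patientId: Option[Int], drug: String)

object StagingLoad {
  // Example rule (assumption): a row needs a patient reference and a drug name.
  def isValid(r: SourceRow): Boolean =
    r.patientId.isDefined && r.drug.nonEmpty

  // Returns (clean rows, rejected rows).
  def load(rows: Seq[SourceRow]): (Seq[SourceRow], Seq[SourceRow]) =
    rows.partition(isValid)
}
```

The advantage of this pattern is that one bad record does not abort the whole batch; rejected rows remain available for later analysis and reprocessing.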
The second project (Cloudera + Spark + Looker) was to move the ETL stack to something less expensive and easier to manage. The project provides access to physicians' prescription data for certain types of analyses, with the key constraint that no nominative (personally identifying) access to that data can be made available. The NIX team needed to generate CSV reports from DWH data – based on input files with parameters such as time period, extension, and set of entities – and merge them with data from other systems.
Initially, the view tables contained a small number of entities and the system handled processing well. As the data flow grew, more processing capacity was needed, which is why we decided to move from Oracle DB to Cloudera's Hadoop distribution, as it offered more extensive capacity.
We also used software that orchestrates the business processes making up the pipelines for processing vast amounts of data. These cover big data flows (Spark/Hadoop) as well as domain-specific data processing written in Scala.
We evaluated the current solution, compared it with possible alternatives, refactored the existing code, and then rewrote it as a single Spring Boot application, adding several new features. Using Sqoop scripts and Camunda as the orchestrator, we successfully transferred the data from Oracle to Cloudera.
The portal receives data from different sources, such as sports data providers and trader teams, on the server side, and transfers information about games, teams, players, and their statistics to the database.
We use a custom algorithm that calculates the bets on a player, taking into account the player's average results in previous games to provide forecasts for the upcoming game. We then create reports for trading tools that allow traders to influence the algorithm's calculations if needed.
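As a toy illustration, a forecast based on a player's recent average might look like the sketch below; the real algorithm is proprietary, so the function name, window size, and scoring inputs are all assumptions:

```scala
// Hypothetical sketch: forecast a player's next-game score as the
// average of their last `window` games. The production algorithm is
// more involved and can be adjusted by traders via reporting tools.
object Forecast {
  def predict(pastScores: Seq[Double], window: Int = 5): Double = {
    val recent = pastScores.takeRight(window)
    require(recent.nonEmpty, "need at least one past game")
    recent.sum / recent.size
  }
}
```

A sliding window like this lets recent form dominate the forecast instead of a player's full career history.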
First, the team receives input parameter files in CSV format that cover five years of data.
The data is processed and validated on the Spark side; the business analysts work only with anonymized data.
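The project's actual de-identification scheme is not described here, but one common approach is to replace nominative identifiers with a one-way hash before analysts ever see the data. A hedged Scala sketch of that idea (all names are assumptions):

```scala
import java.security.MessageDigest

// Illustrative de-identification sketch (not the project's actual scheme):
// replace a nominative ID with a stable one-way token so analysts can
// still join records without seeing real identifiers.
object Anonymize {
  def pseudonym(id: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(id.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
      .take(16) // shortened token; stable for the same input
}
```

Because the same input always yields the same token, joins across tables still work on the pseudonymized keys.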
Spark jobs select the required information from Hadoop – three input files with parameters, doctors, and patients – and produce output files with doctors and prescriptions.
Once the data is validated and checked, it is stored in HDFS in Parquet format for future customization. A Spark job then queries the Hadoop tables and retrieves additional data on products, patients, and diagnoses.
After Spark has processed all the data and compiled the two output files, we assemble them into CSV reports.
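In simplified terms, the join-and-export step works like the sketch below; plain Scala collections stand in for Spark DataFrames over Parquet tables, and the record and field names are assumptions:

```scala
// Simplified stand-in for the Spark join + CSV export step.
// In production this runs as a Spark job over Parquet data in HDFS;
// plain collections here illustrate the same join-and-format logic.
case class Physician(id: Long, specialty: String)
case class Rx(doctorId: Long, drug: String)

object Report {
  def toCsv(doctors: Seq[Physician], prescriptions: Seq[Rx]): Seq[String] = {
    val byId = doctors.map(d => d.id -> d).toMap
    val header = "doctor_id,specialty,drug"
    // Inner join: prescriptions without a matching doctor are dropped.
    val rows = prescriptions.flatMap { rx =>
      byId.get(rx.doctorId).map(d => s"${d.id},${d.specialty},${rx.drug}")
    }
    header +: rows
  }
}
```

In the real pipeline the same join runs distributed across the cluster, with the resulting rows written out as the final CSV report files.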
The client received a fully functional, multi-tenant SaaS platform with homogeneous data that gives prospective clients a convenient tool for health data analysis. The ETL procedures implemented from the SaaS OLTP database to DWH storage act as a unified source that simplifies management through high-speed processing and accurate output.
ETL project goals were achieved with:
A development process with separate development, testing, and production environments, allowing ETL results to be tested and verified for new clients with different data schemas.
Team: 10 experts (4 Java Developers, a Lead Data Engineer, 4 Data Scientists, and a Data Analyst)
Tech stack: Talend ETL tool, Oracle SQL / PL/SQL, Maven, Spring Boot, Scala, Cloudera Hadoop, Spark jobs (Scala), Camunda