Rescue mission to save a digital learning startup
A leading provider of talent management and online learning solutions for the corporate sector launched a startup to develop and bring to market a new cloud-based digital learning marketplace and e-learning content delivery platform. Once development was deemed complete, the company struggled to launch the system in production because of a number of critical functional and performance issues. Beta users and learning content providers complained that the system was hardly usable, and the customer faced a high risk of losing a major part of its clientele. NIX was invited as a technical consultant to remedy the situation. The customer had a very tight time frame of 3-4 months before the market window closed for another year.
Our first step was to send a rapid response team to the customer's location for a quick evaluation. The team included experts in business analysis, enterprise solutions and DevOps. The analysis revealed a picture of failure. The solution turned out to be a heterogeneous distributed application built on a whole zoo of development platforms (Ruby on Rails, Ruby Sinatra, Go, Node.js). Different components had evidently been developed by different teams with uneven skills and different coding approaches. The code in some components was tangled and riddled with defects. The system design was an unsuccessful attempt to implement a microservices approach. Major design issues were identified in the decomposition of microservices, communication between microservices, the database schema and the data consumption strategy. Inter-service communication, data handling and the approach to authorization created the major bottlenecks, with a snowballing effect. As a result, performance in production was extremely poor: the system could hardly serve even a few concurrent users. Finally, we uncovered major security vulnerabilities caused by mistakes in the implementation of the authorization protocol. Based on the evaluation, the NIX team came up with a plan to resurrect the system in 3 months, which the customer approved.
Within the remaining 3 months our development and DevOps teams carried out a complete reengineering of the system: we reviewed the decomposition of microservices, implemented a new authorization and security strategy, optimized the database schema, re-created components and improved the configuration of the production environment. We got rid of most of the "techno zoo" (the Go, Node.js and Ruby Sinatra components were redeveloped using .NET Core) and optimized the rest as deeply as reasonably possible within the available time frame. A new authorization service was created using .NET Core and IdentityServer4, implementing the OAuth 2.0 protocol with the OpenID Connect layer. The most defective microservices were rewritten from scratch or wrapped using C#/.NET Core to address performance bottlenecks and unify interoperability between the services.
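To illustrate what a centralized OAuth 2.0 authorization service implies for each microservice, here is a minimal sketch of the checks a service might run on an incoming bearer token. It is not the code from the project: the function names, the claim values ("courses-api", "courses.read") and the shared secret are hypothetical, and the token is signed with HS256 only so the example stays within the Python standard library. A service such as IdentityServer4 would typically sign tokens with an asymmetric key (for example RS256), and a production service would use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    # JWTs use URL-safe base64 without trailing '=' padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(claims: dict, secret: bytes) -> str:
    """Hypothetical stand-in for the authorization service (HS256 for brevity)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url_encode(signature)


def validate_token(token: str, secret: bytes, audience: str,
                   required_scope: str, now: Optional[float] = None) -> dict:
    """Checks a microservice would run before serving a request."""
    parts = token.split(".")
    if len(parts) != 3:
        raise PermissionError("malformed token")
    header_b64, payload_b64, signature_b64 = parts

    # Pin the algorithm instead of trusting whatever the token header says.
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")

    # Verify the signature with a constant-time comparison.
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise PermissionError("bad signature")

    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        raise PermissionError("token expired")
    if claims.get("aud") != audience:
        raise PermissionError("wrong audience")
    if required_scope not in claims.get("scope", "").split():
        raise PermissionError("missing scope")
    return claims
```

Because every check is local to the service, a request can be authorized without a round trip to the authorization service, which is one common way to keep authorization from becoming an inter-service bottleneck.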
The reborn system was designed and tested on the Azure platform (although the customer later decided to keep the system in the AWS cloud). As a result, we delivered a secure, resilient system with a horizontally scalable architecture. Performance improved significantly: from the initial limit of 10 requests/second, the system can now withstand peaks of 500 requests/second with an average response time of 3 milliseconds in production. On a performance test bench, the system demonstrated the capacity to serve up to 7500 requests/second with an average response time of up to 10 ms, which is far beyond expectations. The solution went live on time, within the designated market window.
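As a rough way to read these figures, Little's law (average requests in flight = throughput × average response time) can be applied to the numbers quoted above. The sketch below is only arithmetic on those published values; the function name is our own, and the result is a steady-state average, not a measured concurrency level.

```python
def in_flight_requests(throughput_rps: float, avg_response_ms: float) -> float:
    """Little's law: average number of requests being processed at any moment."""
    return throughput_rps * avg_response_ms / 1000.0


# Figures quoted in the case study.
production_peak = in_flight_requests(500, 3)    # 500 requests/s at 3 ms
test_bench_peak = in_flight_requests(7500, 10)  # 7500 requests/s at 10 ms

# Throughput gain relative to the initial limit of 10 requests/s.
production_gain = 500 / 10
test_bench_gain = 7500 / 10

print(f"production: ~{production_peak} requests in flight, {production_gain:.0f}x throughput")
print(f"test bench: ~{test_bench_peak} requests in flight, {test_bench_gain:.0f}x throughput")
```

Under these assumptions, the production peak corresponds to about 1.5 requests in flight on average and the test bench peak to about 75, with throughput gains of 50x and 750x over the original limit.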
Backend: ASP.NET Core, RoR, RabbitMQ, Redis
Frontend: HTML5, ReactJS
Data layer: Microsoft SQL Server (dev and test), PostgreSQL (production), MongoDB
Deployment environment: Azure IaaS (dev and test), AWS (production)