Business Overview

Our client is an experienced technology company specializing in enterprise-grade AI solutions for highly regulated and sensitive sectors such as FinTech and healthcare. Their core offering is a proprietary LLM evaluation platform that benchmarks AI agents to determine which models deliver the optimal balance of accuracy, latency, and transparency for each customer's unique requirements. Using the platform, customers can compare model results, examine the reasoning, filters, and rules applied, and identify the segment of data used to produce each result.

Historically, the client's expertise and infrastructure were centralized exclusively within the OpenAI ecosystem. Recognizing the strategic necessity of diversifying their LLM testing offerings to keep pace with the rapidly growing AI market, the client partnered with NIX to expand their platform's capabilities through a robust proof of concept (PoC). Our objective was to engineer a seamless integration layer: a universal connector bridging their existing OpenAI-based framework with the diverse suite of foundation models available via Amazon Bedrock.

Challenge

During development of the AI connector and work with the Bedrock models, our specialists adhered to the client's core principles:

  1. Minimizing hallucinations: LLM outputs must be precise, objective, and strictly data-driven, preventing the generation of fabricated results.
  2. Transparency and observability: Every result provided by the LLMs required a clear, traceable reasoning chain.
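The transparency principle can be illustrated with a minimal sketch: each model answer is stored together with the ordered reasoning chain and the data segment that grounds it, so reviewers can audit any result. All names and fields here are hypothetical, not the client's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical record attaching a traceable reasoning chain to every LLM output.
@dataclass
class TracedResult:
    model_id: str                                        # which model produced the answer
    question: str
    answer: str
    reasoning: list[str] = field(default_factory=list)   # ordered reasoning steps
    source_segment: str = ""                             # data slice the answer is grounded in

    def audit_trail(self) -> str:
        """Render a human-readable trace for reviewers."""
        steps = "\n".join(f"  {i + 1}. {s}" for i, s in enumerate(self.reasoning))
        return (f"[{self.model_id}] Q: {self.question}\n"
                f"A: {self.answer}\n"
                f"Reasoning:\n{steps}\n"
                f"Grounded in: {self.source_segment}")
```

Persisting a record like this alongside each evaluation run is one straightforward way to meet the "clear, traceable reasoning chain" requirement.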

Solution

The client’s existing evaluation workflow relied on a structured process, leveraging OpenAI models alongside specialized libraries to parse PDFs and unstructured text. The models would generate 5–10 targeted questions, extract answers via text parsing, and build a gold-standard dataset to assess model accuracy and precision.
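The gold-standard comparison step described above can be sketched as a simple exact-match scorer. This is a deliberate simplification under assumed data shapes (question-to-answer dictionaries); the client's actual metrics and dataset format are not public.

```python
def _norm(s: str) -> str:
    """Normalize an answer for comparison (case- and whitespace-insensitive)."""
    return " ".join(s.lower().split())

def score_against_gold(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Return the fraction of gold-standard questions answered exactly.

    `predictions` and `gold` each map a question to an answer string;
    questions missing from `predictions` count as misses.
    """
    if not gold:
        raise ValueError("gold-standard dataset is empty")
    hits = sum(
        1 for question, expected in gold.items()
        if _norm(predictions.get(question, "")) == _norm(expected)
    )
    return hits / len(gold)
```

Running each candidate model's answers through a scorer like this over the same gold-standard set is what makes per-model accuracy figures directly comparable.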

The client’s ecosystem operates across four distinct functional pipelines:

  1. PDF parsing: Extracting data from complex documents
  2. Analysis and querying: Combining parsing with automated querying and response
  3. Review: Manual and automated validation of model outputs
  4. Testing: Final benchmarking and performance stress testing

To diversify the client's capabilities, our team integrated and compared three leading foundation models within Amazon Bedrock: Anthropic Claude, Mistral AI, and Meta Llama. Given the highly sensitive nature of the client's domain, we developed the AI connector in a strictly secured environment via GitHub, eliminating the risk of unauthorized access and ensuring that all integration code and the LLM evaluation framework met enterprise security standards.
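Bedrock model IDs are namespaced by provider (for example `anthropic.claude-...`, `mistral...`, `meta.llama...`), so a connector can validate and route requests on that prefix. The sketch below illustrates the idea only: the transport callable stands in for the actual boto3 `bedrock-runtime` invocation, which is omitted here, and the routing logic is our assumption rather than the client's implementation.

```python
from typing import Callable

# Provider namespaces the connector sketch accepts.
SUPPORTED_PROVIDERS = {"anthropic", "mistral", "meta"}

def provider_of(model_id: str) -> str:
    """Return the provider namespace of a Bedrock model ID, failing fast otherwise."""
    provider = model_id.split(".", 1)[0]
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported Bedrock provider: {provider!r}")
    return provider

def invoke(model_id: str, prompt: str,
           send: Callable[[str, str], str]) -> str:
    """Validate the model ID, then delegate to the transport callable.

    `send` stands in for the real Bedrock call (e.g. a boto3
    bedrock-runtime client); stubbing it keeps the routing testable.
    """
    provider_of(model_id)  # reject unknown providers before any network call
    return send(model_id, prompt)
```

Keeping validation separate from transport is what lets the same connector front both the OpenAI framework and the Bedrock model families behind one interface.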

The PoC was executed in two strategic phases:

  • Phase 1: Built the core data ingestion and interaction layers, covering the PDF parsing and querying pipelines.
  • Phase 2: Extended the connector to support the review and testing pipelines, completing the end-to-end integration with the client's evaluation platform and significantly expanding its capabilities.

The result was a unified interface bridging OpenAI LLMs and Amazon Bedrock models. This allows the company to offer its customers a full spectrum of AI models, including precise, transparent Bedrock LLMs that meet their unique business needs.

Outcome

Upon the successful completion of the PoC, the client expanded their service offering beyond OpenAI, integrating the full suite of Amazon Bedrock foundation models into their LLM evaluation platform. This diversification allows their enterprise customers to choose the specific model that best aligns with their security, speed, and accuracy requirements. Our comprehensive testing of the three Bedrock models yielded high-performance LLM evaluation metrics across critical categories.

Team:

2 Project Managers, Solution Architect, DevOps Engineer, Data Scientist
Tech stack:

Python, AWS, Pandas, AWS Bedrock, Git, Prompt Engineering, boto3
