Bridging OpenAI and AWS Bedrock to Expand LLM Evaluation Platform

NIX developed a comprehensive LLM evaluation platform connector to benchmark OpenAI and AWS Bedrock models.

Business Domain

Internet Services and Computer Software
Service

Data Science, AI, Chatbot, Generative AI
Technologies

AWS Bedrock, Python, GIT, Prompt Engineering

Business Overview

Our client is an experienced technology company specializing in enterprise-grade AI solutions for highly regulated and sensitive sectors like FinTech and the healthcare industry. Their core offering is a proprietary LLM evaluation platform that benchmarks various AI agents to define which specific models deliver the optimal balance of accuracy, latency, and transparency for their customers’ unique requirements. By utilizing this platform, their customers can compare the results of the models, examine the reasoning, filters, and rules used, and identify the segment of data that was used to achieve this result.

Historically, the client’s expertise and infrastructure were centralized exclusively within the OpenAI ecosystem. Recognizing the strategic necessity of diversifying their LLM testing offerings to meet the growing AI market, the client partnered with NIX to expand their platform’s capabilities through a robust proof of concept (PoC). Our objective was to engineer a seamless integration layer—a universal connector bridging their existing OpenAI-based framework with the diverse suite of foundation models available via Amazon Bedrock.

Challenge

During the development of the AI connector and training the Bedrock models, our specialists had to adhere to the core client’s principles, namely:

Reducing the number of hallucinations to a minimum: LLM outputs must be precise, objective, and strictly data-driven, preventing the generation of fabricated results.
Transparency and observability: Every result provided by the LLMs required a clear, traceable reasoning chain.

Solution

The client’s existing evaluation workflow relied on a structured process, leveraging OpenAI models alongside specialized libraries to parse PDFs and unstructured text. The models would generate 5–10 targeted questions, extract answers via text parsing, and build a gold-standard dataset to assess model accuracy and precision.

The client’s ecosystem operates across four distinct functional pipelines:

PDF parsing: Extracting data from complex documents
Analysis and querying: Combining parsing with automated querying and response
Review: Manual and automated validation of model outputs
Testing: Final benchmarking and performance stress testing

To diversify the client’s capabilities, our team integrated and compared three leading foundational models within Amazon Bedrock: Anthropic Claude, Mistral AI, and Meta Llama. Given the highly sensitive nature of the client’s domain, we developed the AI connector in a strictly secured environment via GitHub, eliminating the risk of unauthorized access and ensuring that all integration code and LLM evaluation framework met enterprise security standards.

The PoC was executed in two strategic phases:

Phase 1: Focused on the core data ingestion and interaction layers, covering the PDF parsing and querying pipelines.
Phase 2: We developed the connector to support the review and testing pipelines. Thanks to this, our team completed the end-to-end integration with the client’s evaluation platform and significantly expanded its capabilities.

The result was a unified interface that united OpenAI LLMs with Amazon Bedrock models. This allowed the company to provide its customers with a full spectrum of AI models, offering precise, transparent Bedrock LLMs for comprehensive training that meet their unique business needs.

Outcome

Upon the successful completion of the PoC, the client expanded their service offering beyond OpenAI, integrating the full suite of Amazon Bedrock foundation models into their LLM evaluation platform. This diversification allows their enterprise customers to choose the specific model that best aligns with their security, speed, and accuracy requirements. Our comprehensive testing of the three Bedrock models yielded high-performance LLM evaluation metrics across critical categories:

Team:

2 Project Managers Solution Architect DevOps Engineer Data Scientist

Tech stack:

Python AWS Pandas AWS Bedrock GIT Prompt Engineering boto3

Relevant Case Studies

View all case studies

AI Agent for Enterprise-grade Device Management

Internet Services and Computer Software

Manufacturing

Success Story AI Agent for Enterprise-grade Device Management image

AI Hazard Detection for Care Facilities: 98% Accuracy in Safety Threat Prediction

Healthcare

Starday Foods: Scaling to 100K Posts per Hour With AI

Food & Beverages

Driving AI Innovation for a Global Customer Service Leader

Social Networks and Communications

AI-Driven Application for Mental Health Support in the US

Healthcare

AI-powered System: Cybersecurity Report Generation and Risk Mitigation

Healthcare

Contact Us

Processing…

I am interested in:
- Software solutions and services
- PR & Media
- Partnership
- Career
- Other
First Name**
Last Name**
Phone Field Validation
Phone number**
Email**
Leave us a message**
Attach file
Accepted file types: jpg, jpeg, png, webp, pdf, docx, txt, Max. file size: 50 MB.
*
I have read and I agree to the Privacy Policy
*
I have read and agree NIX United’s online cookie notice
This field is hidden when viewing the form
Phone Field Full Number
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.