Overview
A NSW Government agency, responsible for a critical application managing up to 3 million transactions
annually from multiple providers, partnered with DevOps1 to ensure the robustness and scalability of their
platform.
The project focused on implementing a rigorous performance testing framework and a secure Site Reliability
Engineering (SRE) platform. By integrating advanced tooling and AI-driven analysis, the agency aimed to
proactively identify performance and quality risks and ensure seamless service delivery.
Challenges
The agency faced several critical challenges in ensuring the reliability of their high-volume transaction
system:
- High Transaction Volume: The application processes up to 3 million transactions per
year, requiring absolute stability and performance under load.
- Complex Integration Ecosystem: Data is ingested from multiple external providers,
creating complex integration points that are prone to bottlenecks.
- NFR Assurance: The agency needed to identify and mitigate risks to its non-functional
requirements (NFRs) for performance and quality, such as latency and scalability limits, before the
system went live.
- Analysis Bottlenecks: Traditional performance analysis was time-consuming, relying
heavily on specialised SRE resources to manually review vast amounts of data to identify trends.
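To make the volume figure concrete, the sketch below converts an annual transaction count into an average rate and an assumed peak rate that a load test could target. Only the 3 million figure comes from the case study. The business-hours window and the burst factor are hypothetical planning assumptions, not the agency's actual traffic profile.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def average_tps(annual_transactions: int) -> float:
    """Average transactions per second if load were spread evenly over a year."""
    return annual_transactions / SECONDS_PER_YEAR


def assumed_peak_tps(
    annual_transactions: int,
    business_days: int = 250,   # hypothetical: working days per year
    business_hours: int = 8,    # hypothetical: active hours per working day
    burst_factor: float = 5.0,  # hypothetical: peak-to-average ratio in active hours
) -> float:
    """Peak rate assuming all traffic arrives in business hours, scaled by a burst factor."""
    active_seconds = business_days * business_hours * 60 * 60
    return annual_transactions / active_seconds * burst_factor


if __name__ == "__main__":
    annual = 3_000_000
    print(f"average rate: {average_tps(annual):.3f} TPS")
    print(f"assumed peak: {assumed_peak_tps(annual):.2f} TPS")
```

The gap between the two numbers shows why the peak assumption, rather than the annual total, usually drives the load profile of a test.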
Solution
DevOps1 designed and implemented a bespoke secure SRE and Performance Testing platform tailored to the
agency's specific needs.
Primary activities
- Comprehensive Performance Strategy: Defined and executed performance test scenarios
that accurately simulated peak loads and complex transaction flows.
- Automated Execution: Implemented automated test execution so that performance
validation runs consistently and repeatably.
- AI-Driven Analysis: Deployed AI agents to automatically analyse test results,
identifying anomalies and trends that might escape human review.
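As a rough illustration of the kind of check that automated analysis can build on, the sketch below computes a 95th-percentile latency per test run and flags runs that regress against earlier ones. It is a simple statistical baseline under stated assumptions, not the AI agents described above. The function names, the 20% tolerance and the sample values are all hypothetical.

```python
import math
from statistics import median


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def flag_regressions(run_p95s: list[float], tolerance: float = 0.20) -> list[int]:
    """Return indices of runs whose p95 exceeds the median of all earlier runs
    by more than the given tolerance (hypothetical default: 20%)."""
    flagged = []
    for i in range(1, len(run_p95s)):
        baseline = median(run_p95s[:i])
        if run_p95s[i] > baseline * (1 + tolerance):
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical p95 latencies (ms) from five consecutive test runs.
    history = [210.0, 205.0, 215.0, 320.0, 212.0]
    print("regressed runs:", flag_regressions(history))
```

A check like this can only catch what its thresholds anticipate. Trend explanation and correlation across metrics are the parts left to human or AI-assisted review.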
Tooling & integration
- Grafana k6: Utilised for its developer-friendly, scalable load testing capabilities,
allowing for precise simulation of user behaviours.
- InfluxDB: Implemented as the high-performance time-series database to store and query
massive volumes of performance metrics.
- Amazon Bedrock Agents: Harnessed generative AI to act as an intelligent SRE assistant,
analysing performance data to provide detailed insights and trend analysis.
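To show the shape of the data that flows between these tools, the sketch below formats a single measurement in InfluxDB line protocol, the text format InfluxDB accepts for writes. k6 can export metrics to InfluxDB through its own output options, so a project would not normally hand-write this. The helper names, tag names and values are hypothetical, and `http_req_duration` is used only as an example metric name.

```python
def _escape(text: str) -> str:
    """Escape commas, equals signs and spaces in tag keys, tag values and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value) -> str:
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'        # strings are double-quoted


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(f",{_escape(k)}={_escape(str(v))}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={_format_field(v)}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    line = to_line_protocol(
        "http_req_duration",
        {"scenario": "peak_load", "status": "200"},  # hypothetical tags
        {"value": 231.5},                            # hypothetical latency in ms
        1_700_000_000_000_000_000,
    )
    print(line)
```

Tags such as the scenario name are indexed, so they are what later queries and trend analysis would typically group by.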
Benefits
The implementation of the AI-enhanced performance platform delivered significant operational and strategic
benefits:
- 80% Faster Analysis: With an AI assistant applying SRE knowledge, the team could produce detailed
analysis and understand complex trends up to 80% faster than through traditional manual reviews by specialised SREs.
- Proactive Risk Mitigation: Identified and resolved potential performance and quality
issues before go-live, helping to prevent production incidents.
- Scalability Assurance: Validated the platform's capacity to handle the projected 3
million annual transactions with confidence.
- Enhanced Engineering Efficiency: Freed up specialised SRE resources from manual data
crunching, allowing them to focus on strategic improvements and architecture.