1. A System Losing Speed: Observations & Root Causes

1.1 Latency, Hidden Costs, and User Frustration

In event-streaming systems, conflicts between real-time and batch processing can lead to slowdowns, incidents, and financial losses.

This was the case for a client who reached out to us after developing a credit insurance subscription application processing hundreds of thousands of customer requests per day — each representing amounts ranging from thousands to millions of euros. These requests are business-critical for their customers, as they often determine whether major commercial deals can go through. A delay of 2 to 3 minutes can be enough to lose a deal worth several million euros. At this scale, every second counts. This is why many customers opt for an additional service that guarantees an automatic response from the subscription application within seconds.

The data load, however, is massive, making it extremely difficult to meet SLA commitments consistently. An efficient technical solution therefore became imperative, with one goal: avoid any bottleneck that could cause lost business opportunities or delayed decisions.

In a highly competitive credit insurance market, perceived performance is a key loyalty factor. Every second of latency creates friction, and every outage weakens the relationship — especially in its early stages.

Some companies even saw their request volumes gradually decline, with traffic likely redirected toward competing platforms. At this point, tech is no longer a back-office concern: it becomes a business driver (or blocker).

This article explores how we re-engineered the system’s architecture by introducing a multithreading dual-stream model, coupled with advanced monitoring — producing measurable results with significant business impact.

1.2 A Complex Technical Context: Cloud Migration and Distributed Stack

The insurer had launched a strategic 10-year cloud migration plan to modernize its critical systems — including the credit insurance subscription app, previously hosted on on-prem infrastructure. This application, directly exposed to clients via a web interface, had to maintain high performance and seamless UX with no visible disruptions.

Today, it runs on a modern but distributed stack, including:

  • AWS ECS (Elastic Container Service) for application services
  • AWS RDS (PostgreSQL) as the target database
  • AWS DMS to migrate data from IBM DB2
  • AWS API Gateway to secure and expose services
  • AWS Kinesis for real-time event processing
  • A Python backend application with PostgreSQL queries for decision-making
  • Terraform for infrastructure as code
  • Grafana and Prometheus for monitoring metrics

Despite the coherent overall setup, production revealed major limitations. Lack of granular visibility into system behavior — incoming volumes, query times, ECS container load, API performance, external dependencies — made identifying bottlenecks extremely difficult.

Though crucial in distributed environments, monitoring was incomplete. The absence of metric correlation prevented a smooth end-to-end understanding of the system, lengthened incident investigations, and ultimately harmed both the client experience and the teams' responsiveness.

1.3 Inadequate Monitoring: How to Diagnose the Unknown?

A major challenge of distributed systems is the ability to understand what’s happening — in real time. In this case, the lack of granular, cross-layer monitoring prevented reliable and fast diagnosis.

The result: guesswork, uncertainty, and often, an inability to locate the root cause of slowness or failure.

In several instances, it was the end-users themselves who reported slowdowns — after the fact. This degraded the user experience, eroded satisfaction, and impacted loyalty in an already competitive landscape.

It was impossible to precisely correlate:

  • Incoming transaction volume in RDS (via API Gateway or DMS)
  • SQL query performance on RDS
  • Python application latency in ECS containers
  • Event processing time in Kinesis
  • Response time from external service dependencies (e.g., API calls)

This lack of visibility — both in the app’s core and its interactions with external services — made identifying bottlenecks slow, uncertain, and costly. Every technical issue became a tedious manual investigation, draining teams and extending time-to-resolution — all while continuing to hurt the user experience.

1.4 Bottlenecks: Sequential Processing + Weak Monitoring = Performance Drag

The subscription app didn’t just handle real-time individual requests — it also needed to process large batches, like mass renewals or contract changes. The issue? These operations were all run through a single, sequential pipeline shared with the client-facing flows.

This setup quickly hit a wall. The competition between batch and real-time traffic for system resources led to major slowdowns and a degraded customer experience — worsened by a lack of visibility into internal or external bottlenecks.

Real-world consequences:

  • Real-time transactions delayed by an average of 5 minutes while large batches blocked the shared pipeline
  • Daily technical incidents, increasing operational workload
  • 15% drop in customer satisfaction
  • Estimated daily losses of €300,000 from abandoned subscriptions

These disruptions had a serious human and organizational cost. Technical and operational teams were overwhelmed by recurring incidents. The lack of proper monitoring made root cause analysis long and inconclusive, reducing team efficiency and motivation. Repetitive manual diagnostic tasks emphasized the urgent need for intelligent, automated observability tools.

This lack of flow prioritization and cross-cutting monitoring posed a structural risk. It wasn’t just a technical challenge — it directly affected service quality, customer retention, and competitiveness.

2. How a Multithreading Architecture and Advanced Monitoring Transformed this Event Streaming App

2.1 Multithreading Architecture: Separate to Prioritize, Process & Scale

The solution required a paradigm shift: moving from a single sequential processor to a multithreading architecture that intelligently separates and prioritizes traffic.

The application uses PostgreSQL logical replication (via a replication slot) to receive transactions in real-time. From there, it generates CloudEvents representing coverage decisions.
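
For illustration, here is a minimal sketch of how such a logical replication stream might be consumed from Python with psycopg2. The connection string, slot name, and JSON payload handling are assumptions made for the example rather than the project's actual configuration, and the slot is assumed to already exist with a JSON output plugin such as wal2json.

```python
# Minimal sketch: consuming a PostgreSQL logical replication slot with psycopg2.
# DSN, slot name, and payload handling are illustrative placeholders.
import json

import psycopg2
import psycopg2.extras

DSN = "dbname=subscriptions user=app host=localhost"   # hypothetical connection string
SLOT_NAME = "subscription_changes"                      # hypothetical, pre-created logical slot

def handle_change(msg):
    """Called for every WAL message streamed from the slot."""
    change = json.loads(msg.payload)    # assumes a JSON output plugin such as wal2json
    # ... turn the change into a CloudEvent and hand it to the consumer threads ...
    print(change)
    # Acknowledge the LSN so the slot does not retain WAL indefinitely.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

conn = psycopg2.connect(DSN, connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()
cur.start_replication(slot_name=SLOT_NAME, decode=True)
cur.consume_stream(handle_change)       # blocks, invoking handle_change for each message
```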

To eliminate bottlenecks and prioritize critical flows, two synchronized consumer threads were introduced. They both read from the same logical stream but apply opposite business predicate filters:

  • Batch thread: handles heavy volumes (renewals, contract updates) without strict latency constraints
  • Real-time client thread: handles transactional requests with optimized response time (down to 1.6 seconds from 5 minutes)

Each thread uses an opposite predicate (P / ¬P) on the same events, ensuring exclusive, exhaustive, and collision-free partitioning of the stream.

In short: two synchronized consumers, one shared stream, inverse filtering logic, and smooth orchestration for parallel processing without conflict.
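
A simplified sketch of this split, assuming a hypothetical is_batch() attribute test as predicate P and in-memory queues standing in for the shared stream, could look like this:

```python
# Sketch of the dual-consumer split: one shared stream of CloudEvents, two threads,
# inverse predicates (P / not-P). The is_batch() test and the in-memory queues are
# illustrative stand-ins for the real business filter and the logical stream.
import queue
import threading

batch_queue: "queue.Queue[dict]" = queue.Queue()
realtime_queue: "queue.Queue[dict]" = queue.Queue()

def is_batch(event: dict) -> bool:
    """Predicate P: True for bulk operations such as renewals or contract updates."""
    return event.get("source") == "batch"          # hypothetical event attribute

def fan_out(event: dict) -> None:
    """Every consumer sees every event; each thread keeps only what its predicate selects."""
    batch_queue.put(event)
    realtime_queue.put(event)

def batch_worker() -> None:
    while True:
        event = batch_queue.get()
        if is_batch(event):                        # predicate P
            pass                                   # heavy processing, no strict latency target
        batch_queue.task_done()

def realtime_worker() -> None:
    while True:
        event = realtime_queue.get()
        if not is_batch(event):                    # predicate not-P
            pass                                   # latency-critical coverage decision
        realtime_queue.task_done()

threading.Thread(target=batch_worker, name="batch", daemon=True).start()
threading.Thread(target=realtime_worker, name="realtime", daemon=True).start()
```

Because P and ¬P are mutually exclusive and jointly exhaustive, every event is handled exactly once, and a long batch run can no longer starve the client-facing path.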

Technical & business outcomes:

  • Clear separation of batch vs. real-time flows
  • Automatic prioritization of critical client ops
  • SLA-compliant response times
  • Dramatic improvement in user experience

2.2 Advanced Monitoring: The Invisible Key to Success

While multithreading solved the latency issues, monitoring was the true enabler of sustainability, quality, and scalability. Over 70% of project effort was invested in instrumentation, traceability, and measurement.

The goal: make bottlenecks visible, production reliable, and continuous improvement data-driven.

This meant implementing detailed, actionable metrics across all components — including external dependencies — enabling an end-to-end view of production behavior.
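
As an illustration of what such instrumentation can look like in a Python backend, here is a minimal sketch based on prometheus_client; the metric names, labels, and port are hypothetical choices, not the project's actual definitions.

```python
# Sketch: exposing per-flow counters and latency histograms from the Python backend.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

EVENTS_TOTAL = Counter(
    "cloudevents_processed_total",
    "CloudEvents processed, split by flow",
    ["flow"],                            # "batch" or "realtime"
)
DECISION_LATENCY = Histogram(
    "coverage_decision_seconds",
    "End-to-end time to produce a coverage decision",
    ["flow"],
)

def process_event(event: dict, flow: str) -> None:
    with DECISION_LATENCY.labels(flow).time():
        # ... SQL query on RDS, business rules, external API calls ...
        time.sleep(random.uniform(0.01, 0.05))    # placeholder for real work
    EVENTS_TOTAL.labels(flow).inc()

if __name__ == "__main__":
    start_http_server(8000)              # /metrics endpoint scraped by Prometheus
    while True:
        process_event({}, flow=random.choice(["batch", "realtime"]))
```

Prometheus scrapes the exposed /metrics endpoint and the Grafana dashboards are built on top of those series, which matches the monitoring stack listed earlier.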

A practical example: smoothing the signal to reveal the essential

At the start of the project, analyzing incoming traffic (batch vs. client requests) produced hard-to-read time series due to high frequency and volume (>1M changes/day across >20K transactions and >300K CloudEvents).

A 5-minute sliding average was first introduced to smooth the data. While it clarified trends, it also masked critical spikes: certain transactions, although constant in volume, had extremely high frequency — which the average failed to show.

A second layer was added: total volume per hour. Combining both metrics enabled cross-analysis (average response time, CloudEvent count, CPU load, etc.) to identify real traffic peaks and trigger proactive actions — like autoscaling resources or dynamically prioritizing processing.
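
To make the idea concrete, here is a small pandas sketch that computes both views over a synthetic per-minute event series; the data is invented for the example, and the same two views translate roughly to PromQL (avg_over_time(...[5m]) and increase(...[1h])) on the dashboards.

```python
# Sketch of the two complementary traffic views: a 5-minute sliding average that
# smooths noise, and hourly totals that reveal peaks the average flattens.
import numpy as np
import pandas as pd

# Synthetic per-minute CloudEvent counts over one day (placeholder data, ~1M/day).
index = pd.date_range("2024-01-01", periods=24 * 60, freq="min")
events_per_minute = pd.Series(np.random.poisson(700, size=len(index)), index=index)

# View 1: 5-minute sliding average, clarifies trends but can hide short bursts.
smoothed = events_per_minute.rolling("5min").mean()

# View 2: total volume per hour, exposes sustained peaks worth reacting to.
hourly_totals = events_per_minute.resample("1h").sum()

print(smoothed.tail())
print(hourly_totals.tail())
```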

Immediate benefits:

  • Proactive anomaly detection
  • Precise root cause analysis
  • Real-time dashboards for product/support teams
  • Faster incident resolution
  • Fewer support tickets → more time for innovation

Monitoring became a value accelerator for both IT and business teams.

2.3 Measurable Results: Direct Impact on Performance, Quality & Profitability

  • Latency reduced by 98% → from 5 minutes to 5 seconds
  • Incidents nearly eliminated → from 1/day to 1/month
  • +20% increase in customer satisfaction score
  • Resolution time cut down to minutes
  • Support load halved, allowing greater product focus
  • Internal confidence and peace of mind restored

This project didn’t just “fix a system” — it strengthened a strategic product, refocused teams on growth, and instilled confidence in the brand from the very first customer interaction.

Conclusion: Toward an Outcome- and Performance-Driven IT

This project proves that well-designed tech — planned from day one for performance, monitoring, and evolution — can deliver real business impact, fast.

Multithreading cleared up system congestion and prioritized critical flows. But what truly made a lasting difference was the monitoring. It allowed us to trace, measure, understand — and most importantly, anticipate. With the right metrics — visible and actionable — teams regained control over a complex system that had previously felt like a ticking time bomb.

The ability to detect peaks, fine-tune filters, smooth out data, and cross-analyze signals to uncover hidden patterns — that changed everything. The result: a more stable system, a better user experience, and far more effective, aligned technical teams.

Facing similar challenges and looking for a performance-first approach?

Get in touch with our experts for an audit and advisory session on high-performance, scalable, and metrics-driven event-driven architectures.