Revolutionize Platform Engineering with Traffic Replay
Platform engineers focus on developing an internal developer platform to increase velocity during the software development process. Dedicated internal infrastructure allows teams to deliver new features faster by automating key tasks such as creating environments and running tests.
The most effective developer platforms let you deploy to production with confidence by verifying all changes meet consistent quality, security, and performance requirements. Attaining performance standards can be challenging, however, because it’s often hard to implement tooling that’s capable of measuring real-world performance. This article explains how traffic replication lets you regularly run realistic performance tests to enhance your platform engineering tools.
The Need for Frequent Performance Testing
Frequent performance tests are essential to ensure users continue to receive the experience they expect. Changes can be secure, reliable, and pass all functional tests, but they should still be blocked from deploying if they introduce latency or cause unacceptable response times. Allowing performance issues to persist will frustrate existing users and could discourage prospective buyers.
Performance regressions can be tricky to identify. With so many variables affecting performance (both actual and perceived), platform engineering teams may struggle to pinpoint the cause of observed issues:
- Code changes can introduce inefficiencies through suboptimal processing.
- User growth can lead to general slowdowns due to reduced server capacity.
- Users may experience different latency as they move between networks and geographic locations.
- Updates to dependencies within your supply chain can cause unseen performance problems, even when you’ve not made changes to your own code.
- The application could meet all performance targets but have low perceived performance, perhaps because of excessively long animations or screens that make users unnecessarily move through multiple steps.
- Competitors may optimize their own services, causing users to reevaluate their expectations and make unfavorable comparisons.
Failure to prioritize performance testing creates long-term pressure on platform engineering teams. Performance issues will accumulate, leading to tech debt that’s increasingly complex for engineering organizations and software developers to address. End users will feel performance suffering, which can push platform teams to alternative solutions. Moreover, slow performance is often caused by inefficient resource utilization, meaning your operating costs will be unnecessarily inflated. The combination of these factors creates a downward spiral that negatively affects the service capabilities of your business value.
Frequent performance testing helps mitigate these problems. Using tangible data to compare the performance of your releases lets you detect anomalies, evaluate trends, and make incremental adjustments to maintain and improve end users experience. But how can you collect this data and anticipate the effects of changes without actually deploying them to production?
Using Production Traffic Replay to Accelerate Performance Testing
Production traffic replication (also known as traffic replay) is the solution to achieving frequent performance testing within your internal developer platforms. This technique captures real traffic to your production environments, then lets you replay it on demand against your other environments—such as your development, test, and staging instances.
Replication enables fully automated performance testing that’s accessible to every engineer. Devs can use your existing platform engineering tools to start a new environment for their changes, then simulate real-life usage based on previously observed traffic conditions. This creates an opportunity to identify and resolve regressions before they’re experienced by users.
Benefits of Traffic Replication
Setting up traffic replication provides multiple benefits to developers and platform engineers alike. Listed are the the main advantages for your platform team.
Realistic Performance Analysis
Replays accurately represent real-world user activity. Your internal platforms can repeat actual scenarios that have occurred in the past to see how the performance of key user flows (such as login or checkout) changes between revisions.
Ability to Prepare for Performance Extremes
Preparing optimizations for an influx of sales traffic around Black Friday or the holidays? With traffic replication, you can test your systems by replaying peak traffic from the past—such as last year’s sale—to check if your improvements are likely to be effective.
Efficient Use of Resources
Traffic replication is a more efficient type of performance testing. You can capture and store meaningful production traffic once, then direct it to the specific temporary development environments that developers are actually interacting with. The use of data compression, network caches, and on-demand environments that self-destruct after use minimizes the overhead required to run realistic performance tests.
Self-Service Access for Developers
Integrating traffic replication with existing platform engineering frameworks gives developers self-service access to performance tests. They can use established tooling to start an environment, replay traffic, and obtain consistent test results. This improves the developer experience and enhances developer productivity by expanding the engineering role.
Easier to Automate
Automation is a staple for best engineering practices. Using traffic replication makes it simpler to automate predictable performance tests. Performance engineers often spend huge amounts of time manually creating and maintaining test scripts, only to find they don’t accurately reflect real-world usage. Replication tools like Speedscale avoid this by continually capturing performance data from actual user requests. You can then replay the traffic against your environments without having to handcraft your own test frameworks.
Reproducible Analysis
Production traffic replays permit reproducible analysis of performance problems. If an issue is identified in a particular endpoint, you can check the captured performance data to determine whether the slow response was an outlier or if all users are affected. Afterwards, you can replay the traffic in different environments and under varying conditions to confirm the extent of the problem. Members from outside your platform engineering team—such as security or QA specialists—can easily recreate the experiment too by running the same replays against their own environments.
Challenges and Considerations for Traffic Replication
Setting up traffic replication removes risk from code deployments and infrastructure changes by allowing you to simulate the performance impact ahead of time. Nonetheless, implementing replays still requires careful consideration to avoid potential pitfalls.
Data privacy and security are foremost among the challenges you’ll face. Although you should aim to replicate real production data, it won’t always be appropriate to replay it exactly as it was sent. Sensitive values—such as passwords, tokens, PII, and payment card details—need to be removed or mocked with fake replacements to preserve privacy and compliance. It’s best to use a solution that can handle these tasks automatically, such as Speedscale with its integrated value replacement and simulation support.
As discussed above, production traffic replication can improve the scalability of performance testing. However, it also comes with its own resource management challenges. The traffic data you ingest typically requires substantial storage capacity, so it’s important to clean up old records that are no longer required.
Similarly, replaying traffic into new environments can be expensive in both computational and I/O terms, depending on the number of events that need to be replicated. It’s important to plan for possible costs and evaluate different cloud solutions to ensure your replication tasks stay scalable. Replays must complete as quickly as possible so that developers can receive performance test results when they need them. This keeps the SDLC feedback loop short and efficient.
Implementing Traffic Replication
Traffic replication should be implemented through deep integration with your platform engineering processes. This enables a holistic approach to performance testing that’s within easy reach for all developers.
Tools and Technologies
You need to select the right replication tools to get the best results from traffic replication. All-in-one solutions such as Speedscale can save you time and money by providing everything you need to capture traffic, sanitize it, and run replays in your environments. Conversely, having to assemble your own solution using separate analysis, testing, and mocking services incurs a higher setup time and more management overheads.
Your chosen system should be included in your testing pipeline so that developers can access it through your internal developer platforms. Combining traffic replays with CI/CD pipelines lets you run performance tests before code is merged and deployed, guaranteeing that consistent performance baselines are maintained.
It’s also important to ensure you can replicate traffic for all the components and architectures you’re using. Cloud native apps and containerized workloads can require different tools to run on cloud providers versus on-premise data centers or legacy networks, while individual application types such as websites and APIs can also carry unique requirements for analyzing and replaying events.
Best Practices
Here are three best practices to follow as you configure traffic replication for your services.
1. Capture and Replicate Realistic Traffic
Traffic replication strategies should be oriented toward realistic traffic flows. This means excluding any outlying traffic that masks real trends in activity while allowing such traffic to be included when simulating peak load and other extreme events. On the other end of the scale, always replaying small volumes of artificially generated traffic won’t provide an accurate picture of your service’s actual performance.
2. Design Accurate Test Cases for Replication
Test cases need to be carefully designed for compatibility with traffic replication. Keeping tests simple allows for more precise analysis of the effects of each change. It also ensures you target real-world use cases that exhibit meaningful metrics you can improve upon. Focused tests also improve efficiency because you don’t need to replay as much traffic before you run them.
3. Focus on Meaningful User Flows
Traffic replication is most impactful when applied to your core user flows. Replaying traffic that targets your checkout system, chat message server, or photo upload system lets you track the performance of these critical components to understand where inefficiencies are occurring. Taking too high a view can cause performance issues to be masked or incorrectly attributed, if a few unoptimized endpoints are negatively affecting the average performance of your whole app.
Which Performance Metrics to Monitor?
Once you’ve implemented the tooling for traffic replication, you need to identify which performance metrics you’ll actually monitor. Most cloud platforms and APIs can use the following as a starting point:
- Request duration: The average time taken for each request to be served.
- Network latency: The average time taken for a request to pass through your cloud infrastructure to your software applications.
- Response time: The average time taken for your software applications to produce a response to a request.
- Error count: The proportion of requests that resulted in an error over a measured time period, typically expressed as a percentage.
You should extend this list with any custom performance metrics that are meaningful to your app.
After identifying your metrics, you can instrument your app to continually expose their values and collect performance data. Time series databases like Prometheus are a popular solution. This type of tool will establish your app’s baseline performance to provide a reference point as you deploy changes to test environments, replay traffic, and observe how your metrics are affected.
Conclusion
Platform engineering improves DevOps practices in software engineering by producing internal developer platform tools that automate key workflow tasks. As part of these efforts, it’s important to frequently run performance tests that allow you to catch any degradations introduced by new code changes. This article discussed how including traffic replay capabilities in platform engineering allows engineers to quickly bring up new app instances, send realistic traffic to them, and validate that performance has been preserved.
Speedscale is a production traffic replication service that’s ready to integrate with your platforms. It listens to traffic from your Kubernetes apps, automatically detects and mocks dependencies, and runs performance, regression, and chaos tests within your cluster—no external services required.
Try Speedscale for free to simulate real-world traffic conditions and learn your app’s true performance.