How to Create Safe Testing Environments: Using Feature Flags for Controlled Testing

Testing is a critical phase in software development, ensuring that new features work as intended and existing functionalities remain unaffected. This blog post explores the concept of safe testing environments and how feature flags can be used for controlled testing outside traditional environments.

Understanding Testing Environments

Test environments are specialized settings where software, applications, or systems are tested to ensure they function correctly under various conditions. These environments are crucial for identifying bugs, verifying functionality, and ensuring that the software meets the necessary quality standards. Here are some common types of test environments:

  1. Development Environment.
  2. Testing/QA (Quality Assurance) Environment.
  3. UAT (User Acceptance Testing) Environment.
  4. Staging Environment.
  5. Production Environment.

Development Environment

A Development Environment refers to the setup or configuration where software developers create and modify software applications. It's an essential aspect of the software development process, providing developers with the tools and resources needed to write, test, and debug code. A development environment is typically tailored to the needs of a project and the preferences of the developers. It can be a local setup on a developer's computer or a cloud-based environment. The key is to provide a space where developers can work on the software in isolation from the production environment, reducing the risk of unintended impacts on live systems.

Testing/QA (Quality Assurance) Environment

The Testing/QA (Quality Assurance) Environment is a specialized setup in the software development process, dedicated to rigorously testing software applications to identify and fix bugs, verify functionality, and ensure compliance with specified requirements. This environment is critical for maintaining the quality and reliability of software products.

Ideally, the Testing/QA environment closely resembles the production environment in terms of hardware, software, configurations, and data. This similarity helps in identifying issues that might occur in the live environment. However, it's isolated from actual users and production data to prevent any impacts on real-world operations. Sometimes the environment needs to simulate data sent by a third party to complete the test.
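
One common way to simulate third-party data in a Testing/QA environment is to swap the real integration for a stub via configuration or dependency injection. The sketch below only illustrates that pattern; IPaymentProviderClient and FakePaymentProviderClient are hypothetical names, and the "QA" environment name is an assumption:

public record PaymentStatus(string PaymentId, bool Settled);

// Hypothetical third-party integration used by the application.
public interface IPaymentProviderClient
{
  Task<PaymentStatus> GetStatusAsync(string paymentId);
}

// Stub used only in the Testing/QA environment: returns canned data
// instead of calling the real third-party service.
public class FakePaymentProviderClient : IPaymentProviderClient
{
  public Task<PaymentStatus> GetStatusAsync(string paymentId) =>
    Task.FromResult(new PaymentStatus(paymentId, Settled: true));
}

// In Program.cs: register the stub only in the QA environment;
// other environments register the real third-party client instead.
if (builder.Environment.IsEnvironment("QA"))
{
  builder.Services.AddSingleton<IPaymentProviderClient, FakePaymentProviderClient>();
}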

UAT Environment

A UAT environment, or User Acceptance Testing environment, is a stage in the software development process where the end users of a software application test it in an environment that simulates the real-world or production environment. This is done to verify and validate that the software meets the business requirements and works as expected for the end users.

Staging Environment

The primary purpose of the staging environment is to simulate the production environment as closely as possible. This allows for the detection of any remaining issues that might not have been identified in earlier testing phases. The staging environment mirrors the production environment in terms of hardware, software, configuration, and often data. This includes the same operating system, database systems, network configuration, and any other relevant software components.

The difference from the UAT environment is that the scope here includes all technical aspects of the system. Testing is comprehensive, covering areas such as performance, security, and load and stress testing.

Before feature flags became popular, the staging environment was often viewed as a final checkpoint to catch any issues that might affect the user experience or functionality in the live environment.

Production Environment

The Production Environment in software development is the final setting where the software application is actually deployed and made available for use by end users. It's the environment where the software performs its intended tasks in the real world. Unlike development, staging, or testing environments, the production environment contains live data and interacts with real users. It is where the software delivers the value it was created for. This environment is optimized for stability, reliability, and performance. It must be robust enough to handle expected and unexpected user loads and workflows.

In recent years, the concept of "testing in production" has been brought into the development lifecycle and has become one of the must-have steps in the overall testing process. It adds a final checkpoint to catch issues that might affect user experience or functionality in the live environment.

Testing in Production

Testing in Production refers to the process of testing new features, updates, or changes directly in the live environment where the end-users interact with the application. This approach acknowledges that no pre-production environment (like development or staging) can perfectly replicate the complexities and unpredictability of the production environment. Here are some key aspects of Testing in Production:

  1. Real-World Feedback: Testing in production offers immediate feedback on how changes perform under actual usage conditions with real users and data.
  2. Identifying Real Issues: It helps in identifying issues that may not surface in controlled testing environments, such as specific user behavior, real-world load scenarios, and interactions with other systems.
  3. Feature Flags: Often implemented using feature flags, which allow developers to enable or disable features without deploying new code. This provides a way to test new features on a subset of users.
  4. Canary Releases: Gradually rolling out changes to a small subset of users to gauge the impact and performance before a full rollout.
  5. Monitoring and Observability: Crucial for identifying issues as they occur in real-time. Tools for logging, performance monitoring, and user feedback are essential.
  6. Fallback Mechanisms: Ability to quickly revert changes if they cause issues, to minimize the impact on user experience.
  7. Dark Launching: Releasing features to production without exposing them to users, allowing teams to test the back-end aspects in the live environment.
  8. A/B Testing: Comparing two versions of a feature to see which performs better in the live environment.

Testing in Production is a valuable strategy, especially in fast-paced or continuously evolving environments. It provides real-world insights that are often impossible to replicate in pre-production environments, helping to create more resilient and user-friendly applications. However, it must be executed with careful planning and robust monitoring to minimize risks to the user experience and system stability.
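
To make an item like dark launching (7) more concrete, here is a minimal C# sketch of the idea, using hypothetical engine, type, and field names: the new implementation runs in production alongside the old one, but its result is only logged and compared, and users always receive the old implementation's result:

public async Task<PriceQuote> GetQuoteAsync(Order order)
{
  var oldResult = await _oldPricingEngine.QuoteAsync(order);

  // Fire-and-forget the dark-launched engine so it can never slow down
  // or break the user-facing request.
  _ = Task.Run(async () =>
  {
    try
    {
      var newResult = await _newPricingEngine.QuoteAsync(order);
      _logger.LogInformation("Dark launch: old={Old} new={New} match={Match}",
        oldResult.Total, newResult.Total, oldResult.Total == newResult.Total);
    }
    catch (Exception ex)
    {
      _logger.LogWarning(ex, "Dark-launched pricing engine failed");
    }
  });

  // Users still get the proven, old result.
  return oldResult;
}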

Using Feature Flags for Controlled Testing in Production

What is a Feature Flag?

A feature flag, or feature toggle, is a technique allowing developers to enable or disable features without deploying new code. This method works across different environments, including development, staging, and production. Feature flags offer a unique advantage in production environments, enabling real-time feature control without redeploying or restarting applications. Using feature flags for controlled testing in production, rather than traditional controlled environments, provides several benefits:

  • Real User Interaction: Test how real users interact with new features.
  • Risk Mitigation: Gradually roll out features to mitigate potential risks.
  • Accurate Feedback: Collect feedback under real-world conditions.

Real-World Example of Feature Flag Usage:

Consider an e-commerce company that wants to introduce a new recommendation engine (or algorithm) to enhance the user experience by suggesting products based on their browsing history and purchase patterns. This change won't involve modifying the front-end code, API gateway, or endpoints. What engineers often do is:

  1. Create a new engine (or algorithm) in an isolated code file, class, or even as a new microservice.
  2. Change the internal code function behind the API interface to call the new recommendation engine.
  3. Deploy the updated version of the backend service.

However, what if the new algorithm doesn't perform well or causes critical errors when interacting with real-world business data? A prudent approach is to roll out the new recommendation engine progressively in production.

Here are some straightforward steps:

  1. Locate the code where the old recommendation engine is called. It might look something like this:
public async Task<RecommendResult> CallRecommendationEngine(Parameters parameters){
  // some code
  return await RecommendationEngineS2S(parameters);
}

"S2S" is the name of the old algorithm,o we call the function RecommendationEngineS2S.

  2. Prepare a feature flag, which is essentially a variable. Use this feature flag variable to create an if/else condition with your old recommendation engine code. It might look like this:
public async Task<RecommendResult> CallRecommendationEngine(Parameters parameters){
  // some code
  var newRecommendationFlag = _featureFlags.StringVariation("recommendation-engines");
  if(newRecommendationFlag == "S2S"){
    return await RecommendationEngineS2S(parameters);
  }
  // keep the old engine as the default until the "Davinci" branch is added in the next step
  return await RecommendationEngineS2S(parameters);
}
  3. Write a function that calls the new recommendation engine and name it RecommendationEngineDavinci, where Davinci is the name of the new engine. Then add the call to the new recommendation engine in the else branch:
public async Task<RecommendResult> CallRecommendationEngine(Parameters parameters){
  // some code
  var recommendationAlgoFlag = _featureFlags.StringVariation("recommendation-engines");
  if(recommendationAlgoFlag == "S2S"){
    return await RecommendationEngineS2S(parameters);
  }
  else if(recommendationAlgoFlag == "Davinci"){
    return await RecommendationEngineDavinci(parameters);
  }
  // fall back to the old engine for any unexpected flag value
  return await RecommendationEngineS2S(parameters);
}
  4. At runtime, you need to control the return value of _featureFlags.StringVariation("recommendation-engines"). If you're using a third-party feature flag tool, this is straightforward: _featureFlags.StringVariation is a method exposed by the tool's SDK, which returns the appropriate value based on your configuration in the feature flag management system (through a UI).
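
As a rough illustration of what that wiring can look like in a .NET service, here is a minimal sketch. IFeatureFlagService is a hypothetical stand-in for whatever SDK client you use, and the extra defaultValue parameter is an assumption (a common safety net so the old engine keeps serving traffic if the flag cannot be evaluated); the exact initialization and evaluation calls will differ per vendor:

// Hypothetical wrapper around a feature flag SDK; real SDKs expose an
// equivalent "string variation" call that evaluates a flag for a user.
public interface IFeatureFlagService
{
  string StringVariation(string flagKey, string defaultValue);
}

public class RecommendationService
{
  private readonly IFeatureFlagService _featureFlags;

  // Injected so the flag is re-evaluated on every request, reflecting the
  // latest configuration in the feature flag management UI.
  public RecommendationService(IFeatureFlagService featureFlags)
  {
    _featureFlags = featureFlags;
  }

  public async Task<RecommendResult> CallRecommendationEngine(Parameters parameters)
  {
    var algo = _featureFlags.StringVariation("recommendation-engines", defaultValue: "S2S");

    return algo == "Davinci"
      ? await RecommendationEngineDavinci(parameters)
      : await RecommendationEngineS2S(parameters);
  }

  // Placeholders for the engine calls shown in the steps above.
  private Task<RecommendResult> RecommendationEngineS2S(Parameters p)
    => throw new NotImplementedException();   // old engine call goes here

  private Task<RecommendResult> RecommendationEngineDavinci(Parameters p)
    => throw new NotImplementedException();   // new engine call goes here
}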

Controlling Testing in Production with Mature Feature Flag Tools

You are already familiar with what a feature flag is and understand its basic usage in real-world scenarios. Now, let's explore how to use a mature feature flag management tool to control feature release and testing more flexibly. For this demonstration, the screenshots are from FeatBit, an open-source feature flag tool, to illustrate methods for controlling tests. (Here is a list of open-source feature flag management tools that might be of assistance to you.)

Decouple deployment from release of a new feature

When testing in production, it's crucial to differentiate between 'deployment' and 'release'. Deployment refers to the process where a new binary, package, or image is transferred to a machine, container, or device; the newly deployed version is then running in production. At this stage, new features are included in the binary but are not yet active. Release, on the other hand, occurs when these new features are activated and made available to the public.

This concept can be easily understood with the real-world example code (simplified and repeated below). In this instance, the 'Davinci' algorithm within our code is executed only if the return value of _featureFlags.StringVariation("recommendation-engines") equals 'Davinci'. This approach is what we refer to as 'Decoupling deployment from release'.

var recommendationAlgoFlag = _featureFlags.StringVariation("recommendation-engines");
if(recommendationAlgoFlag == "Davinci"){
    RecommendationEngineDavinci(parameters);
}

But how do we control the return value of a feature flag? Here's an example using FeatBit's UI configuration. If the feature flag 'recommendation-engines' is turned off, it returns 'S2S' (as configured on the left side of the configuration panel). If the flag is on and the user is in the QA group, it returns 'Davinci'; otherwise, it defaults to 'S2S' (as configured on the right side of the panel).

Using percentage rollout to release the feature progressively

Using a percentage rollout allows for the progressive release of a feature, with the option for an immediate rollback if any issues or errors arise. This strategy helps to minimize potential risks. The figure below demonstrates how to use FeatBit's feature flag UI panel for a controlled, phased release. To do this, you simply adjust the percentage for the return value of 'Davinci'. For example, setting it to 20% for the 'Davinci' recommendation engine means that 20% of requests, users, queries, etc. will use the Davinci engine for recommendations (in code, the feature flag returns 'Davinci'), while the remaining 80% will default to the S2S engine (the feature flag returns 'S2S').
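
Under the hood, percentage rollouts are typically deterministic: a stable key such as the user ID is hashed into a bucket, so the same user keeps getting the same variation across requests. The sketch below only illustrates that general idea; it is not FeatBit's actual bucketing algorithm:

using System;
using System.Security.Cryptography;
using System.Text;

public static class PercentageRollout
{
  // Map a stable user key to a bucket in [0, 100) and compare it with the
  // configured rollout percentage. Illustrative only -- real SDKs use their
  // own hashing and salting scheme.
  public static string Variation(string userKey, int davinciPercentage)
  {
    var hash = MD5.HashData(Encoding.UTF8.GetBytes("recommendation-engines:" + userKey));
    var bucket = (int)(BitConverter.ToUInt32(hash, 0) % 100);
    return bucket < davinciPercentage ? "Davinci" : "S2S";
  }
}

// Example: roughly 20% of users get "Davinci", the rest get "S2S".
// var algo = PercentageRollout.Variation(user.Id, davinciPercentage: 20);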

For more precise control over the release process, especially when targeting specific user groups, you can use 'Targeting rules'. This feature lets you customize your targets, ensuring that your traffic splitting strategy impacts only those specific groups. An example is shown in the figure below, where the configuration is set to release the new Davinci recommendation engine to just 20% of users in California.

This principle is versatile and can be applied to various programming scenarios, including database migration. For more detailed information and examples, please refer to my related article on this topic.

Release features to specific team members

Your boss might express a desire to participate in testing a new feature. In this case, you can add them as an individual user who will have access to the feature. This can be easily configured, as demonstrated in the figure below, by adding their user account to the "Individual targeting" list.

Use reusable segmentation

You might have a QA (Quality Assurance) group, as many features need to be tested in production before being released to the public. Often, the QA testers are the same individuals, such as members of the QA team. Therefore, you can create a user segment called 'QA Group'. In this segment, you can add individual members or define targeting rules to determine the group's composition. It might look something like this:

Then, use this reusable segment in the customized rule section of your feature flag to quickly configure the traffic splitting strategy.
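
For segment and targeting rules to match, the user you pass to the SDK at evaluation time needs the relevant attributes (for example a group or a location). The exact shape of that call depends on the SDK you use; the sketch below uses a hypothetical FlagUser type and StringVariation overload purely to show the idea:

// Hypothetical evaluation context. Attributes like "group" and "state" are
// what rules such as "QA Group" or "20% of users in California" match against.
var user = new FlagUser
{
  Key = "user-42",   // stable identifier for the user
  Attributes = new Dictionary<string, string>
  {
    ["group"] = "QA",
    ["state"] = "California"
  }
};

var algo = _featureFlags.StringVariation("recommendation-engines", user, defaultValue: "S2S");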

Schedule the feature release time

Schedule the feature release time to control the testing process, particularly when your target users are in different time zones. The figure below demonstrates how you can release a new feature to 5% of the public users in the Western U.S., ensuring that your team, also located in California, can respond to any issues promptly during business hours.

Robust Monitoring and Workflow to Minimize Risks

During a release, it's crucial to roll back a feature as soon as an error is detected. We often integrate our feature flag usage and change events with an intelligent observability tool (such as DataDog or New Relic One). This integration allows us to detect problems quickly and trigger a rollback when an alert occurs.

  • Feature Usage Events typically occur in front-end applications. When a user engages with a new feature, a usage event is sent to the Real User Monitoring (RUM) module of the observability tool.
  • Feature Flags Change Events are changes made by your teams (developers, product managers, marketing, etc.). These events are sent to Application Performance Monitoring (APM) as deployment or activity events. An intelligent APM observability tool may then trigger a callback action to switch off (or roll back) the problematic feature.

Here’s an example of how FeatBit integrates with New Relic One APM. FeatBit offers the following integration methods with New Relic One:

  1. FeatBit sends feature flag change events to New Relic as deployment events, which can be used for change tracking.
  2. FeatBit uses a trigger (or API) to toggle a flag's targeting on or off based on certain performance metrics thresholds.

For example, the figure above illustrates:

  1. A noticeable increase in the error rate of web transactions after a feature flag was turned on.
  2. A critical issue is identified.
  3. Subsequently, the feature flag is turned off automatically in response to the triggered alert.
  4. The response time peak and error rates return to normal.
  5. The critical issue is successfully resolved.
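
As a rough sketch of the automatic switch-off in step 3: the APM alert can call a webhook in your service, which then posts to a flag trigger URL to turn the flag's targeting off. The route and trigger URL below are placeholders; the real trigger URL is generated in the FeatBit UI, and your observability tool's alert payload will differ:

// Minimal ASP.NET Core webhook endpoint called by the APM alert.
// Posting to the trigger URL switches the flag's targeting off.
app.MapPost("/alerts/recommendation-engine", async () =>
{
  using var http = new HttpClient();

  // Placeholder URL -- copy the real trigger URL from the FeatBit UI.
  var response = await http.PostAsync(
    "https://featbit.example.com/api/triggers/<trigger-key>", content: null);

  response.EnsureSuccessStatusCode();
  return Results.Ok("Feature flag rollback triggered");
});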

Conclusion

This blog explores the critical role of various testing environments in software development, emphasizing the innovative use of feature flags. We cover the essential environments: Development, Testing/QA, UAT, Staging, and Production, each pivotal for ensuring software functionality and quality. The focus then shifts to the production environment, highlighting "testing in production" as a valuable strategy for capturing real-world feedback and identifying issues that may not surface in controlled settings. Feature flags emerge as a crucial tool, enabling developers to test new features in the live environment without releasing them to all users. This method provides flexibility, minimizes risks, and allows for real user interaction analysis.

We demonstrate this with a practical e-commerce example, showcasing how feature flags can introduce new features safely. The integration of feature flags with observability tools like DataDog or New Relic One ensures quick error detection and efficient rollback, thereby enhancing the overall stability and user experience. In conclusion, the blog underscores the importance of controlled testing and feature flags in delivering high-quality, resilient software.