One of the key misconceptions in today's testing landscape is that testing in production requires only three skill sets: process expertise, knowledge of automated testing tools, and incident management capabilities. These are important, but conspicuous by its absence—not only in this list but in the minds of hiring managers—is something one would think would be the most fundamental:
The ability to design a practical test for production environments.
It would be comical if it weren't so alarming that almost no organization prioritizes test design expertise when implementing testing in production strategies. Whether a test is executed manually in a controlled environment or automated in a live production system, you must first know how to design one that delivers actionable insights without disrupting user experience, right?
There's one obvious problem with excluding expertise in test design from your testing in production approach:
Isn't it essential to know that the person implementing tests in your production environment understands what constitutes a meaningful test?
After all, automating poorly designed tests in production just gives you unreliable data more quickly and creates unnecessary risk. And being "agile" while testing in production incompetently doesn't strike me as a real improvement, except perhaps to the scrum master.
The importance and discipline of test design, especially for testing in production scenarios, are long overdue for revival. Consider this my contribution to that effort.
What Is Essential for Effective Testing in Production?
The first step toward mastering testing in production is to make a few essential distinctions, ones most QA people will already know intuitively but that are rarely presented systematically in production testing contexts.
Let's explore these fundamental concepts that will transform your testing in production approach.
Explicit vs Implicit Functionality in Production Environments
Let's begin with the critical distinction between explicit and implicit functionality when testing in production.
The former is what most of us think of as "functionality." It refers to features and capabilities that are formally specified by Product and whose specification guides their implementation in Engineering. When testing in production, these explicit functions are typically the focus of monitoring and observability tools.
Because of this, testing explicit functionality in production may seem straightforward. Even with well-defined features, it is not, as we will see later; but at least the specificity of explicit functionality makes it easier to build a scaffolding of production tests around it (or the illusion of that scaffolding).
Implicit functionality in a production environment is a very different animal.
It consists of behaviors and responses to user or environmental inputs that were not formally defined or anticipated. Failure to adequately design tests for this implicit functionality is, by far, the largest source of critical bugs discovered after the product has been deployed (the other major source is inadequate testing in fringe hardware/software/device environments).
In other words:
Testing implicit functionality in production requires considerable ingenuity and imagination. It is the truly creative part of production testing strategies.
Becoming good at testing implicit functionality in production environments requires a certain fiendish cleverness, and sadly, that cannot be taught through standard QA processes.
No amount of agile methodology or automated testing frameworks will teach you how to test implicit functionality in production effectively, but proper test design can encourage it.
So, how do you define a test design strategy for implicit functionality in your production environment? Fortunately, it can be done. But before we dive directly into that question, let's explore a few further relevant distinctions critical for effective testing in production.
Positive vs Negative Testing
Key Takeaway: When testing in production, both positive and negative testing require special considerations that go beyond traditional QA approaches.
Most people working in QA understand the basic difference between these two testing types:
- Positive testing in production: Validating that features work as designed in live environments
- Negative testing in production: Strategically testing how your system responds to unexpected inputs without disrupting real users
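To make the distinction concrete, here's a minimal sketch of a positive/negative pair in Python. The URL, endpoint, and payload are hypothetical placeholders, and in production you would run these with synthetic test accounts rather than real customer data:

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical endpoint, for illustration only

def test_create_order_positive():
    # Positive test: a well-formed request should succeed as specified.
    resp = requests.post(f"{BASE_URL}/orders", json={"sku": "ABC-123", "qty": 1})
    assert resp.status_code == 201

def test_create_order_negative():
    # Negative test: a malformed request should be rejected cleanly,
    # not crash the service or corrupt state for real users.
    resp = requests.post(f"{BASE_URL}/orders", json={"sku": "ABC-123", "qty": -5})
    assert resp.status_code == 400
```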
Why Production Testing Is Different
Though these distinctions are conceptually clear, testing in production introduces unique challenges:
- Higher stakes: Edge cases affect real customers and business operations
- Real-world complexity: User behaviors rarely follow predictable patterns
- Business continuity: Testing must not disrupt normal operations
Beyond the Specification
Specifications rarely provide everything needed for effective testing in production:
Being overly dependent on the spec—whether from Product or Engineering—limits your thinking. This is especially problematic when testing in production where real-world usage patterns often deviate from specifications.
This illustrates what I call "the empirical fallacy": waiting for something to explicitly tell you what to do before you can understand it.
Successful test design for production environments requires:
- Proper functionality parameterization
- Logical thinking about what's possible
- Consideration of both user and environmental interactions
Real-World Example: Unexpected Behaviors
Back in the software Cretaceous (i.e., the 80s), I tested document conversion software and discovered surprising capabilities:
Classic example: In WordPerfect, you could insert a line spacing command mid-paragraph affecting only subsequent lines, creating paragraphs with two different spacing values.
Modern equivalents when testing in production:
- Unexpected user permission combinations in cloud applications
- Unanticipated API request sequences in microservice architectures
- Race conditions in high-concurrency environments
Now let's explore the three key parameters that will sharpen your production testing approach:
1. Feature Scope Testing
Definition: Identifying how users might deploy features in contexts or ways never imagined during design.
Why It Matters for Testing
Not all constraints are foreseen during development. In production environments, this oversight can have serious consequences.
Beyond Simple Validation
When testing in production, validation has to extend beyond the feature's intended boundaries:
Remember: Validating a feature in production isn't just confirming it works as designed; it's also verifying it cannot be used in unintended ways that might impact system stability or security.
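For instance, a feature-scope check might verify that a working feature refuses to be weaponized against the system itself. A minimal sketch, again with a hypothetical endpoint:

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical; adapt to your own system

def test_export_rejects_unbounded_request():
    # The export feature works as designed for normal ranges, but it should
    # also refuse a request large enough to act as a self-inflicted
    # denial of service.
    resp = requests.get(f"{BASE_URL}/reports/export", params={"rows": 10_000_000})
    assert resp.status_code in (400, 413), "unbounded exports should be refused"
```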
2. Workflow Interruption Testing in Live Systems
Warning: This approach assumes you're already testing defined workflows in production. If not, address those fundamentals first!
Beyond the "Happy Path"
When testing in production, explore these critical workflow disruptions:
✅ Interrupted workflows: Process started but never completed
✅ Canceled operations: User explicitly cancels mid-process
✅ Restarted processes: User attempts the same action multiple times
✅ Backtracking: User returns to earlier steps with different inputs
Implementation Tips for Production Testing
Use techniques like feature flags to safely control the exposure of these tests in production environments.
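As a sketch of what that gating might look like, here the disruptive scenario only runs for accounts enrolled in a test flag. The flag client and the helper functions are hypothetical placeholders for your own feature management SDK and fixtures:

```python
from my_flags import flag_client  # hypothetical client; substitute your SDK

def run_interrupted_checkout_test(user):
    # Only exercise the disruptive scenario for synthetic test accounts
    # enrolled in the flag, so real users never see it.
    if not flag_client.is_enabled("workflow-interruption-tests", user=user):
        return
    session = start_checkout(user)  # hypothetical helper: begin the workflow...
    session.abandon()               # ...then deliberately never complete it
    # hypothetical helper: verify no partial state was left behind
    assert no_orphaned_records(session), "abandoned checkout left partial state"
```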
Real-World Applications
Pro tip: These scenarios are rarely captured in specifications but represent real user behavior in production.
3. Sequentiality in Production Testing
The Challenge: In production systems with multiple concurrent processes, the sequence of operations can lead to unexpected failures that are difficult to reproduce in test environments.
Two Critical Sequentiality Patterns
Antecedent Contributing Conditions
- Definition: Events that must happen before a failing process
- Characteristics: May only manifest after specific sequences that take days to naturally occur
- Example in production: User permissions modified → cache expires → specific API called
Subsequent Contributing Conditions
- Definition: Failures that only become apparent through later interactions
- Manifestation: The failure remains masked until a later operation
- Example in production: Data corruption occurs silently until a report is generated
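In test form, the trick is to force the antecedent sequence deliberately instead of waiting days for it to occur naturally, and to follow a write with the later operation that would unmask it. A sketch, with all fixtures and helpers hypothetical:

```python
def test_permission_change_then_cache_expiry(user, cache, api):
    # Antecedent contributing conditions, forced in order:
    revoke_permission(user, "reports:read")        # 1. permissions modified
    cache.invalidate(f"perms:{user.id}")           # 2. cache expiry, forced
    resp = api.get("/reports/summary", user=user)  # 3. the specific API call
    # A stale-cache bug would return 200 here despite the revocation.
    assert resp.status_code == 403

def test_corruption_surfaces_in_later_report(api):
    # Subsequent contributing condition: the write appears to succeed;
    # the failure only becomes visible when the report consumes the data.
    api.post("/ledger/entries", json={"amount": "0.10", "currency": "EUR"})
    report = api.get("/ledger/monthly-report")
    assert report.json()["totals"]["balanced"], "corruption masked until report time"
```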
Why Is This So Challenging?
Defining the scope of sequentiality testing in production requires:
- Intimate knowledge of system interactions
- State transition modeling for complex processes
- Understanding of both intended and unintended state combinations
"Like those dual line spacing values in a single paragraph, these unexpected states can wreak havoc in production environments."
Essential Tools for Sequentiality Testing in Production
- Chaos engineering to deliberately introduce controlled failures
- Observability tools to track state changes across systems
- Distributed tracing to follow request paths through microservices
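For the tracing piece, a minimal OpenTelemetry sketch (assuming the SDK is already configured to export somewhere; the helpers are the same hypothetical ones as above) wraps each step of the suspect sequence in a span so the antecedent chain shows up in your tracing backend:

```python
from opentelemetry import trace

tracer = trace.get_tracer("sequentiality-tests")

def traced_sequence(user, cache, api):
    # Each antecedent step becomes a span, so when the final call fails
    # you can see exactly which sequence preceded it.
    with tracer.start_as_current_span("permissions-modified"):
        revoke_permission(user, "reports:read")  # hypothetical helper
    with tracer.start_as_current_span("cache-invalidated"):
        cache.invalidate(f"perms:{user.id}")
    with tracer.start_as_current_span("reports-api-call"):
        return api.get("/reports/summary", user=user)
```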
Bottom line: Sequentiality issues represent one of the richest sources of serious bugs in modern production systems. Prioritize this aspect in your testing strategy.
Feature Management Platforms in Production Testing
Feature management platforms have transformed testing in production from a risky endeavor into a controlled, systematic process. Yet many organizations fail to leverage these tools effectively within their test design strategy.
The Evolution of Risk Management in Production
Traditional approaches to testing in production involved binary decisions—either a feature was live for everyone or for no one. Feature management platforms fundamentally change this paradigm:
- Granular control: Test features with specific user segments rather than all-or-nothing deployments
- Instant remediation: Disable problematic features without code deployments or rollbacks
- Progressive exposure: Gradually increase user exposure based on real-time performance data
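Under the hood, progressive exposure usually comes down to deterministic bucketing: hash each user into a stable bucket so the same user always gets the same answer while you dial the percentage up. A self-contained sketch of the idea (your platform does this for you; the flag key and percentage here are illustrative):

```python
import hashlib

def in_rollout(flag_key: str, user_id: str, percentage: float) -> bool:
    # Hash user+flag into a stable bucket in [0, 100) so exposure is
    # deterministic per user and can be dialed up without reshuffling.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000 / 100
    return bucket < percentage

# Expose the new checkout to 5% of users; raise the number as metrics allow.
if in_rollout("new-checkout", user_id="u-42", percentage=5.0):
    pass  # serve the new code path
```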
Beyond Simple Feature Flags
While basic feature flagging has existed for years, modern feature management platforms provide critical capabilities that transform production testing. When integrated properly into your test design strategy, they enable:
- Targeted risk containment - Limit exposure of new features to specific user segments
- Real-user validation - Test with actual users rather than synthetic test data
- Immediate remediation - Address issues without emergency deploys or rollbacks
Integration with Observability Tools
The true power of feature management platforms emerges when they're integrated with observability and monitoring systems:
Traditional Approach:
- Detect issues after full deployment
- Manual verification of test results
- Post-mortem analysis after incidents
Feature Management Integration:
- Correlate feature activation with system metrics
- Automated performance monitoring per feature flag
- Real-time insight into feature impacts
Real-World Impact: Transforming Production Testing
Organizations implementing feature management platforms report significant benefits:
- Reduced incident severity: "We can test features in production well before a marketing launch. And if a feature causes problems on the day of the launch, we can just turn it off with a kill switch—no rollbacks." — Chris Guidry, VP of Engineering, O'Reilly Media.
- Accelerated release cycles: Teams at IBM, TrueCar, and other leading companies leverage feature management for testing in production, enjoying what they describe as "safe, unceremonious releases."
- Enhanced test coverage: Feature flags expose edge cases that are impossible to simulate in controlled environments.
Practical Implementation in Test Design
To effectively incorporate feature management platforms into your test design strategy:
- Design verification points that align with feature flag transitions
- Create fallback scenarios for each feature-flagged component
- Establish performance thresholds that trigger automatic feature disablement (see the sketch after this list)
- Define segment-specific test cases to validate behavior across user populations
- Document flag dependencies to prevent cascading failures
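The third point deserves a sketch. With a hypothetical metrics client and flag client (substitute your own monitoring stack and feature management SDK), an automatic kill switch can be as simple as:

```python
ERROR_RATE_THRESHOLD = 0.02  # 2%: an illustrative number, not a recommendation

def enforce_kill_switch(metrics_client, flag_client):
    # Compare the error rate of the flagged code path against the agreed
    # threshold and disable the flag automatically, rather than waiting
    # for a human to notice the pager.
    rate = metrics_client.error_rate(tag="flag:new-checkout", window="5m")
    if rate > ERROR_RATE_THRESHOLD:
        flag_client.disable("new-checkout")
        notify_on_call(f"new-checkout auto-disabled at {rate:.1%} error rate")  # hypothetical helper
```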
Feature management platforms don't replace thoughtful test design—they amplify it. The quality of your production testing still depends fundamentally on how well you've designed your tests.
Common Pitfalls to Avoid
Even with robust feature management, several challenges remain:
- Flag debt - Abandoned or forgotten flags creating technical debt
- Insufficient monitoring - Failing to correlate flag state with system performance
- Overconfidence - Reducing pre-production testing prematurely
- Flag interdependencies - Creating complex, difficult-to-troubleshoot relationships
The Bottom Line
Feature management platforms provide powerful capabilities for testing in production, but they must be thoughtfully integrated into your overall test design strategy. The organizations gaining the most value don't just deploy these tools—they fundamentally rethink how test design works in a feature-flagged environment.
When properly implemented as part of a comprehensive test design approach, feature management transforms testing in production from a necessary evil into a competitive advantage.
Tools for Testing in Production: Implementation Challenges
While feature flags and management platforms form the foundation of testing in production, the broader tooling ecosystem presents its own set of implementation challenges. Selecting and configuring the right tools requires careful consideration of team capabilities, system architecture, and organizational needs.
Monitoring and Observability Tools
Effective testing in production depends on comprehensive visibility into system behavior:
- Application Performance Monitoring (APM)
Benefits: Provides detailed performance insights across the application stack.
Challenges: Often requires significant instrumentation and can generate excessive data that overwhelms smaller teams.
- Log Management Systems
Benefits: Essential for debugging and forensic analysis during production testing.
Challenges: Can generate overwhelming volumes of data without proper filtering and indexing strategies.
- Distributed Tracing
Benefits: Offers end-to-end visibility across microservice architectures.
Challenges: Implementation complexity increases dramatically in heterogeneous environments with multiple technology stacks.
Alert Management Systems
Proper alerting is critical when testing new features in production:
- Alert Aggregation Platforms
Benefits: Consolidate notifications from multiple monitoring systems.
Challenges: Great for incident management but can be disruptive if not carefully configured to avoid alert fatigue.
- Incident Response Systems
Benefits: Streamline communication during production incidents.
Challenges: Require well-defined runbooks and integration points to be effective; can create overhead for simple deployments.
Implementation Considerations
When implementing tools for testing in production, teams should evaluate:
- Scalability requirements - Will the tool handle your traffic volumes and data growth?
- Integration capabilities - How easily does it connect with your existing toolchain?
- Resource consumption - What overhead does the tool itself introduce?
- Team expertise - Do you have the skills needed to maximize the tool's value?
- Signal-to-noise ratio - Can you extract meaningful insights without drowning in data?
Common Implementation Pitfalls
- Tool sprawl - Accumulating too many overlapping solutions
- Incomplete instrumentation - Missing critical monitoring points
- Alert fatigue - Generating excessive notifications that teams eventually ignore
- Insufficient context - Failing to correlate metrics with user impact
- Data silos - Creating isolated monitoring systems that don't share information
Finding the Right Balance
The most successful testing in production implementations achieve balance between:
- Comprehensive monitoring vs. manageable complexity
- Detailed insights vs. information overload
- Automated responses vs. human decision points
When implementing testing tools in production environments, start small with focused objectives, then gradually expand coverage as your team develops expertise in both the tools and in interpreting the resulting data.
Load, Complexity, Latency
Consider the system macro factors of load, complexity, and latency in your test design. These may affect the execution or completion of a request, process, or event.
The relevance of system load should be obvious: requests or transactions processed during periods of high system load may fail at any stage in the transaction process.
This should be a default part of your test planning. As we all know, load can also be generated as a result of the request itself—i.e., a data query that triggers the processing and transmission of vast amounts of data.
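A deliberately simple load probe can make the point, though it is no substitute for a real load-testing tool, and you should aim it at a canary slice or coordinate with whoever owns the pager. The endpoint here is hypothetical:

```python
import concurrent.futures
import requests

URL = "https://api.example.com/search?q=*"  # hypothetical query that fans out widely

def fire(_):
    return requests.get(URL, timeout=10).status_code

# 500 requests across 50 workers: enough concurrency to surface
# stage-level failures without writing a full load model.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    codes = list(pool.map(fire, range(500)))

failures = [c for c in codes if c >= 500]
print(f"{len(failures)} of {len(codes)} requests failed under load")
```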
Complexity refers here primarily to the complexity of requests asserted against a system. This complexity may consist of the number of conditions specified (and their exclusions and exceptions), the number of databases (virtual or otherwise) implicated in the requests, or the processing topology of the system itself.
In the context of this discussion, I use latency to indicate the introduction of time lapses in the request process, which is not its usual meaning. I am referring here to user latency, not response latency from the system itself.
In other words, how does the feature or capability behave if the user comes to a specific step in the process and then stays paused there, doing nothing? Does the system time out (it probably should)? Should it prompt the user? Should it remain in that state until the end of time?
The answers to these questions may be provided in the specification, but the product under test may not behave that way, which is why we test to begin with.
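A user-latency test can be as blunt as starting a flow and walking away. In practice you would shorten the configured idle timeout for the test rather than actually sleeping half an hour; the endpoints and the expected status code are assumptions:

```python
import time
import requests

BASE_URL = "https://api.example.com"  # hypothetical

def test_session_expires_when_user_stalls():
    session = requests.Session()
    session.post(f"{BASE_URL}/checkout/start", json={"cart": "c-1"})
    time.sleep(60 * 31)  # exceed an assumed 30-minute idle timeout
    resp = session.post(f"{BASE_URL}/checkout/confirm")
    # The spec may promise a timeout; verify the product actually honors it.
    assert resp.status_code == 401, "stalled session should have expired"
```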
User Roles
Of course, end users can interact with software systems in various roles. The same user can interact with the system in different roles, depending on their actions.
However, processes and services can also have different roles and associated privileges. In either case, be sure your test design and planning, whether for features or capabilities, considers and exercises all possible role states.
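A parameterized test is the cheapest way to guarantee every role state gets exercised. A pytest sketch, where the roles, the endpoint, and the api_as fixture are all placeholders for your own system:

```python
import pytest

ROLES = ["viewer", "editor", "admin", "service-account"]  # adapt to your system

@pytest.mark.parametrize("role", ROLES)
def test_report_deletion_by_role(role, api_as):
    # api_as is a hypothetical fixture returning a client authenticated
    # in the given role; the point is to exercise every role state,
    # not just the one the spec happened to mention.
    resp = api_as(role).delete("/reports/42")
    expected = 204 if role == "admin" else 403
    assert resp.status_code == expected
```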
What's Next?
Looking for more test design tips? And want to boost your SaaS growth and leadership skills?
Subscribe to our newsletter for the latest insights from CTOs and aspiring tech leaders.
We'll help you scale smarter and lead stronger with guides, resources, and strategies from top experts!