The Problem
Develop a validation test that is credible, persuasive, powerful, and manageable.
Scope: Faults of Interest
This testing technique is best used to detect errors in the
requirements analysis (as reflected in the running program), in the
implementation of something that was intended (a failure to deliver an
intended benefit or intended restriction), or in the interaction of two
or more features.
Simpler faults (e.g., a single feature doesn't work, a simple
benefit is simply missing or erroneously implemented) are best
discovered with other techniques, such as domain-based testing,
specification-based testing, or function testing.
Broader Context
Early in testing, relatively simple tests will fail a program. For
example, the program might fail in response to an input that is too
large, a delay that is too long, a typist who is too fast, etc. Random
input tests (such as dumb monkeys) might fail the program by repeatedly
triggering memory-leaking code or wild pointers. Mechanically derived
combinations of inputs or configurations (such as combinations derived
from all-pairs test design) might be challenging for the program.
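(For illustration, here is a minimal sketch of such a "dumb monkey" in Python. The handle_input function is a hypothetical stand-in for the program under test; the monkey knows nothing about the domain and notices only outright crashes.)

    import random
    import string

    def handle_input(text: str) -> None:
        # Hypothetical stand-in for the program's input handler.
        ...

    def dumb_monkey(iterations: int = 100_000, max_len: int = 10_000, seed: int = 1) -> None:
        # Throw random input at the program and report anything that blows up.
        # This notices only obvious failures (unhandled exceptions); it knows
        # nothing about whether the program's answers are right.
        rng = random.Random(seed)  # seeded so any failure can be replayed
        alphabet = string.printable
        for i in range(iterations):
            junk = "".join(rng.choices(alphabet, k=rng.randint(0, max_len)))
            try:
                handle_input(junk)
            except Exception as exc:  # any crash is a finding at this stage
                print(f"iteration {i}: input of {len(junk)} chars raised {exc!r}")

    if __name__ == "__main__":
        dumb_monkey()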
Eventually, the program can withstand the tests that are easy to
imagine, implement, and run. At this point, we can start asking whether
the program is any good (has value) rather than whether it is obviously
bad.
The objective of this type of testing is to prove that the program
will fail when asked to do real work (significant tasks) by an
experienced user. A failure at this level is a validation failure (a
failure to meet the stated or implicit program requirements).
Immediate Context
"All" of the features have been tested in isolation. (More
precisely, all of the features that will be called within this scenario
have been tested on their own and as far as we can tell, none of them
has an error that will block this scenario test.)
The tester must have sufficient knowledge of the domain (e.g., accounting, if this is an accounting program) and of many of the ways in which skilled users will use the program.
Forces / Challenges
The fundamental challenge of all software testing is the time
tradeoff. There is never enough time to do all of the testing, test
planning, test documentation, test result reporting, and other
test-related work that you rationally want to do. Any minute you spend
on one task is a minute that cannot be spent on the other tasks. Once a
program has become reasonably stable, you have the potential to put it
through complex, challenging tests. It can take a lot of time to learn
enough about the customers, the environment, the risks, the subject
matter of the program, etc. in order to write truly challenging and
informative tests.
There is often much time pressure to develop "realistic" tests
quickly, without paying the time cost needed for the tester to become
an expert. For example, the tester is encouraged to run genuine
customer data through the system (but without developing a ready
mechanism for determining whether the results are consistent with the
input data), to bring normal humans in for a "bug bash" or for
acceptance testing, or to rely on beta testing for realistic usage.
These are (or appear to be) cheap and easy ways to get complex test
cases that are meaningful to the user community. However, these tests may not be very powerful (capable of revealing a defect if it is there). They are likely to miss the error handling of the software and of the associated system, to miss important hardware/software configurations, and to miss important timing issues; and they may make it hard for the tester to tell whether the program has passed or failed the tests. Poorly designed scenario tests are widespread and
ineffective. It is demoralizing to develop a complex test, watch the
program fail it, bring the result back to the development team, and
have that result rejected or deferred because "no one would do that" or
"that's too extreme" (or some other convenient excuse that dismisses
the test as uninteresting).
Because complex tests are expensive, you can't develop very many of
them. What level of coverage will you get from these tests? What level
should you expect?
Solution: The Scenario Test
The ideal scenario test has four attributes:
1. The test is realistic (and therefore credible).
You know that this is something that a real user would attempt. You
might know this from use case analysis, focus groups, monitoring of
actual use of the program over time, discussions with experienced
customers, or from other models or sources of empirical evidence.
2. The test is complex.
It combines two or more features (or inputs or attributes—two or more
things that we could test separately) and uses them in a way that seems
as though it should be challenging for the program.
3. It is easy to tell whether the program passed or failed the test.
If a person has to spend significant time or effort to determine
whether the program passed or failed a series of tests, she will take
shortcuts and find ways to less expensively guess whether the program
is OK or not. These shortcuts will typically be imperfectly accurate
(that is, they may miss obvious bugs or they may flag correct code as
erroneous).
4. At least one stakeholder who has power will consider it a serious failure if the program cannot pass this scenario.
In practice, many scenarios will be weak in at least one of these
attributes, but people will still call them scenarios. The key message
of this pattern is that you should keep these four attributes in mind
when you design a scenario test and try hard to achieve them.
- How can we make this test more realistic?
- Can we add more issues / features / extreme values to the test that we are designing (that is, can we make the test more complex), while still having a realistic result?
- How (and how quickly) can we tell whether the program passed or failed?
- Who would care if the program failed this test?
Resulting Context
The resulting context is discussed under "Risks," below.
Risks
The key remaining problem is coverage. There is nothing inherent in
scenario testing that assures good coverage (by any measure) and it is
common to hear that a testing effort driven by scenarios achieved only
30% line coverage.
It is also often the case that scenario tests don't look carefully enough at common user errors, failure scenarios, or the situations caused by disfavored users (see Scenario 1 below).
Rationale / History
In The Art of Software Testing, Glenford Myers described a series
of ineffective tests. 35% of the bugs reported from the field had been
exposed by a test but the tester didn't notice or didn't appreciate the
failure, and so the bug escaped into the field. These (and many others
with the same problems) appear to be scenario-like tests (such as
transaction-flow tests that use customer data), but the expectation is
that testers will do their own calculations to determine whether the
printouts or other output are correct. The testers check a small sample of the tests, but miss defects in the many tests whose
results were not checked. The complexity of the tests makes it much
harder to work out expected results and check the program against them.
The push toward ease of checking results stems from this. You might
make it easy to check results by providing an oracle, or a set of
worked results, or internal consistency checks, or in various other
ways. The point in this pattern (and in discussions like Myers') is
that you must pay attention to this issue or the tests will expose
defects that no human recognizes.
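For example, here is a minimal sketch of an internal-consistency oracle, in Python. The invoice structure, field names, and tax rule are invented for illustration (imagine the accounting program mentioned earlier); the point is that the check never needs the "right answer" for the whole scenario, only relationships that any correct output must satisfy, so checking stays cheap enough that every test actually gets checked.

    from dataclasses import dataclass

    @dataclass
    class InvoiceLine:
        quantity: int
        unit_price: float
        line_total: float

    @dataclass
    class Invoice:
        lines: list       # list of InvoiceLine
        subtotal: float
        tax: float
        total: float

    def check_invoice_consistency(inv: Invoice, tax_rate: float, tol: float = 0.01) -> list:
        # Return a list of violations; an empty list means "no failure detected."
        problems = []
        for i, line in enumerate(inv.lines):
            if abs(line.quantity * line.unit_price - line.line_total) > tol:
                problems.append(f"line {i}: line_total != quantity * unit_price")
        if abs(sum(l.line_total for l in inv.lines) - inv.subtotal) > tol:
            problems.append("subtotal != sum of line totals")
        if abs(inv.subtotal * tax_rate - inv.tax) > tol:
            problems.append("tax != subtotal * tax_rate")
        if abs(inv.subtotal + inv.tax - inv.total) > tol:
            problems.append("total != subtotal + tax")
        return problems

A scenario run then ends with a call to check_invoice_consistency, and a failure shows up as a short list of violations rather than hiding in a stack of unexamined printouts.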
Another historical problem is the creation of complex tests that
appear artificial. It is extremely demoralizing to spend up to a week
building and running a complex test, find a bug, and discover that no
one thinks that it is important. The stress on writing/designing with a
stakeholder in mind comes from this concern. When I teach this in
classes (or to clients I've consulted for), the most common
objection is that testers don't necessarily know what failures will
catch the attention of what people. Sometimes (for example, when you
work for an independent test lab that has little contact with the
client), it is very hard to learn anything about the stakeholders. But
even if this is difficult, it is worth asking the question—what would
interest a reader of a test result report? How can I design this test
to yield results that would be more compelling? What would marketing
care about? (For example, can you set up a test that looks like
something that might be done by the company's single largest customer?
Or by a not-so-friendly journalist?) What has been driving our tech
support manager crazy? Asking the questions might help you change
aspects of the test design in ways that do not compromise the integrity of the test but that slightly or significantly increase the persuasive value
of the result. In my experience, as you show that you are paying
attention to the interests of others, you get feedback that makes you
more and more aware of those interests.
The problem of real-life testing is the problem of credibility. Any
complex test is open to dismissal (no one would do that, corner case,
artificial, etc.). Designing tests based on use cases, customer support
data, examples of actual things done with competitors' products or with
your product previously, etc., makes these tests much more credible.
Another issue is that the population of possible scenario tests is
virtually infinite. Some of the tests should be designed to reflect real uses, because otherwise a product that is over-engineered in other respects may well fail when customers try to do things that seem entirely reasonable to them and to reasonable third parties. Hans
Buwalda's Soap Operas are superb examples of real-life focus. (These
weave the description of the test case into a plausible story.)
These issues run throughout the test design literature and
conference talks. All that this definition of scenario testing does is
to gather the concepts together in a way that reflects practices that I
have seen as strong and effective in several companies (domains
including telephony, consumer game software, business-critical
financial application software, and office productivity software).
There is another somewhat related use of the term "scenario." A
scenario under this definition is an instance of a use case, or an
instance of a concatenation of use cases. We don't have a 1:1 mapping
of this type of scenario to the scenario defined here (and in my
practice for at least a decade), but the relationship is worth noting. A
use case is, by definition, customer realistic. The instance may or may
not be complex—many of the examples that I've seen are very simple
tests, but others are fully complex scenarios.
Examples of Scenario Tests
Scenario 1. A Security Test
Imagine testing the security features of a browser. You do an analysis
of the user community, including favored and disfavored users (see
Gause & Weinberg's book, Exploring Requirements). Disfavored users
include the population of hackers and crackers. Your product's design
objectives include making it more difficult for disfavored users to
perform the tasks that they want to do.
You study the types of attacks that have been successful against
browsers before and learn from CERT that 65% of the successful attacks
against systems involve buffer overruns. You also learn that load
testing often degrades performance unevenly. A part
of the system might crash under load, or might be run at a lower
priority than the rest. You also note that denial-of-service attacks
(in which target systems are put under heavy load) are increasing in
frequency and attracting more publicity in the mainstream press.
Therefore, you focus your testing on combinations of buffer-overrun and load attacks. Can a skilled hacker disable part of your security system by flooding it with a carefully selected pattern of inputs?
(When I say, "carefully selected pattern", I mean that your system can
be sent many different commands. The system might respond very
differently to millions of requests to process forms than millions of
requests to display the home page. Great load testing tries to generate
patterns of use that would reflect real-life; great security-related
load testing tries to generate patterns that might be attempted by
crackers.) It takes a lot of knowledge to design the security-oriented
load tests, and you might interview several people, read many log
files, and run experiments on the impact of many different usage patterns on the performance of individual features and of the system as a whole, on reliability, on the logs, etc. You might find several bugs in these
simpler tests, but this work (though productive) is in preparation for
your primary scenarios. You do similar work on buffer overruns, sending
many types of input, trying to trigger failures caused by excessive
input or by calculations that result in excessive intermediate or final
results. Again, you might find errors as you go, but your primary goal
is to identify ways in which the system protects itself from extremes
and then to target them by overworking them or by overworking routines
that would otherwise have called them.
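A minimal sketch of what a security-oriented load pattern might look like in Python follows. The server address, endpoints, field sizes, and weights are invented for illustration; in practice the weights would come from the interviews, log files, and experiments just described.

    import random
    from concurrent.futures import ThreadPoolExecutor

    import requests

    BASE_URL = "http://test-server.local"   # hypothetical system under test

    # A weighted mix of request types: the goal is to shape the load, not just
    # to maximize it. The weights here are invented; real ones would come from
    # log analysis and interviews.
    PATTERN = [
        ("form_post", 70),        # assumed to be the expensive processing path
        ("home_page", 20),
        ("oversized_field", 10),  # probes buffer handling while the system is under load
    ]

    def one_request(kind: str) -> int:
        if kind == "form_post":
            resp = requests.post(f"{BASE_URL}/form", data={"name": "x" * 200}, timeout=10)
        elif kind == "oversized_field":
            resp = requests.post(f"{BASE_URL}/form", data={"name": "x" * 1_000_000}, timeout=10)
        else:
            resp = requests.get(f"{BASE_URL}/", timeout=10)
        return resp.status_code

    def run_load(total_requests: int = 50_000, workers: int = 100, seed: int = 1) -> None:
        rng = random.Random(seed)
        kinds, weights = zip(*PATTERN)
        plan = rng.choices(kinds, weights=weights, k=total_requests)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for kind, status in zip(plan, pool.map(one_request, plan)):
                if status >= 500:
                    print(f"{kind}: server error {status}")  # candidate failure; check the server logs too

    if __name__ == "__main__":
        run_load()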
Your final series of tests generates massive loads in carefully designed ways and includes files, requests, or inputs that could trigger overflows. The tests also include probes: if a buffer has been overrun, you should be able to see the results. Either the system has crashed and is no longer available, or you can gain control and can now do something that you couldn't do before. These probes and diagnostic messages in
the server logs are your primary means of detecting failures. When you
detect a failure, you might report it directly, or you might subject it
to further analysis, troubleshooting, and replication under
increasingly simpler conditions.
Your final failure report lays out the security risk, the details
of the attack needed, and the consequences if someone does this. If
necessary, you include in the initial report (or more likely, at a bug
review meeting) newspaper reports of attacks that weren't hugely
dissimilar to yours, plus CERT data, discussions of strategy in
magazines like 2600, examples gleaned from the RISKS forum and various
security-related books and mailing lists, plus other evidence that your
tests aren't absurd.
These failures may stand on their own (security failures are of great
concern to a lot of companies), but if people respond to your reports
with "Who cares" and "No one would do that", then you keep in mind that
someone in your company cares about failures like this. Maybe it's your
head of marketing, maybe it's the company president, the lead PR
staffer, or a senior engineer. You might cc your report to that person,
or seek out that person's advice ("how do I report this more
effectively?" or "this looks important to me, but what do you think?")
or appeal to that person if the bug is deferred ("Can you review this
report for me? Is this an important problem? What more would we have to
do in order to make other readers understand its importance?")
Eventually, if the problem is serious, this person will help you make
management confront it, understand it and make a rational business
decision about it.
Scenario 2. Configuration Testing
You are testing a product that will run on a network. It is
supposed to work on the usual browsers, operating systems (MS, Linux,
UNIX, Apple), with the usual devices (video cards, printers, rodents)
and communications devices (modems, ethernet connectors, etc.). The
cross-product of all possible valid configurations yields 752 million
possible tests, which would take about 40 minutes each in setup and
teardown time, plus 50 minutes each in compatibility test time if you
run the tests by hand.
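The arithmetic, using only the numbers above, shows why exhaustive configuration testing is out of the question:

    # Rough arithmetic, using the stated figures. (This ignores parallel test
    # machines and automation; the point is only the scale of the problem.)
    total_configs = 752_000_000     # stated cross-product of valid configurations
    minutes_per_test = 40 + 50      # setup/teardown plus hand-run compatibility testing

    total_minutes = total_configs * minutes_per_test
    tester_years = total_minutes / 60 / 24 / 365

    print(f"{total_minutes:,} minutes of hand testing")  # 67,680,000,000 minutes
    print(f"about {tester_years:,.0f} tester-years")     # about 128,767 tester-years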
You have already done single-issue compatibility testing, such as
printer compatibility when running under a very standard set of other
configuration variables (latest IE version, Win 2000, best-selling
video card, Dell computer with original equipment, lots of memory, lots
of free disk space, the Dell-recommended best-selling ethernet card,
etc.). You can't test every instance of every device, but you pick the
key printers, the key video cards, etc.
You have also done some fairly mechanical combinations, such as
using the most memory-intensive video and printer devices and drivers
together. You found some problems this way. Some were fixed. Others
were blamed on the manufacturer of a peripheral ("This is their bug,
let them fix it!") or dismissed ("No one would set up their system that
way.")
Your challenge now is to set up a set of tests that will exercise
the system software and the peripherals in a way that is both harsh
(likely to expose problems) and realistic. If you pick strange-looking
combinations of devices and system software, you search through
customer records to show that there exist real humans who have
configurations like these, or through manufacturers' sell-you-a-system
websites to show that they do sell or will sell systems configured this
way. The tests you use include actual customer files (go to tech
support) and files based on them (but made harsher), and sequences of
tasks that are arguably plausible.
Your next challenge is to determine whether the program has passed
or failed. It's not enough to boot the program and set it up with the
new configuration. You have to try things, to do tasks that will depend
in some way on the configuration settings, and then you have to
interpret the results. Your strategy might involve test scripts with
checks against expected results, test oracles, printouts that show
exactly what the tester should expect to see on the screen, or other
methods for detecting a failure. There are so many tests, and they are so broken up by setup and teardown, that your test design should respect the fact that testers will become inattentive (perhaps because they are tired or very bored) after several hours or days of this type of testing. Comparisons must therefore be of obvious things, quick, or automated.
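One way to keep the comparisons obvious, quick, or automated is to store reviewed expected results and have the test harness report only the mismatches. A minimal sketch in Python follows; the task names, file layout, and run_task function are invented for illustration.

    import json
    from pathlib import Path

    def run_task(task_name: str, config: dict) -> dict:
        # Hypothetical stand-in: run one configuration-dependent task and return its results.
        return {}

    def check_against_expected(task_name: str, config: dict, expected_dir: Path = Path("expected")) -> list:
        # Compare actual results with stored, reviewed expected results, field by field.
        # Expected results are worked out once and kept as JSON; after that, a tired
        # tester at hour nine of configuration testing reads a short list of
        # mismatches instead of eyeballing the whole screen.
        actual = run_task(task_name, config)
        expected = json.loads((expected_dir / f"{task_name}.json").read_text())
        mismatches = []
        for field, want in expected.items():
            got = actual.get(field)
            if got != want:
                mismatches.append(f"{task_name}/{field}: expected {want!r}, got {got!r}")
        return mismatches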