Good Test Cases Copyright © Cem Kaner 2003. All rights reserved. Page 4 Let’s narrow our focus to the test group that has two primary objectives: Find bugs that the rest of the development group will consider relevant (worth reporting) Get these bugs fixed. Even within these objectives, tests can be good in many different ways. For example, we might say that one test is better than another if it is: More powerful. I define power in the usual statistical sense as more likely to expose a bug if it the bug is there. Note that Test 1 can be more powerful than Test 2 for one type of bug and less powerful than Test 2 for a different type of bug. More likely to yield significant (more motivating, more persuasive) results. A problem is significant if a stakeholder with influence would protest if the problem is not fixed. (A stakeholder is a person who is affected by the product. A stakeholder with influence is someone whose preference or opinion might result in change to the product.) More credible. A credible test is more likely to be taken as a realistic (or reasonable) set of operations by the programmer or another stakeholder with influence. “Corner case” is
Good Test Cases Copyright © Cem Kaner 2003. All rights reserved. Page 5 an example of a phrase used by programmers to say that a test or bug is non-credible: “No one would do that.” A test case is credible if some (or all) stakeholders agree that it is realistic. Representative of events more likely to be encountered by the customer. A population of tests can be designed to be highly credible. Set up your population to reflect actual usage probabilities. The more frequent clusters of activities are more likely to be covered or covered more thoroughly. (I say cluster of activities to suggest that many features are used together and so we might track which combinations of features are used and in what order, and reflect this more specific information in our analysis.) For more details, read Musa's (1998) work on software reliability engineering. Easier to evaluate. The question is, did the program pass or fail the test? Ease of Evaluation. The tester should be able to determine, quickly and easily, whether the program passed or failed the test. It is not enough that it is possible to tell whether the program passed or failed. The harder evaluation is, or the longer it takes, the more likely it is that failures will slip through unnoticed. Faced with time-consuming evaluation, the tester will take shortcuts and find ways to less expensively guess whether the program is OK or not. These shortcuts will typically be imperfectly accurate (that is, they may miss obvious bugs or they may flag correct code as erroneous.) More useful for troubleshooting. For example, high volume automated tests will often crash the system under test without providing much information about the relevant test conditions needed to reproduce the problem. They are not useful for troubleshooting. Tests that are harder to repeat are less useful for troubleshooting. Tests that are harder to perform are less likely to be performed correctly the next time, when you are troubleshooting a failure that was exposed by this test. More informative. A test provides value to the extent that we learn from it. In most cases, you learn more from the test that the program passes than the one the program fails, but the informative test will teach you something (reduce your uncertainty) whether the program passes it or fails. o For example, if we have already run a test in several builds, and the program reliably passed it each time, we will expect the program to pass this test again. Another "pass" result from the reused test doesn't contribute anything to our mental model of the program. o The notion of equivalence classes provides another example of information value. Behind any test is a set of tests that are sufficiently similar to it that we think of the other tests as essentially redundant with this one. In traditional jargon, this is the "equivalence class" or the "equivalence partition." If the tests are sufficiently similar, there is little added information to be obtained by running the second one after running the first. o This criterion is closely related to Karl Popper’s theory of value of experiments (See Popper 1992). Good experiments involve risky predictions. The theory predicts something that many people would expect not to be true. Either your favorite theory is false or lots of people are surprised. Popper’s analysis of what makes for good experiments (good tests) is a core belief in a mainstream approach to the philosophy of science.
More useful for troubleshooting. For example, high volume automated tests will often crash the system under test without providing much information about the relevant test conditions needed to reproduce the problem. They are not useful for troubleshooting. Tests that are harder to repeat are less useful for troubleshooting. Tests that are harder to perform are less likely to be performed correctly the next time, when you are troubleshooting a failure that was exposed by this test.
Good Test Cases Copyright © Cem Kaner 2003. All rights reserved. Page 5 More informative. A test provides value to the extent that we learn from it. In most cases, you learn more from the test that the program passes than the one the program fails, but the informative test will teach you something (reduce your uncertainty) whether the program passes it or fails. o For example, if we have already run a test in several builds, and the program reliably passed it each time, we will expect the program to pass this test again. Another "pass" result from the reused test doesn't contribute anything to our mental model of the program. o The notion of equivalence classes provides another example of information value. Behind any test is a set of tests that are sufficiently similar to it that we think of the other tests as essentially redundant with this one. In traditional jargon, this is the "equivalence class" or the "equivalence partition." If the tests are sufficiently similar, there is little added information to be obtained by running the second one after running the first. o This criterion is closely related to Karl Popper’s theory of value of experiments (See Popper 1992). Good experiments involve risky predictions. The theory predicts something that many people would expect not to be true. Either your favorite theory is false or lots of people are surprised. Popper’s analysis of what makes for good experiments (good tests) is a core belief in a mainstream approach to the philosophy of science.
Good Test Cases Copyright © Cem Kaner 2003. All rights reserved. Page 6 the essential consideration here is that the expected value of what you will learn from this test has to be balanced against the opportunity cost of designing and running the test. The time you spend on this test is time you don't have available for some other test or other activity. Appropriately complex. A complex test involves many features, or variables, or other attributes of the software under test. Complexity is less desirable when the program has changed in many ways, or when you’re testing many new features at once. If the program has many bugs, a complex test might fail so quickly that you don’t get to run much of it. Test groups that rely primarily on complex tests complain of blocking bugs. A blocking bug causes many tests to fail, preventing the test group from learning the other things about the program that these tests are supposed to expose. Therefore, early in testing, simple tests are desirable. As the program gets more stable, or (as in eXtreme Programming or any evolutionary development lifecycle) as more stable features are incorporated into the program, greater complexity becomes more desirable. More likely to help the tester or the programmer develop insight into some aspect of the product, the customer, or the environment. Sometimes, we test to understand the product, to learn how it works or where its risks might be. Later, we might design tests to expose faults, but especially early in testing we are interested in learning what it is and how to test it. Many tests like this are never reused. However, in a test-first design environment, code changes are often made experimentally, with the expectation that the (typically, unit) test suite will alert the programmer to side effects. In such an environment, a test might be designed to flag a performance change, a difference in rounding error, or some other change that is not a defect. An unexpected change in program behavior might alert the programmer that her model of the code or of the impact of her code change is incomplete or wrong, leading her to additional testing and troubleshooting. (Thanks to Ward Cunningham and Brian Marick for suggesting this example.) Test Styles/Types and Test Qualities Within the field of black box testing, Kaner & Bach (see our course notes, Bach 2003b and Kaner, 2002, posted at www.testingeducation.org, and see Kaner, Bach & Pettichord, 2002) have described eleven dominant styles of black box testing: Function testing Domain testing Specification-based testing Risk-based testing Stress testing Regression testing User testing Scenario testing State-model based testing High volume automated testing
Exploratory testing Bach and I call these "paradigms" of testing because we have seen time and again that one or two of them dominate the thinking of a testing group or a talented tester. An analysis we find intriguing goes like this: If I was a "scenario tester" (a person who defines testing primarily in terms of application of scenario tests), how would I actually test the program? What makes one scenario test better than another? Why types of problems would I tend to miss, what would be difficult for me to find or interpret, and what would be particularly easy? Here are thumbnail sketches of the styles, with some thoughts on how test cases are “good” within them. tests.
|