Should one test internal implementation, or only test public behaviour?
Given software where ...
- The system consists of a few subsystems
- Each subsystem consists of a few components
- Each component is implemented using many classes
... I like to write automated tests of each subsystem or component.
I don't write a test for each internal class of a component (except inasmuch as each class contributes to the component's public functionality and is therefore testable/tested from outside via the component's public API).
When I refactor the implementation of a component (which I often do, as part of adding new functionality), I therefore don't need to alter any existing automated tests, because the tests depend only on the component's public API, and the public APIs are typically expanded rather than altered.
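For illustration, here's a minimal sketch of the kind of component-level test I mean (the `OrderPricing` interface and the stand-in implementation are hypothetical): the test drives only the public interface, so the classes behind that interface can be refactored freely without touching the test.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

// Hypothetical public API of a pricing component; in reality this interface would live in
// the component's assembly and hide dozens of implementation classes.
interface OrderPricing {
    long priceInCents(String productCode, int quantity);
}

// Stand-in implementation so the sketch compiles; a real test would obtain the component
// through whatever factory or entry point the assembly exposes.
class SimpleOrderPricing implements OrderPricing {
    public long priceInCents(String productCode, int quantity) {
        long unit = 500;                         // flat unit price for the sketch
        long discount = quantity >= 10 ? 50 : 0; // bulk discount per unit
        return (unit - discount) * quantity;
    }
}

class OrderPricingComponentTest {
    private final OrderPricing pricing = new SimpleOrderPricing();

    @Test
    void quantityDiscountIsApplied() {
        // Only observable behaviour is asserted; nothing here refers to internal classes.
        assertTrue(pricing.priceInCents("WIDGET", 10) < 10 * pricing.priceInCents("WIDGET", 1));
    }
}
```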
I think this policy contrasts with a document like Refactoring Test Code, which says things like ...
- "... unit testing ..."
- "... a test class for every class in the system ..."
- "... test code / production code ratio ... is ideally considered to approach a ratio of 1:1 ..."
... all of which I suppose I disagree with (or at least don't practice).
My question is, if you disagree with my policy, would you explain why? In what scenarios is this degree of testing insufficient?
In summary:
- Public interfaces are tested (and retested), and rarely change (they're added to but rarely altered)
- Internal APIs are hidden behind the public APIs, and can be changed without rewriting the test cases which test the public APIs
Footnote: some of my 'test cases' are actually implemented as data. For example, test cases for the UI consist of data files which contain various user inputs and the corresponding expected system outputs. Testing the system means having test code which reads each data file, replays the input into the system, and asserts that it gets the corresponding expected output.
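A minimal sketch of that kind of data-driven harness (the `Ui.replay` entry point and the `testcases/` directory layout with `*.input.txt` / `*.expected.txt` pairs are hypothetical stand-ins for the real system):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Stand-in for the real system under test: in this sketch it simply echoes the input.
class Ui {
    static String replay(String recordedInput) {
        return recordedInput;
    }
}

class DataDrivenUiTest {

    // One test case per *.input.txt file found in the test-case directory.
    static Stream<Path> testCases() throws IOException {
        return Files.list(Paths.get("testcases"))
                    .filter(p -> p.toString().endsWith(".input.txt"));
    }

    @ParameterizedTest
    @MethodSource("testCases")
    void replayedInputProducesExpectedOutput(Path inputFile) throws IOException {
        Path expectedFile = Paths.get(inputFile.toString()
                                               .replace(".input.txt", ".expected.txt"));
        String input = Files.readString(inputFile);
        String expected = Files.readString(expectedFile);

        // Replay the recorded user input and assert the system's output hasn't changed.
        String actual = Ui.replay(input);

        assertEquals(expected, actual, "Output changed for " + inputFile.getFileName());
    }
}
```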
Although I rarely need to change test code (because public APIs are usually added to rather than changed), I do find that I sometimes (e.g. twice a week) need to change some existing data files. This can happen when I change the system output for the better (i.e. new functionality improves existing output), which might cause an existing test to 'fail' (because the test code only asserts that the output hasn't changed). To handle these cases I do the following:
- Rerun the automated test suite with a special run-time flag, which tells it not to assert the output, but instead to capture the new output into a new directory (sketched in code after this list)
- Use a visual diff tool to see which output data files (i.e. what test cases) have changed, and to verify that these changes are good and as expected given the new functionality
- Update the existing tests by copying the new output files from the new directory into the directory from which test cases are run (overwriting the old expected output)
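A sketch of the assert-or-capture switch used in the first step (the `capture.dir` system property is a hypothetical example of such a run-time flag):

```java
import java.io.IOException;
import java.nio.file.*;
import static org.junit.jupiter.api.Assertions.assertEquals;

class GoldenOutput {

    static void assertOrCapture(String actual, Path expectedFile) throws IOException {
        String captureDir = System.getProperty("capture.dir");
        if (captureDir == null) {
            // Normal run: assert that the output has not changed.
            assertEquals(Files.readString(expectedFile), actual);
        } else {
            // Capture run: write the new output to a separate directory so it can be
            // reviewed with a visual diff tool and, if approved, copied over the old file.
            Path captured = Paths.get(captureDir, expectedFile.getFileName().toString());
            Files.createDirectories(captured.getParent());
            Files.writeString(captured, actual);
        }
    }
}
```

A capture run is then just the normal test run with something like `-Dcapture.dir=new-output` passed as a system property, and the review step is an ordinary directory diff.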
Footnote: by "component", I mean something like "one DLL" or "one assembly" ... something that's big enough to be visible on an architecture or deployment diagram of the system, often implemented using dozens or even a hundred classes, and with a public API that consists of only one or a handful of interfaces ... something that may be assigned to one team of developers (where a different component is assigned to a different team), and which will therefore, according to Conway's Law, have a relatively stable public API.
Footnote: The article Object-Oriented Testing: Myth and Reality says,
Myth: Black box testing is sufficient. If you do a careful job of test case design using the class interface or specification, you can be assured that the class has been fully exercised. White-box testing (looking at a method's implementation to design tests) violates the very concept of encapsulation.
Reality: OO structure matters, part II. Many studies have shown that black-box test suites thought to be excruciatingly thorough by developers only exercise from one-third to a half of the statements (let alone paths or states) in the implementation under test. There are three reasons for this. First, the inputs or states selected typically exercise normal paths, but don't force all possible paths/states. Second, black-box testing alone cannot reveal surprises. Suppose we've tested all of the specified behaviors of the system under test. To be confident there are no unspecified behaviors we need to know if any parts of the system have not been exercised by the black-box test suite. The only way this information can be obtained is by code instrumentation. Third, it is often difficult to exercise exception and error-handling without examination of the source code.
I should add that I'm doing whitebox functional testing: I see the code (in the implementation) and I write functional tests (which drive the public API) to exercise the various code branches (details of the feature's implementation).
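For example (a hypothetical sketch): knowing from the implementation that there is a separate branch for empty input, I add a functional test whose input is chosen to drive that branch, while still asserting only through the public API.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertArrayEquals;

// Hypothetical public API with an implementation detail (an early-return branch for
// empty input) that a purely specification-driven black-box suite might never exercise.
class Tokenizer {
    static String[] tokenize(String line) {
        if (line.isEmpty()) {        // internal branch: empty input
            return new String[0];
        }
        return line.split("\\s+");   // normal path
    }
}

class TokenizerWhiteboxFunctionalTest {

    @Test
    void normalPath() {
        assertArrayEquals(new String[] {"a", "b"}, Tokenizer.tokenize("a b"));
    }

    @Test
    void emptyInputBranch() {
        // Chosen by reading the implementation: exercises the early-return branch,
        // but the assertion is still expressed purely in terms of public behaviour.
        assertArrayEquals(new String[0], Tokenizer.tokenize(""));
    }
}
```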
The answer is very simple: you are describing functional testing, which is an important part of software QA. Testing the internal implementation is unit testing, which is another part of software QA with a different goal. That's why you feel that people disagree with your approach.
Functional testing is important to validate that the system or subsystem does what it is supposed to do. Anything the customer sees should be tested this way.
Unit testing is there to check that the ten lines of code you just wrote do what they are supposed to do. It gives you higher confidence in your code.
Both are complementary. If you work on an existing system, functional testing is probably the first thing to work on. But as soon as you add code, unit-testing it is a good idea too.
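To illustrate the difference in scope (the names here are hypothetical): a unit test pins down the ten lines you just wrote directly, so when it fails you know exactly where to look.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// The "ten lines you just wrote": a small internal helper, hypothetical here.
class DiscountCalculator {
    static long discountPerUnitInCents(int quantity) {
        if (quantity >= 100) return 75;
        if (quantity >= 10)  return 50;
        return 0;
    }
}

// A unit test exercises the helper directly, so a failure points straight at it,
// whereas a functional test through the component's public API would only report
// that some price somewhere came out wrong.
class DiscountCalculatorTest {
    @Test
    void thresholds() {
        assertEquals(0,  DiscountCalculator.discountPerUnitInCents(9));
        assertEquals(50, DiscountCalculator.discountPerUnitInCents(10));
        assertEquals(75, DiscountCalculator.discountPerUnitInCents(100));
    }
}
```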
My practice is to test the internals through the public API/UI. If some internal code cannot be reached from the outside, then I refactor to remove it.
I don't have my copy of Lakos in front of me, so rather than cite him directly I will merely point out that he does a better job than I will of explaining why testing is important at all levels.
The problem with testing only "public behavior" is that such a test gives you very little information. It will catch many bugs (just as the compiler will catch many bugs), but cannot tell you where the bugs are. It is common for a badly implemented unit to return good values for a long time and then stop doing so when conditions change; if that unit had been tested directly, the fact that it was badly implemented would have been evident sooner.
The best level of test granularity is the unit level. Provide tests for each unit through its interface(s). This allows you to validate and document your beliefs about how each component behaves, which in turn allows you to test dependent code by only testing the new functionality it introduces, which in turn keeps tests short and on target. As a bonus, it keeps tests with the code they're testing.
To phrase it differently, it is correct to test only public behavior, so long as you notice that every publicly visible class has public behavior.
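As a rough sketch of what that looks like (interfaces and classes hypothetical): the unit is tested through its own interface, and code that depends on it is then tested against a trivial stub of that interface, so its tests only need to cover the new behaviour it adds.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// A unit with public behaviour of its own, tested through its interface.
interface TaxPolicy {
    long taxInCents(long netInCents);
}

class FlatTaxPolicy implements TaxPolicy {
    public long taxInCents(long netInCents) {
        return netInCents / 5; // 20% flat tax
    }
}

class FlatTaxPolicyTest {
    @Test
    void twentyPercent() {
        assertEquals(200, new FlatTaxPolicy().taxInCents(1000));
    }
}

// Dependent code: its tests assume TaxPolicy already works (it has its own tests above),
// so they only cover the new behaviour -- adding the tax to the net amount.
class InvoiceTotal {
    private final TaxPolicy tax;
    InvoiceTotal(TaxPolicy tax) { this.tax = tax; }
    long grossInCents(long netInCents) { return netInCents + tax.taxInCents(netInCents); }
}

class InvoiceTotalTest {
    @Test
    void addsTaxToNet() {
        TaxPolicy stub = net -> 7; // trivial stub keeps this test focused on InvoiceTotal
        assertEquals(1007, new InvoiceTotal(stub).grossInCents(1000));
    }
}
```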