Sounds like a question you should put up in Bruno's forum.
First, you shouldn't conflate terms. Sample rate and bit depth are two independent aspects of digital audio. Rate is how often a sample is measured, and bit depth is the level of detail of each measurement. Dithering only has to do with the latter, when converting from a higher to a lower bit depth.
An AD converter measures a voltage at (ideally) regular time intervals. Each measurement is quantized to the nearest bit value, and the timing of each measurement is governed by a clock.
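To illustrate the quantization step, here's a toy sketch (not what real converter hardware does, just the rounding idea): a voltage in the range -1.0 to 1.0 is snapped to the nearest step of the grid that a given bit depth allows.

```python
def quantize(voltage, bits=16):
    """Round a voltage in [-1.0, 1.0] to the nearest step
    of a signed grid with 2**bits levels. Toy sketch only."""
    levels = 2 ** (bits - 1)             # e.g. 32768 steps per polarity at 16-bit
    code = round(voltage * (levels - 1)) # snap to nearest integer code
    return code / (levels - 1)           # back to a voltage on the grid

print(quantize(0.5, bits=3))   # coarse grid: few levels, big rounding error
print(quantize(0.5, bits=16))  # fine grid: rounding error is tiny
```

The rounding error here is the quantization error that dithering is designed to decorrelate, which is why dithering belongs to the bit-depth side of the story.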
To answer your question, you are half right. Only half, because you are taking two variables into account when only one is necessary. Bit depth has nothing to do with your question, so put that aside for now.
When it comes to sample rate, you're right. If you are going to downsample to CD quality, it makes much more sense to sample at an integer multiple of 44.1 kHz (like 88.2 kHz).
It helps to think of two grids with the same bit depth but different sample rates. Think of bit depth as resolution in the y direction, and sample rate as resolution in the x direction (where x is time, and y is a measurement of voltage).
If your sample rate is an integer multiple of the destination rate (88.2 vs 44.1 in this case), then no interpolation has to be imposed on the waveform. Since every other sample lines up perfectly between grids in the time domain, the SRC only has to drop the sample in between. If your sample rate is 3 times 44.1, then you would keep every third sample and drop the two in between. The reasoning is that if your integer is n, then every nth source sample lines up perfectly with the destination grid. Ditch the extra samples, and the analog circuitry of your DAC reconstructs a smooth waveform which lines up exactly with your original file up to a certain frequency.
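In the time domain, integer-ratio downsampling reduces to keeping every nth sample, which a quick sketch makes concrete. (Real converters also low-pass filter first to remove content above the new Nyquist frequency; this sketch shows only the sample-alignment point.)

```python
src = list(range(12))   # stand-in for a run of samples at the higher rate

down_2x = src[::2]      # 88.2 -> 44.1: keep every 2nd sample
down_3x = src[::3]      # 132.3 -> 44.1: keep every 3rd sample, drop the two between

print(down_2x)  # [0, 2, 4, 6, 8, 10]
print(down_3x)  # [0, 3, 6, 9]
```

Every kept sample is an actual measurement; nothing had to be computed.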
To see why a non-integer ratio doesn't work as well, it helps to think in only one dimension, namely time. For instance, let's take 96 kHz and 44.1 kHz. Those two grids land on the same point in time only once every 320 source samples (once every 147 destination samples). Since only that one point on the destination grid matches the source grid temporally, every other point has to be calculated mathematically. This is where interpolation comes in.
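The coincidence interval between the two grids can be computed exactly with a greatest common divisor: the grids land on the same instant once every 1/gcd(96000, 44100) seconds.

```python
from math import gcd

src_rate, dst_rate = 96000, 44100
g = gcd(src_rate, dst_rate)   # 300, so the grids coincide every 1/300 s

print(src_rate // g)   # 320 source samples between coincidences
print(dst_rate // g)   # 147 destination samples between coincidences
```

So only 1 destination sample in 147 falls on a measured source sample; the other 146 have to be computed.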
Instead of using what was actually measured, we have to calculate a value for each destination grid point based on where it falls relative to the source grid. If the destination sample point sits exactly halfway between two source sample points, we could simply take the average of the two measured values. However, this is rarely the case. Usually the destination point sits closer to one source point than the other, so each neighboring measurement has to be weighed by how close it is to the destination point, so as to provide as accurate a picture as possible (a weighted average).
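A minimal sketch of that weighted average is linear interpolation between the two nearest source measurements. (Real sample-rate converters use much longer filters than two points, but the distance-weighting idea is the same.)

```python
def lerp(y0, y1, frac):
    """Interpolate between measurements y0 and y1.
    frac is how far the destination point sits past y0,
    as a fraction of the source sample interval (0.0 to 1.0)."""
    return (1.0 - frac) * y0 + frac * y1

print(lerp(0.2, 0.6, 0.5))   # exactly halfway: a plain average
print(lerp(0.2, 0.6, 0.25))  # closer to the first point, so weighted toward 0.2
```

Note that the output of `lerp` is a computed estimate, not a measurement; only when `frac` is 0 or 1 does it return a value that was actually sampled.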
Obviously, this requires computation of a value that may or may not have actually existed in the original wave being measured.
In short, sampling the original wave at an integer multiple of the destination sample rate requires no interpolation, since the grids line up perfectly at every multiple. All that is required is dropping the samples in between. But converting between sample rates whose grids take many cycles to line up again requires computation to approximate every value that doesn't line up in the time domain.
So, stick to a sample rate that is an integer multiple of your destination. Or record at your destination sample rate (if your release is CD quality, record at 44.1 kHz instead of 48 kHz). You will end up with real measurements instead of estimates of what occurs between them.