This is a fascinating misconception.
It's quite possible to take music and compress it until no instant of it shows more than, say, 6 dB of dynamic range.
Concluding that you then need only one bit (or whatever: four?) to store it would be the fascinating misconception.
Test this out: do that compression at 24 bit. Then find the mp3 bitrate that just barely makes the output level start breaking up and swinging by more than 6 dB. Oops, that would take something like 16 kbps, if that. So instead let the dynamic swing stay basically the same, and encode the heavily compressed 24 bit file at 64 kbps.
Does it sound the same to you?
Here's what's happening, and why the 'even volume = low bits' idea is simply wrong:
To record music, it's not enough to capture the loudest harmonics at any given moment. You also need to capture the quieter overtones, or the sound will be audibly different, in a way that shows up in a double-blind test.
These overtones will be turning up at much lower levels. You'll have your 6 dB of dynamic range blasting away, but the moment where the guy half-misses the hi-hat during the guitar chord will be a series of high-frequency overtones that could be as far down as -100 dB in some cases. Get rid of all the -100 dB overtones in that hi-hat event and it'll sound different, because it's a very complex sound with many inharmonic components.
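To put rough numbers on that, here is a minimal Python sketch of the usual rule of thumb that each bit of linear PCM is worth about 6.02 dB. The function names `bits_to_span` and `lsb_level_dbfs` are made up for illustration, and the arithmetic ignores dither and noise shaping, so treat it as a back-of-envelope check rather than a statement about what is audible.

```python
import math

# Rule of thumb: doubling the number of levels adds 20*log10(2) dB.
DB_PER_BIT = 20.0 * math.log10(2.0)  # about 6.02 dB per bit


def bits_to_span(range_db):
    """Smallest whole number of bits whose level ratio covers range_db."""
    return math.ceil(range_db / DB_PER_BIT)


def lsb_level_dbfs(bits):
    """Level of one quantization step relative to full-scale peak (signed PCM)."""
    return -DB_PER_BIT * (bits - 1)


# What the loudness envelope alone would suggest:
print(bits_to_span(6))    # -> 1
# What it takes to reach a component 100 dB down:
print(bits_to_span(100))  # -> 17

# Size of a single step at some common bit depths, in dB below full scale.
for bits in (8, 12, 16, 24):
    print(bits, "bit:", round(lsb_level_dbfs(bits), 1), "dBFS")
```

The first number is where the 'one bit' idea comes from: it only counts the swing of the overall level. The second counts how far down the quietest component you care about sits, which is a different question.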
Yes, you're changing the sound even if you go to 16 bit on a heavily compressed, no-dynamic-range track, though at the 16 bit level it's still going to be challenging to hear a difference. I find it's more a felt difference: high-bit-depth stuff is more about a sense of ease and texture than about big glaring differences. How could it be otherwise, given the levels of the subtle overtones in question?
If you're talking about using 12 or 8 bit or something for heavily compressed tracks, and claiming there are mathematical reasons why this would have to be indistinguishable from 24 bit because the continuous output volume doesn't vary, then you're simply overlooking the nature of complex waveforms as a sum of sines, not all of which are going to be anywhere near full scale. You have to get the quiet ones right too, even if the body of the waveform never shuts up. If it's a real-world signal, it's going to have all levels of harmonics present at pretty much all times, and the threshold at which you can ignore the quietest ones doesn't change significantly when you play sound at a continuous volume. Unless you play it so loudly that you render the listener deaf, and that hasn't been the argument here.
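A quick way to see the 'sum of sines' point is to build a toy signal and round it to different bit depths. The sketch below is only an illustration under my own assumptions: a loud 441 Hz sine that never drops in level, plus a hypothetical overtone 70 dB below it at 7919 Hz, a 48 kHz sample rate, and plain rounding with no dither. It measures the rounding error, not audibility, so it is no substitute for a listening test.

```python
import math

RATE = 48000          # assumed sample rate, Hz
N = 24000             # half a second of samples
LOUD_AMP = 0.9        # the body of the waveform that never shuts up
OVERTONE_DB = -70.0   # hypothetical overtone level relative to the loud tone
quiet_amp = LOUD_AMP * 10 ** (OVERTONE_DB / 20.0)


def quantize(x, bits):
    """Round a sample in [-1.0, 1.0] to a signed PCM grid (no dither)."""
    scale = 2 ** (bits - 1) - 1
    return round(x * scale) / scale


def rms_db(values):
    """RMS level in dB relative to a full-scale amplitude of 1.0."""
    mean_square = sum(v * v for v in values) / len(values)
    return 10.0 * math.log10(mean_square)


loud = [LOUD_AMP * math.sin(2 * math.pi * 441.0 * i / RATE) for i in range(N)]
quiet = [quiet_amp * math.sin(2 * math.pi * 7919.0 * i / RATE) for i in range(N)]
signal = [a + b for a, b in zip(loud, quiet)]


def error_level_db(bits):
    """RMS of the rounding error at the given bit depth, in dBFS."""
    return rms_db([quantize(s, bits) - s for s in signal])


overtone_level = rms_db(quiet)
for bits in (8, 12, 16, 24):
    print(bits, "bit: error", round(error_level_db(bits), 1),
          "dBFS vs overtone", round(overtone_level, 1), "dBFS")
```

The overall level of this signal is dead constant, yet at 8 bits the rounding error should come out well above the overtone, while at 16 and 24 bits it sits well below it. The steady loud tone does nothing to protect the quiet component.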