Only time will tell
Why is there a short wait if you press a button on your headphone remote or your AirPods to pause the music? Because the interface has to let a bit of time pass to figure out if you’re going to press the button again, making it a double press (advance to next track) instead of a single press.
This kind of disambiguation delay is everywhere for simple gestures.
Why is there a short wait if you press a button twice in that situation? The double press processing also has to be delayed, because there is a chance it might become a triple press (go to previous track).

Why is there a short wait if you press a button to go to the next track on your car’s steering wheel? It’s a delay of a different kind, but the same principle: the function cannot kick in on press down, because press down and hold mean “fast forward.” So, software has to wait for button up event to go to the next track (which feels a bit slower than button down), or for enough time to pass so we’re certain it’s a button-down hold rather than a slow press. Here, both interactions experience a penalty for coexisting.
The most infamous of those disambiguation delays exists in mobile browsers. Since every double tap can zoom into the page ever since that famous 2007 iPhone presentation, every single tap on a link or elsewhere has to be delayed by about 300ms. This has been a source of contention since it does make the web feel a bit slower, and today browsers suspend double tapping on sites designed for mobile, trading zooming affordances for higher interaction speed – after all, you can still zoom in by pinching. But if you always wondered why older websites tend to be a bit sluggish to interact with, now you know.
Different tradeoffs are possible. In the Finder, clicking on icons isn’t slowed down even though double clicking exists, because selecting an icon is compatible with opening it! So in effect it’s not a choice between a faster A and a slower B – it’s A or A+B.
Even in the iPhone presentation above, you can see the interface highlights the link on double tap, to at least make it feel snappier, at the expense of the highlight being “wrong” and potentially distracting – or even confusing – when you end up double tapping. (You can imagine smartphones pausing on the first remote/headset button press, too. It feels like it would be compatible with advancing to the next track, but I think it might also feel too “choppy,” too chaotic, in practice.)
Lastly, why is there a short wait if you press a button on your hotel TV to increase the volume? Oh, I think that one is just sluggish for no good reason.