I think "I want to have a cup of coffee" covers the whole process of having a cup of coffee (e.g. drink a bit --> rest a while --> drink again).
Most of the time, when people have a cup of coffee, they don't drink the whole thing at once. They drink a bit, enjoy the time passing or have a little chat with someone, then drink again. So it takes time.
But "drink a cup of coffee" indicates only the action of drinking it, which is more of an instant thing.
It's just my opinion, but I think it makes sense.
Why doesn't (1) work? Because having or drinking coffee is not an instantaneous action that can take place at the moment I enter the room.