How to synchronize collaborative music performances with Google+ Hangouts

Someone just asked me via email how she can synchronize her singing over a Google+ Hangout with a musician on the other end, when it seemed like there was a time delay that was tripping them up. She wanted to know how to eliminate the time delay, or if Google was planning to eliminate it at some point.  Here is my reply, posted here in case it helps somebody else.


Great question! It was my mother's birthday last week, and the family got together in four different venues over Google Hangout to sing "When I'm 64" to her for her birthday :-)  I called it before we even started the hangout: the latency (delay) would cause us to all keep slowing down to let each other catch up, then realize that everybody was getting further behind, so we all needed to speed up and skip ahead, and then we would slow down again, etc.  Sure enough, every few seconds we seemed to have singing synchronization issues. It made the whole thing a lot funnier, but it wouldn't work for your situation at all!

In the general case, this is not solvable, for the same reasons that Einstein said that all simultaneity is relative: when it takes a non-zero amount of time to send information from point A to point B, and back again to point A, it's impossible for both point A and point B to agree on a global concept of "now". You simply cannot reduce the latency of a network connection to zero, let alone the latency of a complex streamed application like a Hangout running over that connection, and the further apart you are in the world, the greater the expected latency.
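To put a rough number on that floor, here is a minimal sketch that computes the smallest possible one-way delay for a given cable distance. It assumes the signal travels through optical fiber at about two thirds of the speed of light in a vacuum, and it ignores routing, encoding and buffering, all of which add further delay in practice. The function name and the example distances are made up for illustration.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_FACTOR = 2 / 3               # assumption: light in fiber travels at ~2/3 c


def min_one_way_delay_ms(distance_km, fiber_factor=FIBER_FACTOR):
    """Lower bound on one-way delay (in ms) over a fiber path of the given length.

    This is propagation delay only. Real connections are slower because of
    routing detours, audio/video encoding, and buffering.
    """
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    signal_speed_km_s = SPEED_OF_LIGHT_KM_S * fiber_factor
    return distance_km / signal_speed_km_s * 1000.0


if __name__ == "__main__":
    # Hypothetical path lengths, chosen only to show how the bound scales.
    for km in (100, 1_000, 10_000, 20_000):
        print(f"{km:>6} km: at least {min_one_way_delay_ms(km):.1f} ms one way")
```

Even this best case gives about 50 ms one way over a 10,000 km path, before the application itself adds anything.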

The way that this has been solved in the past (e.g. by that massive virtual orchestra / virtual choir project that has been run over YouTube a couple of times before) was to pre-record the music, and have each singer play the sound in their headphones while singing / playing. Then they each separately recorded their videos and sent them to someone who mixed them down into a single track, offline, after they had all finished recording their separate tracks.  i.e. they simply avoided the problem entirely by not performing simultaneously :-)

If I were you, I would simply experiment with performing simultaneously. Maybe you can practice having one of you (the one on the recording end) sing exactly in time with what they hear from the other person, while the other person plays/sings exactly 2x the one-way delay ahead of what they hear coming back. The trick would be to have the performer who is playing ahead (not on the recording end) set the tempo and basically pay no attention to the person on the recording end (i.e. don't try to slow down to let them catch up). As long as the person on the recording end is on time and keeps up with the one who is leading the piece, nobody will know about the synchronization issues.
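The arithmetic behind that "2x" can be sketched as a tiny timeline model. It assumes the delay is the same in both directions and that the follower locks perfectly onto what they hear. The function names and the 150 ms figure in the usage example are illustrative assumptions, not measurements.

```python
def heard_lags_ms(one_way_delay_ms):
    """Model one beat in the leader/follower setup.

    The leader plays the beat at t = 0. The follower (on the recording end)
    hears it one delay later and sings along with what they hear.
    Returns (lag at the recording end, lag the leader hears), both in ms.
    """
    leader_plays = 0.0
    follower_hears_leader = leader_plays + one_way_delay_ms
    follower_sings = follower_hears_leader  # follower locks onto the incoming sound
    leader_hears_follower = follower_sings + one_way_delay_ms

    lag_at_recording_end = follower_sings - follower_hears_leader  # always 0
    lag_heard_by_leader = leader_hears_follower - leader_plays     # 2 x one-way delay
    return lag_at_recording_end, lag_heard_by_leader


def lag_in_beats(lag_ms, tempo_bpm):
    """Express a lag as a fraction of one beat at the given tempo."""
    beat_ms = 60_000.0 / tempo_bpm
    return lag_ms / beat_ms


if __name__ == "__main__":
    # Hypothetical one-way delay of 150 ms at 120 beats per minute.
    at_recording, at_leader = heard_lags_ms(150.0)
    print(f"recording end hears a lag of {at_recording:.0f} ms")
    print(f"leader hears a lag of {at_leader:.0f} ms "
          f"({lag_in_beats(at_leader, 120):.2f} of a beat)")
```

Under these assumptions the recording end hears the two parts lined up, while the leader hears the follower more than half a beat late, which is why the leader has to ignore what comes back.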

If you're not recording locally, but rather broadcasting the Hangout live, you both need to split the delay equally, so that each of you sings/plays at exactly 1x the one-way time delay ahead of the other person (or ahead of what you hear coming out of your speakers). Actually, you probably need to play 0.5x the latency ahead of what you hear coming out of your speakers, because each connection is routed through Google's servers and then back out to the other person, and it's from Google's servers that the two different video signals are mixed and then broadcast out to the rest of the world.

I hope this makes sense. There's really no way around this for live hangouts though! (But you might be able to make it work for recordings.)


  1. Of course you're right that there are theoretical reasons why this can't be done perfectly in all network conditions. But I think there's room for tool support that would do a pretty good job most of the time.

    How about a shared (visual and/or auditory) metronome? You could even synchronize it with NTP-like techniques, if that level of precision matters. From a user's perspective, advice like "split the delay equally" is hard to follow without some kind of reference.

    Of course, since hangouts work to create an illusion of simultaneity, getting users to understand delays could be difficult. But many musicians already have an understanding of time delays, just from the speed of sound--that's part of the reason for having a conductor, after all.

  2. The point is that whether or not you have a metronome, the other person is always going to sound a fraction of a second behind you. (I'm estimating 100-200ms, but I have no idea, I haven't measured it, it could be as high as 500ms.) So if you're listening to the other person -- at all -- you're going to run into issues. A metronome would only solve part of the problem, by at least making you keep pace. It wouldn't allow you to respond directly to the other person in real time, and it's still going to mess with your brain that the other person always feels like they're lagging behind you.

  3. Just saw Luke's reply, but I think this is still relevant:

    I think the metronome idea is a great one, but you don't need to sync it to NTP, because again you run into the same issues when you add latency to the mix. The issue is the mix/output. Let's say I'm in a hangout with B and C. If C has higher or more variable latency than B, then C will be out of sync with B from my perspective even if the metronome is perfectly "in sync". But if the *video feeds* are synchronised to the metronome in my session (as host), allowing a bit of a buffer for variations in guest latencies, then we can all perform in sync (at least from my perspective), and I can record the performance "live" (sort of). This is effectively what Luke is saying with "set the tempo and pay no attention", but in the sync'ed video feed version it can be effectively utilised by multiple parties, not just two. Ideally all the guest participants probably need the entire audio feed muted, since hearing it would be incredibly off-putting, as Luke pointed out. They only need to know their part and stay in time.

    As for live streaming, the video stream sync and audio mixing could be done server-side rather than host-side, with the same result.

    The other area where this won't (can't?) work is multi-party improvisation. There's something to think about.

  4. Right, there are two issues: video synchronization at the server (assuming streaming/broadcast from the server, or recording at the server), and the fact that we are psychologically affected by hearing something delayed that is supposed to be synchronized with us. On the latter point, if you have ever heard a very loud delayed echo of your own voice in your cellphone, you'll know exactly what I mean, and how hard it is to switch off the part of your brain that tries to synchronize with it. It can completely disrupt thought.
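For anyone curious about the "NTP-like techniques" mentioned in the first comment, here is a minimal sketch of the standard four-timestamp calculation, followed by a helper that places metronome ticks on a shared time grid. The function names, the shared epoch and the numbers in the usage example are all hypothetical, and the offset estimate is only exact when the delay is the same in both directions. As the comments above point out, this only lines up the clicks. It does nothing about the delay on the audio you hear from the other performers.

```python
import math


def estimate_offset_and_delay(t0, t1, t2, t3):
    """NTP-style estimate from one request/response exchange (times in seconds).

    t0: request sent      (local clock)
    t1: request received  (remote clock)
    t2: response sent     (remote clock)
    t3: response received (local clock)
    Returns (offset, round_trip), where offset is remote clock minus local clock.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    round_trip = (t3 - t0) - (t2 - t1)
    return offset, round_trip


def next_tick_local(now_local, offset, tempo_bpm, shared_epoch=0.0):
    """Local-clock time of the next metronome tick.

    Ticks sit on a grid defined on the shared (remote) clock, starting at
    shared_epoch, so every participant who knows their own offset clicks
    at the same shared instant.
    """
    period = 60.0 / tempo_bpm
    now_shared = now_local + offset
    ticks_elapsed = math.floor((now_shared - shared_epoch) / period)
    next_shared = shared_epoch + (ticks_elapsed + 1) * period
    return next_shared - offset


if __name__ == "__main__":
    # Hypothetical exchange: remote clock 5 s ahead, 100 ms each way,
    # 20 ms spent on the remote side before replying.
    offset, rtt = estimate_offset_and_delay(100.0, 105.1, 105.12, 100.22)
    print(f"offset {offset:.3f} s, round trip {rtt:.3f} s")
    print(f"next tick at local time {next_tick_local(100.22, offset, 120):.3f} s")
```

Averaging several exchanges, or keeping the one with the smallest round trip, would usually give a steadier offset than a single sample.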