On Zero-Shot Multi-Speaker Text-to-Speech Using Deep Learning