I'm not using a text-to-image framework! Just Python, PyTorch, and CLIP. This is also a very golfed example, so it's meant to be hyper-specific and not really generalizable.

Text-to-article you can do with off-the-shelf GPT models (search: Hugging Face Transformers). Text-to-music is a very new, emerging field: some approaches use diffusion models, some use transformers, some use models that pick out samples, and some use mixes! There isn't really a specific one I'd point towards for SOTA at the minute, but check out MuBERT, it's very cool.
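For the text-to-article part, here's a rough sketch of what "off-the-shelf GPT" could look like with the Hugging Face `transformers` text-generation pipeline. The `gpt2` checkpoint, the `Title:` prompt format, the sampling settings, and the helper names `build_prompt` and `generate_article` are all my own assumptions for illustration, not anything from the golfed example, so swap in whatever model and prompt suit you.

```python
from typing import Callable, Optional


def build_prompt(title: str) -> str:
    # Hypothetical prompt format: a title line followed by a blank line.
    return f"Title: {title}\n\n"


def generate_article(
    title: str,
    generator: Optional[Callable] = None,
    max_new_tokens: int = 200,
) -> str:
    """Generate article-like text that continues from a title.

    `generator` is any callable that behaves like a transformers
    text-generation pipeline. If none is given, a GPT-2 pipeline is
    created (assumption: `transformers` and a backend such as PyTorch
    are installed, and the checkpoint can be downloaded).
    """
    if generator is None:
        # Imported lazily so the helper can be used with a custom generator
        # even when transformers is not installed.
        from transformers import pipeline

        generator = pipeline("text-generation", model="gpt2")

    prompt = build_prompt(title)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample instead of greedy decoding
        top_p=0.95,      # nucleus sampling; an arbitrary choice here
    )
    # The pipeline returns a list of dicts; by default the text includes the prompt.
    return outputs[0]["generated_text"]
```

Calling `generate_article("Why cats purr")` downloads GPT-2 on first use and returns the prompt plus a sampled continuation. Output quality from plain GPT-2 is pretty rough, so treat this as a starting point only.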