Maark Abbott, on 21 January 2023 - 08:20 AM, said:
That's all they are and all they will be - a novelty. A computer program mashing two things that someone fed it together cannot be creative, will never have the capacity to be creative. You might get some amusing noise, but it's the Shakespeare's Monkeys at Typewriters scenario.
Again, that's not how generative neural networks work: they're not 'mashing two things that someone fed it'. They learn the underlying patterns and then generate novel examples, generally starting from noise. Typically they infer which constituent properties are significant (in visual art, for example, brightness or 'weirdness', along with many properties and patterns that humans can't easily articulate), treat each property as a dimension, and construct an n-dimensional latent space (Stable Diffusion's is more than 500-dimensional).
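To make that concrete, here's a toy sketch in Python. Everything in it is a stand-in, not any real model's API: the 512-dimensional space, the 'brightness'/'weirdness' labels, and the decoder are purely illustrative. The point is just that generation starts from random noise in the latent space, and each dimension corresponds to some inferred property.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 512  # stand-in size; real models' latent spaces vary

def decode(z: np.ndarray) -> str:
    """Stand-in for a trained decoder that maps a latent vector to an
    artwork; here it just reports two of the vector's 'properties'."""
    return f"sample(brightness={z[0]:+.2f}, weirdness={z[1]:+.2f})"

z = rng.standard_normal(LATENT_DIM)  # a novel point drawn from pure noise
print(decode(z))
```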
But you did identify two obvious ways it can be creative: through randomization within the bounds of a style or genre, and by combining different styles or genres.
If trained on the creative evolution of a specific style, it can continue that creative evolution. It can also learn from a wide variety of genres to construct a general space of human 'musical possibility'. So instead of being uniformly random, it can randomize in a way that fits the inferred trajectory through its space of musical possibility.
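As a hedged sketch of what 'randomizing along an inferred trajectory' could look like in latent-space terms (again, everything here is illustrative, not a real system): take latent vectors for an early and a late work in a style, extrapolate past the latest one along the direction of evolution, and add only a little local randomness rather than sampling uniformly over the whole space.

```python
import numpy as np

rng = np.random.default_rng(1)

def continue_style(z_early, z_late, step=0.5, spread=0.1):
    """Extrapolate past the latest work along the style's inferred
    direction of evolution, then perturb it slightly."""
    direction = z_late - z_early          # inferred direction of change
    z_next = z_late + step * direction    # continue the trajectory
    return z_next + spread * rng.standard_normal(z_next.shape)

# latent vectors for an early and a late work in the style (toy values)
z_early, z_late = rng.standard_normal(512), rng.standard_normal(512)
z_new = continue_style(z_early, z_late)
```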
Computers are so fast now that the problem isn't so much randomly generating the complete works of Shakespeare (in a world where Shakespeare never existed) as recognizing when they have done so. One option is to put vast quantities of output into the world, have people rate it (explicitly or implicitly), and train on that feedback.
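A minimal sketch of that release-and-rate loop, with every name hypothetical and standing in for a real generator and a real rating pipeline:

```python
import random

def generate_candidate():
    return random.random()          # stand-in for a generated piece

def get_rating(candidate):
    return candidate                # stand-in for stars, listens, replays

def feedback_round(n_candidates=1000, keep=100):
    candidates = [generate_candidate() for _ in range(n_candidates)]
    rated = sorted(((get_rating(c), c) for c in candidates), reverse=True)
    # the top-rated pieces become training data for the next round
    return [c for _, c in rated[:keep]]

next_training_batch = feedback_round()
```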
More generally, neural networks can approximate essentially any function, so they can be optimized for any measure of 'creativity' that can be scored against patterns inferred from the training data. (Neural-network-based chess engines, for example, excel at a kind of creativity, coming up with effective moves no human would have thought of. They get there partly by playing against each other a vast number of times, with the objective essentially being 'win'.)
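And a toy version of the self-play idea, purely to show the shape of the loop: the 'engines' here are just single numbers, and the sole objective is 'win'.

```python
import random

def play_game(a, b):
    """Higher 'strength' wins more often; +1 if a wins, -1 if b wins."""
    return 1 if random.random() < a / (a + b) else -1

def self_play(rounds=10_000, lr=0.01):
    a = b = 1.0
    for _ in range(rounds):
        result = play_game(a, b)
        a = max(0.1, a + lr * result)   # reinforce whichever side won
        b = max(0.1, b - lr * result)
    return a, b

print(self_play())
```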
However, they may be limited by what the training data includes; ironically, that might exclude new technologies or aural aspects of human experience that have yet to be incorporated into music. Given wide enough variety, though, including exposure to the broad range of aural phenomena that humans have liked in the past, it could infer much of the space of possible sounds that humans might like, or might want to incorporate in music (as so many composers already have, integrating samples, noise art, and chanted, declaimed, or spoken parts).
But actually simulating even a single human brain with anything approaching full accuracy is beyond current computers, and so is simulating the whole of human culture and experience as it changes through time. How much actually needs to be simulated to predict effectively what humans will like, however, is an open question. Again, though, an AI can adapt dynamically to changes in taste by taking in feedback on its ongoing releases: what people listen to (and keep listening to), what they rate highly, and so on.