When the page loads, you immediately see a drawing of a girl. She has silver hair, and is wearing something that calls to mind a Japanese serafuku. Or, rather, your brain starts to interpret it as a schoolgirl uniform, until you catch up to your eyes and realize that the flesh-colored blotches on her chest cannot be hands. In fact, she doesn’t have hands; her arms simply fuse into each other, leaving you wondering why a service that bills itself as ‘AI’ would use a picture that makes it so very obvious that their program has no understanding of human anatomy.
This is the experience of opening the webpage for EndlessVN, a service that seeks to do for visual novels what Novel AI does for literature. Interestingly, while most services would stick to a single neural net, EVN seeks to recreate the experience of a visual novel by stapling several programs together, with different programs handling the text, the pictures, and the music. I’ve tried out the free version, in case you’re wondering, but it was too slow for me to do anything with. Before I talk about EndlessVN itself, though, I want to talk about picture generators.
Picture generators are much like text predictors, in that both allow the user to probabilistically generate a kind of data. But while text predictors spit out words, picture generators fill pixels with colors, using the colors of the surrounding pixels to determine the RGB value of each individual pixel.
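To give a crude sense of what I mean (and this is a toy sketch, nowhere near the scale or math of a real generator; every name in it is invented), imagine filling in an image one pixel at a time, with each new pixel’s color sampled around the average of the neighbors that have already been filled in:

```python
import numpy as np

def fill_image(height, width, noise=20.0, seed=0):
    """Toy generator: each pixel's RGB value is sampled around the average
    of its already-filled neighbors (the pixel above and the pixel to the
    left), plus some random noise."""
    rng = np.random.default_rng(seed)
    img = np.zeros((height, width, 3))
    img[0, 0] = rng.uniform(0, 255, 3)  # seed the first pixel with a random color
    for y in range(height):
        for x in range(width):
            if y == 0 and x == 0:
                continue
            neighbors = []
            if x > 0:
                neighbors.append(img[y, x - 1])   # pixel to the left
            if y > 0:
                neighbors.append(img[y - 1, x])   # pixel above
            mean = np.mean(neighbors, axis=0)
            img[y, x] = np.clip(mean + rng.normal(0.0, noise, 3), 0, 255)
    return img.astype(np.uint8)

picture = fill_image(64, 64)  # a 64x64 image of soft color gradients
```

Run on its own like this, it only ever produces smears of color bleeding into one another; the interesting behavior comes from training the sampling step on enormous piles of existing images, which is where the quirk below comes from.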
A quirk of this method is that, while a human is good at creating an impression of a person using nothing but lines and space, and would be hard pressed to create a photo-realistic face of someone who doesn’t exist, the program is the opposite. The program relies on the fact that a photo will have patterns of texture on a human’s skin, hair, and clothing to tell where a hard edge would be, like where the face ends in a picture and the wall behind it begins. This isn’t possible when it’s imitating drawings, which are dominated by solid blocks of color, whether those colors are supposed to represent the foreground or the background.
But even putting aside current image generators’ difficulties replicating the anime style, I don’t think Endless VN needs a fixed-size block of pixels for all of its images. That works fine for backgrounds, but for the characters, I feel that you need a completely different paradigm.
The fundamental problem with getting a program to draw a character is that you want individual drawings to be consistent. If a character is blonde and has green eyes, you want every picture of them to be blonde and have green eyes. If a character is wearing clothing for a particular scene, you want them to wear the same clothing for the entire scene. And if you want a character to have a cowlick coming off of the back of their head, you want every picture of them to have a cowlick coming off the back of their head.
In other words, there’s a difference between creating a design for a character, and making drawings of that character in various poses. I suspect that you would need different kinds of programs for each. The first would be concerned with generating variations around a set of attributes, such that not every blue-eyed beauty with freckles looks like every other blue-eyed beauty with freckles, and the second would focus on moving the body parts around, and giving the drawings some facade of emotion. I think that both of these would need some understanding of the internal structure of the human body, even if the end result looked like drawings, and I rather suspect that such an internal structure would need to be hard-coded.
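If I had to sketch that split (and this is pure speculation on my part; every name and field here is invented for illustration), it might look something like a design generator that fixes a character’s attributes once, and a drawing step that only ever reads from that fixed design:

```python
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class CharacterDesign:
    """The attributes that have to stay consistent across every drawing."""
    hair_color: str
    eye_color: str
    outfit: str
    quirks: tuple  # e.g. ("cowlick on the back of the head",)

def generate_design(archetype: str, seed: int) -> CharacterDesign:
    """Program one: generate variations around a set of attributes, so that
    two characters built from the same archetype don't come out identical."""
    rng = random.Random(seed)
    hair = rng.choice(["blonde", "silver", "black"])
    quirks = rng.choice([("cowlick",), ("freckles",), ()])
    return CharacterDesign(hair, "green", f"{archetype} uniform", quirks)

def draw(design: CharacterDesign, pose: str, emotion: str) -> str:
    """Program two: produce one drawing of a fixed design. A string stands
    in for the actual image here."""
    return (f"{design.hair_color} hair, {design.eye_color} eyes, "
            f"{design.outfit}, {pose}, looking {emotion}")

heroine = generate_design("serafuku", seed=42)
print(draw(heroine, "arms crossed", "annoyed"))
print(draw(heroine, "waving", "cheerful"))   # same hair, eyes, and outfit as above
```

The point of the split is that the second program can never contradict the first: the drawing step only reads the design, it never reinvents it.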
But hard-coding internal structures into neural nets isn’t an idea that’s limited to pictures. As it stands, GPT doesn’t really know when someone is talking, just when words are between quotation marks. If it were possible to hard-code some idea of what a character is into it, and to let it create personalities in the same way as our first program above, it would bring us so much closer to the dream of a program that can simulate an entire, arbitrary world.
Some GPT services have already started to put character profiles right in the training data, so that when the user goes to describe someone in the form of [name: / appearance: / personality:…], the program has something to latch onto. Even so, text predictors still have difficulty keeping characters straight. And as with understanding the word ‘not’, I suspect that this is for mechanical reasons that no amount of training data can actually overcome.
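To make that bracketed format concrete, a profile might look roughly like this (the exact field names and layout vary by service; this one is made up purely for illustration):

```python
# A made-up character profile in the bracketed style some services use;
# the field names are illustrative, not any particular service's spec.
profile = """[
name: Yuki
appearance: silver hair, green eyes, serafuku
personality: soft-spoken, stubborn, terrified of thunderstorms
]"""

story_so_far = "Yuki stepped off the train into the rain."
prompt = profile + "\n\n" + story_so_far  # the profile rides along with every request
```

But to the predictor this is still just more text to condition on; nothing in its mechanics enforces that Yuki keeps her silver hair three scenes later, which is exactly the difficulty I mean.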