
Stable Diffusion 3 BEATS both DALL-E 3 and Midjourney v6

Hey everyone, we’ve got another mind-blowing announcement in the world of AI. I’ve known about this for a while now, since before OpenAI’s Sora was even released.

Stable Diffusion 3

Stability AI has released Stable Diffusion 3, the most capable AI image generator we have ever seen to date. The testing we’ll see today absolutely trumps DALL-E 3.

Stability AI just released Stable Diffusion 3, their most capable text-to-image model, which utilizes a diffusion transformer architecture for greatly improved performance on multi-subject prompts, image quality, and spelling abilities. And when they say better performance, improved performance, they aren’t joking. There is a waitlist to get access to this thing, which you can join here. But eventually, it is going to be open source, just like all the other Stable Diffusion models.

Prompt texts

Here’s the first image we see: epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says Stable Diffusion 3 made out of colorful energy. And it follows the prompt to a T; the spelling is on point, and it’s beautifully integrated into that anime art style.

Here we have another example: a cinematic photo of a red apple on a table in a classroom. On the blackboard are the words “go big or go home” written in chalk. It follows the prompt to a T yet again. The only thing I can see wrong with this image is that the stem on the apple is maybe a little weird, but other than that, it’s just absolutely, ridiculously good, with perfect spelling.

A painting of an astronaut riding a pig wearing a tutu, holding a pink umbrella on the ground. Next to the pig is a robin bird wearing a top hat, and in the corner are the words “Stable Diffusion.” Yeah, they didn’t add that in post, that’s just generated by the AI. So it’s just absolutely insane, the level of prompt detail that we’re getting here.

Not only are we getting the painting, the pig’s wearing the tutu, the astronaut’s on top holding the pink umbrella, robin bird wearing the top hat, and even correct spelling, all in the same photo. Just so you guys get an idea of how capable this model really is.

DALL-E 3

Let’s just try this in DALL-E 3 real quick. So here, guys, we can see DALL-E 3 inside of Microsoft Bing Image Creator. These images might be pretty decent, but they do not possess the same level of coherency that we see with Stable Diffusion 3.

In this example, we do have a pink umbrella, but it’s attached to the astronaut in a weird way. The pig is wearing the tutu, but the robin does not have a top hat. And where are the words “Stable Diffusion”? Nowhere to be found.

Stable Diffusion 3 better than DALL-E 3

Again, DALL-E 3 is very good, but it appears Stable Diffusion 3 is just better. Plus, it’s going to be released open source, meaning people can build off of it. This could be the biggest leap in image generation we have ever seen.

Just to show off some of the realism, we have a studio photograph close-up of a chameleon over a black background. And I mean, taking a look at these close-up details, this thing is just no slouch, absolutely competitive.

This uses a new type of diffusion transformer that is similar to Sora’s architecture, according to Emad, the CEO of Stability AI. So it takes advantage of transformer improvements and can not only scale further as the models get bigger, but can also accept multimodal inputs.

Sound to image

So we could potentially do sound-to-image, or something like that. That is crazy. Here we’ve got some even more amazing examples: a photo of a ’90s desktop computer on a work desk. On the computer screen, it says “welcome.” On the wall in the background, we see beautiful graffiti with the text “sd3” very large on the wall. Just perfect coherency.

It’s just absolutely ridiculous how far these image generators have come and how good they can get at adhering to prompts. Take a look at this lovely one: resting on the kitchen table is an embroidered cloth with the text “good night” and an embroidered baby tiger. Next to the cloth, there is a lit candle, the lighting is dim and dramatic.

Perfect coherence to prompt

I mean, absolute perfect coherence to this prompt and it just gets even more impressive. Check this one out, guys: three transparent glass bottles on a wooden table. The one on the left has red liquid and is labeled number one. The one in the middle has blue liquid and is labeled number two. The one on the right has green liquid and is labeled number three.

And we absolutely get that, and it’s pretty realistic: definitely on a wooden table, and it understands left, right, and center. There’s a lot going on here that it has to get correct. DALL-E 3, admittedly, can do this as well, unlike the last prompt that we saw, so they’re at least on the same level in terms of prompt coherency here.

Midjourney v6

For this specific prompt, Midjourney v6 is another story, though for the most part it’s really quite good, producing these beautiful, realistic results, especially this one down in the corner below. However, you can see one of our images here just isn’t quite right: the bottles are labeled 2, 3, 3. While there’s a larger focus on aesthetics here, Stable Diffusion 3 is going to be open source, and that means we can build models on top of it that are a little more aesthetic. It can be worked on, it can be built off of; that’s what open source is, and that’s why it’s so darn powerful.

The shelf life of these open source models is extremely long. There are people today who are still using SDXL. This is going to be such a game-changer. This is really the one that made my mouth hang open and made me go, “Oh my God.”

Prompt

A photo of a red sphere on top of a blue cube. Behind them is a green triangle. On the right is the dog, and on the left is the cat. Perfect response here. This is incredibly hard for any AI image generator to get correct. This is pretty much solid proof that Stable Diffusion 3 has better prompt understanding and coherency than any other image generator we’ve ever seen on the market.

DALL-E 3 isn’t as good

Just to show you guys, this is DALL-E 3, and it’s not even close. It is able to do the red sphere on the blue cube, and the green triangle is sort of behind them, but it’s a little messy and weird, and there’s no cat anywhere to be seen in the image. The next one has a similar thing going on, but we get two triangles instead and we only see dogs. Same thing going on here, plus a random spaceship. This one is probably the closest we get; we have a cat down here, but again, both of the others are dogs, so it’s just not exactly perfect. This is not even close to the level of adherence that we see in the Stable Diffusion 3 image. It’s just absolutely bonkers.

Nate in the comments here points out another image from DALL-E 3 and one from SDXL, just showing how much of a jump this is.

Midjourney detail to prompts

Oh, and by the way, guys, here is Midjourney for those of you wondering. It’s not even close. Not even slightly. It’s not able to have that same level of adherence to the prompt. Sure, Midjourney has very realistic and aesthetically pleasing images, but again, Stable Diffusion 3 is open source: you can take this model and make it one that is aesthetically pleasing, or one that produces more realistic images. You can fine-tune it, you can train it.

Stable Diffusion detail to prompts

Take a look at some more examples here, again from the CEO of Stability AI. First up, we have “Welcome to the Future,” a pretty surreal image to see generated: some meat-salami crab things dancing like no one’s watching. I would barely be able to tell it’s generated by AI, though it is in super low resolution. Then a nice image of a fighter jet; again, just more ridiculousness. We have three clowns sitting for dinner, and it says “Stable Diffusion” in the background. You get the idea: this thing is just a powerhouse.

It is unlike anything we’ve seen before, absolutely better at prompt coherency than DALL-E 3. The images really just keep coming, so if you want to see even more examples, they will all be linked down below. It really is just so damn cool.

Announcement

So like I said, guys, I did get a sneak peek at this beforehand, so I knew it was coming. Trust me, this one was not easy to keep from you guys. Taking a deeper dive into their announcement, we note that the model is not yet broadly available. I did check, and I don’t have any specific access yet, but I might get access before you guys do. So I’ll try to get access as soon as possible and do a full testing video for you. But yeah, there is a waitlist for an early preview, as with previous models; it’s crucial for gathering insights to improve the model’s performance and safety ahead of a full open-source release.

Democratization of quality AI access

The models currently range from 800 million to 8 billion parameters. This approach aligns with the core values of Stability AI, namely the democratization of AI access, providing users with a variety of options for scalability and quality to best meet their creative needs. The democratization of creativity: they want this thing to be able to run on people’s computers at home, entirely for free, and I really respect that outlook and their attempts to do it, because you can tell the goal here is not to get rich and make a ton of money. They have a value here, and that is the democratization of quality AI access.
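To put that 800 million to 8 billion parameter range in perspective, here is a rough back-of-envelope sketch (my own illustration, not Stability AI’s numbers) of how much memory just the model weights would need at different numeric precisions. It ignores activations, text encoders, and everything else, but it shows why a smaller model plausibly fits on a home GPU:

```python
# Rough weight-only memory estimate for a model, ignoring
# activations, text encoders, VAE, and runtime overhead.
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory in gigabytes to hold the weights alone."""
    return num_params * bytes_per_param / 1e9

# Smallest and largest announced SD3 sizes, at common precisions.
for params, label in [(800e6, "800M model"), (8e9, "8B model")]:
    for precision, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
        print(f"{label} @ {precision}: ~{weight_memory_gb(params, nbytes):.1f} GB")
```

At fp16, the 800M model needs only about 1.6 GB for weights, well within reach of consumer graphics cards, while the 8B model needs roughly 16 GB, which is why offering a range of sizes matters for running these things at home.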

AI image generator Stable Diffusion 3

This really is a significant game-changer. Unequivocally, guys, we can say this AI image generator is the best one we’ve ever seen in terms of prompt understanding and text generation. It is leagues above the rest, and it’s truly mind-blowing.

As I mentioned earlier, we might get more realistic details out of Midjourney, but guess what: Midjourney is not open source, and you have to pay for it. This will be free when it releases as open source. People will be able to use it commercially and build upon it, meaning you will be able to have super hyper-realistic models that can also produce text and deliver insane prompt coherency. This is a new level of architecture; this is something we haven’t seen before.

Get excited because the future is going to be awesome, and I think that 2024 is going to be the year of Stable Diffusion 3. I don’t know if anything’s going to be able to top that in terms of image generation this year.
