For a while, like many people, I’ve been loosely tracking the developments of AI technology in the creative space. I tried a few prompts in DALL-E 2, played with Lensa AI’s app filters during that one week when it took over social media, etc. This was all very casual, and it didn’t seem like anything had quite matured as a serious creative force… yet. But it was coming, this tsunami that promised to do what I do better than me, with just a couple of clicks, and I wanted to be out there in the water before I felt like the tide was really going out.
That moment came two weeks ago, when I saw Adobe release a beta version of Photoshop with image generator tools for extending photos, filling selections, and filling selections with text prompts. The showcased results, at first glance, were not only incredible, but had very concrete practical application in contrast to, say… making a random image of Tom Brady and a kitten having a picnic. That’s fun, but it’s not anything that someone might pay for. Reframing a photo so that the subject’s face is in a more pleasing composition, on the other hand, now that was a practical application that clients would begin to expect, and it made me feel like, oh god, it’s here, get to the boat.
So I experimented with the Photoshop tools, and they are incredible in how accurately they assess a given composition for generating extensions or removing unwanted subjects. However, as of this writing, the resolution of what they give you back is often lower, or blocky, blurry, etc. Basically, the more and bigger you go, the more concept art/draft painting it gets. The other issue is that the generator processes in “the cloud” and aggressively flags things as “violation of community guidelines,” even really benign stuff, which, when you think about a massive, wealthy corporation needing to protect itself from “this is offensive” litigation, makes sense. …but still… this genuinely throttles the tools and their usefulness. Beyond that, though, it was still such equal parts promising and terrifying that I had to go further.
My next step was to try Midjourney, a web-based image generator that was among the original crop to break out on the scene, for some straight-up text-based art making.
Midjourney required using chat prompts within another app, Discord, and while I already had Discord and got up to speed on it fairly quickly, it’s not the most accessible route for casual curious types.
I put in some fantasy art related prompts for a project I’ve been working on, and quickly got some really beautiful results back. But also… not what I wanted, exactly. I adjusted my prompts, read up on best practices, and continued testing. Midjourney allows you to create sub-variations on your variations, and I tried a lot of those, too, but I kept getting… not what I wanted, exactly. The main issue was that I was trying to create storytelling scenes, with multiple characters and very specific and unique settings, which is honestly asking a lot, especially once I really began to understand how all this works under the hood, but we’ll get to that later…
I refined and refined my approach, and as the consistency of the generations started improving, I began to really see what it could not do. Namely, multiple distinct main subjects in one scene. For instance: “A woman being led by an armored guard toward a dungeon full of prisoners” consistently gave back an image where there were multiple people, but they were all guards, and all wearing armor.
Occasionally I’d luck into a random image that aligned more with my vision, but also… not what I wanted, exactly. And further variations based upon those rare ones ultimately looped back around into “Oops, all heroes!” Furthermore, while the guidelines flags in Midjourney are not “grossly blanket to a near-random level” like Adobe’s, and are generally accommodating and appropriate, again, I wanted some HBO, Game of Thrones, graphic-novel, blood-on-the-ground, after-the-battle kinda stuff, and it won’t do that. You can randomly see a “bloody sword,” but you can’t request it.
Also at this point, I ran out of free generations, and yet I had only just begun to understand, and I wanted more understanding. And I wanted… the unrated, unfiltered, unencumbered original recipe sauce. I knew it HAD to be out there. Especially because of the way Adobe informs you of flags: it’s not when you prompt, but at the return, i.e., it flags what it HAS MADE, and says, “oh, we can’t actually show this to you.” But it was there… and so, I wanted to see if there was a generator that I could install on my own PC, no flags, no filters, no rules, just right…
Enter “Stable Diffusion,” as my proverbial bloomin’ onion. One of the OGs, this generator had been rumored to be the baby-daddy of Midjourney, and it’s got very “open-source, chaotic, nerds making extensions and plugins just to impress other nerds” kinda vibes. And I could install it locally, on my own computer. So I did…
If investigating Midjourney was the rabbit hole from my Photoshop experiments, Stable Diffusion was like going down to the “Limbo” level in Inception, where you get so lost in the possibilities that you forget what anything even means or why you came there in the first place. I spent hours running applications at the Windows command line, editing Python code, training stacks, requesting API keys, applying LoRAs and LyCORIS models, all kinds of things that I had only the vaguest fingers-crossed idea of how they worked, but I kept going deeper and deeper, training models, adding scripts and extensions, ingesting jillions of YouTube, Reddit, and Stack Overflow research through my eyeballs… All, I could barely remind myself, in an effort to get my “free” bottomless image-mimosas app to make stuff that looked like Midjourney, but also, “exactly what I wanted.”
The results, well: it’s pretty rough out of the box, and you quickly learn that the people behind the paid generators have done a ton of work to finesse your prompts so the results look exciting on the first go-round. But with some effort, you can get really pleasing images out of Stable Diffusion, and you can get them consistently. And you can paintbrush a spot, and generate for just that spot into another image, and you can upscale, and you can reframe, and you can add custom-trained data, and looks, and styles, and it goes on and on and on, well past the point where you are confident of what any one thing is actually contributing to a given result. And through this process of experimenting and trying new approaches and scripts and ideas, I ran thousands of iterations. And yet it was still… not what I wanted, exactly.
This all brought me back to Photoshop, where I used my long-cultivated skills to modify these images, add external elements & fx, touch up paint, and prep them to go into Cinema 4D, so I can add the final touches that will better achieve what I want, exactly.
After plunging myself down to the “pearl-diver” depths of this AI ocean, time, life-obligations, and a general desire to return to the tactile elements of living, [prompt: sun, sky, trees, birds, friends:1.4, laughter, wind, scents, computer:-0.8] have brought me back to the surface.
And so, emergent and reflecting upon the trip, I have a few thoughts:
The biggest one is that, contrary to popular conception and marketing, etc., there is no actively thinking “intelligence” involved in these generators.
DALL-E, Midjourney, Stable Diffusion, Leonardo-AI… basically they all do the same trick, starting with a randomly numbered noise pattern, and assigning the user-chosen words (text prompt) to each pixel. Then the noise pattern diffuses into certain globs of shapes (think of it like a word cloud where important words get bigger) driven by a set formula, re-referencing this word-salad against code created from a repository of several billion 512px x 512px photos that some group in Germany collected a couple years ago, that was then all keyword tagged by actual humans, mostly in cheap labor markets far across the world, and then blend-mask-shaping all of that down for a user/web-app defined number of steps. There’s certainly more to it, but the basics are: keywords > match set of photos within repository > apply noise > blur & enlarge noise > blend photos using keywords (at pixel level color and luminance value), repeat x times.
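For the code-curious, here is a toy sketch of that “start from numbered noise, then refine for x steps” loop. It is only an illustration under heavy assumptions: `toy_guidance`, `toy_generate`, and `distance` are names I made up, the “image” is just a short list of numbers, and the guidance function is a stand-in for the trained model rather than anything a real generator actually computes.

```python
import random

def toy_guidance(prompt, size):
    """Made-up stand-in for the trained model: turns a prompt into a fixed target pattern."""
    rng = random.Random(prompt)  # a string seed gives a repeatable pattern per prompt
    return [rng.random() for _ in range(size)]

def distance(a, b):
    """Sum of absolute per-pixel differences between two flat 'images'."""
    return sum(abs(x - y) for x, y in zip(a, b))

def toy_generate(prompt, seed, steps=20, size=16, strength=0.2):
    """Start from seeded noise, then nudge every pixel toward the guidance, step after step."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in range(size)]  # the "numbered noise"
    target = toy_guidance(prompt, size)
    for _ in range(steps):
        # Each pass blends the current pixels a little closer to what the prompt asks for.
        image = [px + strength * (t - px) for px, t in zip(image, target)]
    return image

if __name__ == "__main__":
    prompt = "armored guard in a dungeon"
    target = toy_guidance(prompt, 16)
    for steps in (1, 5, 20):
        result = toy_generate(prompt, seed=42, steps=steps)
        # Prints how far the result still is from the guidance pattern.
        print(steps, round(distance(result, target), 4))
```

Running it prints a distance for 1, 5, and 20 steps, and the number shrinks as the step count goes up, which is the toy version of an image getting less noisy and more “on prompt” the longer you let it cook.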
To put it another way: oftentimes in the past I have altered photos of myself, or friends, or family, removing blemishes or ex-spouses, adding fun elements, or insetting people into movie stills and other artwork. The way to go about doing that was by first hopping on a web browser, typing words related to the elements I wanted for the composition, and then looking through the Google Images results that came up, and then selecting shots that matched the lighting, composition, etc., and then masking, changing color balance, contrast, etc. to merge the downloaded images into the original photo.
The generative “AI” image processors basically do all of that middle-to-end work, with exponentially greater speed and precision. What’s great is that because it’s actually based on numbered noise, and not an artificially intelligent artist, you can freeze a specific look that a given noise-number generates, and then adjust other values so that that specific look now applies to your new words, and the randomness goes away.
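The “freeze the noise-number” idea is just ordinary seeded randomness, and a few lines show it. This is a minimal sketch, not any generator’s actual code: `starting_noise` is a hypothetical helper, and real tools expose the same idea as a seed setting.

```python
import random

def starting_noise(seed, size=8):
    """Hypothetical helper: the noise pattern depends only on the seed number."""
    rng = random.Random(seed)
    return [round(rng.gauss(0.0, 1.0), 3) for _ in range(size)]

frozen = starting_noise(1234)
print(frozen == starting_noise(1234))  # same seed, so the same pattern comes back
print(frozen == starting_noise(1235))  # a different seed gives a different pattern
```

Because the starting pattern is fully determined by the number, keeping the seed fixed while swapping the words means every run begins from the same noise, which is why a look can carry over from one prompt to the next.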
Also, understanding that these tools are not thinking about how to depict your prompt, but simply executing code based on it, really helps when you run into the various glitches, snags, and known issues that come with the territory:
Take, for instance, “text within the image” (not your prompt). Let’s say you want to put a famous person on a roadside billboard. The person looks ok in your generated image, but the signage is a mess of letter-ish looking lines.
If we think of the generator as an intelligent being that can understand what you type, and generate amazing things, we expect it to understand how to write phrases as well, obviously. But somehow it can barely pull off even the most basic English word, and so this seems like a really strange problem to occur.
But when we understand that there is no thinking, there is just prompt-based image search/blend/refine, then suddenly it’s a lot easier to see why those letters come out the way they do – you’re basically seeing what 20 photos of billboards look like baked together…
The same thing goes for fingers. You’ll often get back images with 5+ digit “tentacle hands,” and this is because the program is just looking at a bunch of images marked with “hands” and fusing them together. Well, our fingers are almost never static, and when we additionally consider all the various angles and such of the photos that the program is referencing, it becomes very understandable why they are so frequently weird looking.
These issues and others like them are being addressed more and more, but I’ve decided to write about it all now, because the foundation of the steps is going to stay the same for a while. These generators are basically in the “fix it in post” stage, where coders and trainers will come up with ways to patch an issue, without “rewriting” or “reshooting” the core of how everything works.
My other main thought basically loops back to utility. I went down this road because I felt like, “I’m going to need to know, this is the future.” And it is. But what future?
I definitely have extensive crystal ball thoughts, but those involve the chat, research, and task-based AI tools, which I have not discussed in this post, so I’ll save the big picture for another one.
But within the specifics of the above, I do expect this current revolution to sort of be the baseline for the next decade or so. Right now, the foundation is set, the contributors have all built their initial end-user products for market, and we’re in the early adoption period. You, me, and every other “end-user” will take to certain things, and not to others. Obviously, marketing will consistently push “revolutionary” v2, 3, 4, 5, etc. every 3 months, but all of that will be, in truth, iterative and evolutionary. Like fission, or rockets, or the internet, the foundation of how all of this works is something that has been “figured out” a certain way, which is taking hold.
In fact, I would mostly liken this to nuclear fission power: exponentially more powerful than any power source before it, truly “game changing” on a global scale, but we didn’t make “Ion energy” immediately after, and fusion, or in this case, true intelligence, is still sort of an ever-dangling carrot.
New standards will get set, workflows, expectations, all the rest, and soon new industries and careers will flourish and attract workers to meet those expectations, while others fade away. I see a small culling of workers happening because of how fast it is, but at the same time, it just raises the bar, and so it doesn’t mean that an accountant is now suddenly any more capable of being a graphic designer with one click than they were before, if that graphic designer applies their talent to these tools. We’re all going to get accustomed to what “basic/random” AI art looks like, and people who couldn’t discern, or didn’t care to strive for, the best-looking art before are still not going to now, while those that do will generate a bunch of things and say, “eh, not what I wanted, exactly” and go in and improve it. And so the human talent to curate, adjust, and revise is still going to be very much a thing.