The nightmare fuel you see in the preview image for this story was created when I asked a sophisticated neural network, “What would a malevolent artificial intelligence think about high-end audio?”
I’ll leave it to you, dear reader, to interpret the answer, but while you’re chewing on that, perhaps it’s worth exploring how and why I arrived at a place where I’m asking neural networks about our hobby.
Apologies for the sausage-making, but there’s been a disturbing trend in the domain of online journalism for years now that’s reaching a boiling point. Lawyers and other copyright trolls are purchasing the rights to images and suing anyone who uses them, or anyone who has ever used them, for ridiculous sums of money.
I’ve heard rumors that a former employer of mine was recently hit with a multi-thousand-dollar lawsuit because a previous owner of the publication used a marketing image from a screen manufacturer, and said manufacturer had Photoshopped a now-copyrighted image onto the screen somewhere around a decade ago. It’s a tangled mess, I know—the short version is that any online publication is wise these days to shoot its own photographs or use things with clear Creative Commons licensing.
Or, you know, have a team of illustrators. But needless to say, as wonderful as our own art director, Karen Fanas, is, she doesn’t have time to whip up a new illustration every time I write a story. She’s done some amazing work for me in the past—like when I asked her to render an image of some old-school balancing scales with a receiver on one side and a stack of separates on the other. But what would I have even asked her to illustrate to accompany my recent editorial about the benefits of lighting control in a two-channel listening room?
To answer that question in the bygone era that was a month ago, I turned to a rudimentary human-in-the-loop neural-network workflow for translating text to images—specifically, DALL-E Flow running on a Google Colab machine. Its machine-learning algorithms are at least a year behind the state of the art, and its UI is a hair-pulling, fit-pitching nightmare, requiring you to write or modify code that looks something like this just to advance from one step to the next:
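Here’s a rough reconstruction of that workflow from memory, assuming the circa-2022 DocArray-based DALL-E Flow client; the server address and prompt below are placeholders, not the ones I actually used:

```python
# Each stage of the DALL-E Flow pipeline is advanced by hand-editing
# and re-running notebook snippets like these.
from docarray import Document

server_url = 'grpc://dalle-flow.example.com:51005'  # placeholder address
prompt = 'a well-made stereo receiver'

# Step 1: generate a grid of candidate images from the text prompt.
candidates = Document(text=prompt).post(
    server_url, parameters={'num_images': 8}
).matches
candidates.plot_image_sprites(fig_size=(10, 10), show_index=True)

# Step 2: eyeball the grid, then hand-edit this index to pick a favorite.
fav = candidates[3]

# Step 3: push the favorite through the diffusion stage for variations.
diffused = fav.post(
    server_url,
    parameters={'skip_rate': 0.5, 'num_images': 9},
    target_executor='diffusion',
).matches
diffused.plot_image_sprites(fig_size=(10, 10), show_index=True)
```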
Good results are hard-won, the process is labor-intensive, and when things go wrong, they go laughably wrong. But that cobbled-together assemblage of neural networks still allowed me to illustrate last month’s story with a couple of hours’ work, and I didn’t have to worry about copyright trolls.
The times they are a-changin’ quicker than I can keep up with
Fast-forward a month, though, and I’ve gained access to one of the most talked-about modern (i.e., circa 2022) neural networks in the field of text-to-image generation: Midjourney. You may have heard that name bandied about as an alternative to DALL-E 2 and Imagen, if you care at all about neural networks. It’s much more advanced technology than DALL-E Flow, and it’s a lot easier to operate since you’re mostly interacting with a chatbot in Discord in a sort of conversational way. And the quality of images I’m getting from Midjourney is to DALL-E Flow what the PlayStation 5 is to the Atari 2600.
After a short beta test, I purchased a subscription with no intention of writing an article about it. I merely planned on using Midjourney to illustrate the occasional editorial that I couldn’t decorate with my own photography or safe stock images.
Like any tool, though, to understand it I needed to play with it, to try to break it. And so I started my time with the Midjourney chatbot by simply asking it to /imagine (the command that takes a text prompt, which can also include links to inspirational images and parameters controlling things like aspect ratio, visual style, and focal length) a “well-made stereo receiver.”
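For the uninitiated, that interaction looks something like the line below, typed straight into Discord. The parameter values here are illustrative rather than a transcript of my exact session: descriptive cues like lens focal length ride along in the prompt text itself, while trailing switches like --ar set the aspect ratio.

```
/imagine prompt: a well-made stereo receiver, studio product photo, 50mm lens --ar 3:2
```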
The first results were promising. And it isn’t hard at all to take results like that and iterate on them, letting the system know that the upper left, say, is closer to what you’re looking for, but you’d like the algorithms to improvise with that attempt as a new center of gravity. I tried a few more, asking Midjourney to design me a stereo system made of metal and wood. The finished designs were, to say the least, fascinating.
I asked it what it thought a rich person’s stereo system might look like, and it gave me the collage below as a quartet of images to use as starting points for future iteration. Notice, though, that it’s not painting existing products. You might see knobs reminiscent of a beloved brand, or faceplates that look familiar if you squint at them sideways. Surely, this evokes something like the Platonic ideal of a fancy high-end system. But none of this is simple copypasta.
After a bit more fiddling, I started sharing some of the images I was conjuring with Brent Butterworth, editor of SoundStage! Solo and co-host of the SoundStage! Audiophile Podcast. I assumed that, like me, he would be most interested in the potential for illustrating stories that might otherwise be difficult to illustrate, especially those dealing with more nebulous topics that might benefit from more abstract imagery.
Like, without wasting our art director’s precious time or, you know, getting a degree in illustration, or potentially violating copyright, how could I develop an engaging graphic to accompany an article about room acoustics? It used to be that I’d whip up something functional and rudimentary in PaintShop Pro and hope my embarrassment subsided soon enough. Now I can prompt and explore and manipulate the imagination of a neural network and get little works of original art that still make a point.
Brent did not take the bait I wanted him to take, though. I half expected him to say, “Hey, I’m writing an article about high-current amps! How would this thing illustrate the concept of electrical current?” Or maybe, “Get it to draw some high-end headphones!”
But nope. Brent being Brent, his first reaction was to ask me for a composite sketch of your typical high-end audiophile journalist. And what I got back was as hilarious as it was telling.
From there, Brent goaded me into prodding this neural network to get a sense of what it thinks of our hobby in general. And the results of these investigations were equally telling but not quite as funny. The more I explored, the more I realized that A.I. thinks our hobby is pretty boring and stagnant. Ask it for images of stereo integrated amplifiers, and it returns a nigh-endless stream of cookie-cutter designs: black and brown and silver boxes with knobs on the front. Ask it to render some audiophile loudspeakers, and they’re all lookalike monoliths—rectilinear cabinets with some circular drivers thrown in haphazardly. (Mind you, Midjourney’s understanding of driver placement isn’t much worse than what was the norm in the 1970s and ’80s, but it’s certainly not any better.)
Are we selling stereos to robots now? Who cares?
You could question the validity of this sort of navel-gazing, of course, and probably with good reason. But allow me to defend these explorations, if you will. Part of my goal here at SoundStage! Access is to figure out how to reach people who aren’t already audio hobbyists—to convince them that there’s real value in having a good stereo system instead of relying entirely on TV speakers or cheap earphones or smart digital assistants like Amazon Echo for music listening.
And an important part of evangelizing any area of interest is understanding how it’s already perceived. There are some of you, I’m sure, thinking, “Aha! But who really cares what a robot thinks about hi-fi? We’re not trying to sell hi-fi to robots!”
True, but when we ask a neural network like Midjourney or DALL-E 2 or Imagen (once that one’s accessible) what it thinks something looks like, what we’re really asking is what general consensus arises from the training data fed to that neural network. The differences between these systems lie not so much in the patterns they recognize as in how they recombine those patterns and how many parameters they bring to bear (usually hundreds of millions or billions).
It’s important to remember that, as powerful as these artificial intelligences are, they are not sentient. Language processing and pattern recognition are not the same as actually comprehending language. Knowing how to build a model that maps words and images into an embedding space with hundreds of dimensions, and then using it to generate a wholly original image that nonetheless captures the essence of something in the real world, is not the same as having an aesthetic sense, and it’s surely not the same thing as being an artist.
When we ask Midjourney what it thinks a high-end audiophile writer looks like, it’s not giving us its impression in response. It’s sort of synthesizing the aggregate opinion of whatever images were fed to it that happened to be labeled with some combination of words that evoke “high end,” “audiophile,” and “writer.”
So when you ask an A.I. what it thinks about hi-fi, what it’s really telling you is what the group consensus is among the people who created the links between certain images and certain words. In other words, what it’s telling you is what the population of the internet generally thinks of hi-fi, at least according to the sampling of data it was trained on.
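If you want to poke at those word-image links directly, a model like OpenAI’s CLIP makes the idea concrete: it scores how well an image agrees with a caption, based purely on associations learned from captioned images scraped from the web. (CLIP is my illustrative stand-in here; Midjourney hasn’t published its internals, though CLIP-style models guide several of its contemporaries. The image file and captions below are hypothetical.)

```python
# A minimal sketch of scoring image-text agreement with OpenAI's CLIP.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# A hypothetical product shot and some candidate captions.
image = preprocess(Image.open("receiver.jpg")).unsqueeze(0).to(device)
text = clip.tokenize([
    "a high-end stereo receiver",
    "a pair of loudspeakers",
    "a cat",
]).to(device)

with torch.no_grad():
    # Image and captions are embedded into a shared 512-dimensional space;
    # the logits are scaled cosine similarities between those embeddings.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# The caption favored by the training data's crowd-sourced labels wins.
print(probs)
```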
When I ask it for pictures of speakers without giving it specifics, it’s not rendering big electrostats and Magneplanar speakers and funky multi-cabinet beasts like the Wilson Alexx in its initial seeds, because those things are outliers and are, as such, treated like noise instead of signal. Its concept of a stereo integrated amplifier doesn’t resemble anything like the Marantz Model 40n or Cambridge Audio Evo 150 or NAD M10 V2 because most people would look at those things and say, “That doesn’t look like any integrated amp I’ve ever seen!” The only way a system like Midjourney would be able to imagine something that looks like the Evo 150 would be to pretty much just replicate the Evo 150. And that would defeat half the purpose of these things.
But again, it gives us something to think about when trying to figure out how to preach to the unconverted. If “boring black box with some circles on the front” is inherent to the general perception of what a loudspeaker is, perhaps the way to reach new people is to focus on speakers that aren’t that. I’ve been drawn to odd and beautifully designed integrated amps lately because they’re something different, and that’s kinda it. But maybe designs of that sort are the key to reaching people who think a stereo system is something that exists on a design spectrum ranging from “meh” to “eww.”
Some cause for optimism
At any rate, there is a small ray of hope in all this exploration. Rather than specifically asking Midjourney to draw a high-end audio writer or someone who frequents audiophile groups on Facebook, I just asked it to paint an audiophile. Just that. An audiophile. And what it came back with was this:
Notice that those semi-abstract forms skew kinda young. That’s hopeful. Note, too, that the first thing it came up with is a young woman, which is particularly striking given the well-documented biases against women and people of color that these neural networks inherit from their training data.
Don’t ask me what’s going on in that bottom right image, by the way. I haven’t a clue. Ignoring outliers like that, though, I think all of this exploration is instructive, not so much in the way that it tells us what super-intelligent neural networks think about hi-fi—or any given subject—but as a reflection of what we think about it.
Again, the way these things work is by pattern recognition and pattern synthesis. To recognize patterns, these neural networks have to observe patterns. And when I go searching for patterns in other fields of interest, I find all sorts of variety. Ask Midjourney to draw or paint some bikes, and you get everything from quaint penny-farthing-esque things to recumbents that look like they’re from the future.
But for us, it’s all black boxes with a bit of wood and metal and some knobs. And again, I’m not saying that’s true or representative; I’m merely saying that’s the perception I’m seeing so far. That’s the pattern I’m recognizing in the pattern-recognition data. That’s, as best I can tell, part of the inertia we face when we try to reach the youths. And I’m not saying that’s the entire answer, but I can’t help thinking it’s part of the answer.
. . . Dennis Burger
dennisb@soundstagenetwork.com