The realm of fake content continues to be refined by artificial intelligence. Fake text was arguably mastered a few years ago with startup OpenAI’s GPT-3 natural-language-processing program.
Image fakery, which had advanced significantly thanks to programs such as StyleGAN, introduced in 2019 by Tero Karras and colleagues at Nvidia, received a boost this summer with OpenAI’s announcement of a new image-generation program, DALL•E 2, which builds on the first DALL•E, released in January 2021. It turns a sentence you type into an image, and it offers many ways to shape the output.
This week, OpenAI removed the waitlist: anyone can now go to the website and try DALL•E 2, as long as they are willing to create an account on OpenAI’s site with an email address and phone number.
The strength of DALL•E 2, like its predecessor’s, is to create images from text a person enters into a field on the web page. Type in the phrase “an astronaut riding a horse in a photorealistic style” and something like this will appear: a realistic rendering of a figure in profile in an astronaut’s suit, astride a horse striding against a backdrop that resembles the cosmos.
The work is described in a research paper by OpenAI scientists Aditya Ramesh and colleagues, “Hierarchical Text-Conditional Image Generation with CLIP Latents,” published on the pre-print server arXiv.
DALL•E 2 is a so-called contrastive encoder-decoder: it is built by compressing images and their captions into a kind of abstract, composite representation, then decompressing them again. That training regimen develops the program’s ability to link text and image.
The main point Ramesh and colleagues make is that the way the compression and decompression are done allows one to do more than translate between text and image: it lets phrases describing aspects of an image shape the output, such as adding the term “photorealistic,” which produces something with a certain slick realism.
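The two-stage pipeline the paper describes — encode the caption, translate that embedding into an image embedding, then decode it into pixels — can be sketched roughly as follows. This is a minimal illustrative skeleton, not OpenAI’s actual code: the function names, the 512-dimensional embedding size, and the stand-in random outputs are all assumptions for illustration.

```python
import numpy as np

EMBED_DIM = 512  # illustrative; real CLIP embedding sizes vary by model

def clip_text_encoder(caption: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: caption -> embedding vector."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)  # normalized, as CLIP embeddings are

def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the 'prior' that maps a text embedding to a
    plausible CLIP *image* embedding (a learned model in the paper)."""
    return text_embedding

def decoder(image_embedding: np.ndarray, size: int = 64) -> np.ndarray:
    """Stand-in for the diffusion decoder: image embedding -> pixels."""
    rng = np.random.default_rng(0)
    return rng.random((size, size, 3))

def generate(caption: str) -> np.ndarray:
    # 1) encode text, 2) translate to an image embedding, 3) decode to pixels
    return decoder(prior(clip_text_encoder(caption)))

img = generate("an astronaut riding a horse in a photorealistic style")
print(img.shape)  # (64, 64, 3)
```

The interesting part is the middle step: because text and image live in a shared embedding space, a word like “photorealistic” moves the image embedding, and the decoder renders that shift.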
Although the images are still a bit rough, you can see that DALL•E 2 has the potential to replace many commercial illustrations and even stock photography. By entering an expression plus a style, e.g., “photo,” you can output a variety of images that may be suitable for illustrating articles.
You can see for yourself by trying it out. Most of the things that immediately come to mind are fun combinations. For example, “a blue whale and a kitten befriending on the beach, digital art” produces the output below in an adorable greeting-card style.
Four versions are offered, each of which you can download in PNG format.
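For readers who would rather script such a batch than click through the web page, a request for four downloadable images might be shaped along these lines. To be clear, this is a hypothetical sketch: the article covers only the web interface, and the field names here (“prompt,” “n,” “size,” “response_format”) are illustrative assumptions, not a documented API contract.

```python
import json

def build_generation_request(prompt: str, n: int = 4,
                             size: str = "1024x1024") -> dict:
    """Assemble a hypothetical JSON payload asking for n PNG images.
    All field names are assumptions made for illustration."""
    if not (1 <= n <= 10):
        raise ValueError("n must be between 1 and 10")
    return {"prompt": prompt, "n": n, "size": size,
            "response_format": "png"}

payload = build_generation_request(
    "a blue whale and a kitten befriending on the beach, digital art")
print(json.dumps(payload, indent=2))
```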
But it’s also possible to get a series of more mundane images that fit a stock-photography context. Enter the expression “a ZDNet contributing author who sees the future of technology in his own articles, on a mountainside suspended in space,” and you get a sort of sci-fi image close to what might accompany an article.
You can add the phrase “realistic image” and get something fancier.
Using the phrase “photo of a very anxious computer user staring at his computer monitor and seeing a Windows patch warning” produced a delightful series of images of typically anxious computer users.
The phrase can be strengthened with additional words to get more specific results, such as “photo of a very anxious computer user at their desk staring at their computer monitor and seeing a Windows patch warning.”
Once you start exploring stock-photo territory, you’ll find you can think of many scenarios to turn into an image. For example, “photo of a person wearing glasses pointing at several people at a conference table in a meeting room” yields a pretty good sample of what, at first glance, look like real office scenes.
Again, with a few words, one can get more specific, changing attributes of the scene, such as “Photo of a person wearing glasses standing by a board in a conference room, explaining something to their colleagues.”
As you can see, things such as facial features are generally degraded in DALL•E 2’s output.
By invoking artists, artistic media, or styles, one can move the same image from the realm of stock photography to the realm of illustration, as in the sentence “Francis Bacon painting of a group of people in a conference room and a bespectacled person next to a blackboard explaining something.”
Once you create an account, OpenAI gives you 50 “credits,” which are free requests to the system; each phrase entered counts as one request. Once you’ve used up the 50, you can either wait a month for the next 15 free credits, or you can buy credits. Credits are sold in packs of 115 for $15, or about 13 cents per credit.
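The pricing arithmetic is simple; as a quick sanity check (the per-image figure assumes, per the article, that each request returns four images):

```python
pack_price_usd = 15.00
pack_credits = 115

# One credit = one request (one phrase entered).
per_credit = pack_price_usd / pack_credits
print(f"${per_credit:.2f} per credit")  # about $0.13, i.e. 13 cents

# Each request yields four candidate images, so per image:
per_image = per_credit / 4
print(f"${per_image:.3f} per image")  # roughly 3.3 cents
```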
It is possible to knock the program off the rails in a number of ways. Some requests may mix the real and the imagined too much to be rendered convincingly. For example, a request for “blue-furred rats are taking over Times Square” yields a decent first try, but the fur element gives the image a sloppy, uneven quality that doesn’t really work.
Other requests can trip things up with the choice of a single word.
The request “a bag full of money seated on a lawn chair on a porch overlooking the sunset” produced completely bizarre, incoherent images, such as a close-up of toenails and an ambiguous image that appeared to be some flowers stuck in a carpet.
Replacing the word “seated” with “placed” let DALL•E 2 achieve a satisfactory result in one of three images.
The program may not find a suitable combination of elements for the apparently active verb “seated” when it is combined with an inanimate object, a bag of money.
In general, the program seems to struggle with aspects of location, such as “standing in front of an easel.”
Sentences that are not descriptions but questions or heckling seem to put the system in a random mode. For example: “Does DALL•E 2 know its own name?” is an expression that produces multiple images of flowers. That might be a poetic response, but it feels more like a refusal of the prompt.
OpenAI has built in some guardrails, listed in its posted content policy, which automatically zap forbidden attempts. For example, the prompt “Microsoft co-founder Bill Gates smokes a cigar in a run-down apartment with broken furniture” will not generate anything. Instead, an error message appears stating that the request violates the policy, with a link to the policy page. It likely runs afoul of the rule against making images of public figures.
The same request, with Gates replaced by the rather lesser-known Tiernan Ray, a ZDNet contributing author, created a selection of amusing images of people who are not Tiernan Ray.
Additionally, trademarked names appear to be guarded against wholesale infringement. The phrase “a bunch of people hanging out in front of McDonald’s” creates an appropriate-enough scene, but each result offered slightly alters the word “McDonald’s” so that it isn’t really that word.
Where does it go from here? Work is underway on numerous fronts to extend the fundamental text-to-image approach. One front is adding more lexical complexity to the program. For example, in May, Chitwan Saharia and the Google Brain team published their work on Imagen, a program they say has an “unprecedented degree of photorealism.” The trick was to use a far larger corpus of language material to train the network.
And work is being done to increase the complexity of what such a program can do. This month, for example, Google scientists Wenhu Chen and colleagues introduced a program called Re-Imagen that extends Saharia and team’s Imagen, combining the basic idea of compressing text and images with a third element: search results.
By adding what they call “retrieval,” the program is designed not only to find a “semantic” combination of word and image but also to look through search results for combinations that refine the output. They claim the results far outperform Imagen and DALL•E 2 at handling rare, obscure phrases such as “picarones are served with wine,” referring to the Peruvian sweet-potato dessert.
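The retrieval idea can be illustrated with a toy nearest-neighbor lookup: instead of relying only on what the model memorized during training, the generator is also conditioned on the database entries most similar to the prompt. This sketch illustrates the general technique, not the Re-Imagen implementation; the character-count embedding and the three-caption mini-database are invented for the example.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy text embedding: normalized bag-of-letters counts
    (a crude stand-in for a learned encoder)."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

# Invented mini-database of captions the retriever can consult.
DATABASE = [
    "picarones served on a plate",
    "a horse in a field",
    "a city street at night",
]

def retrieve(prompt: str, k: int = 1) -> list:
    """Return the k database captions most similar to the prompt
    (cosine similarity); these would then condition the decoder."""
    q = embed(prompt)
    scores = [float(q @ embed(c)) for c in DATABASE]
    order = np.argsort(scores)[::-1]
    return [DATABASE[i] for i in order[:k]]

print(retrieve("picarones are served with wine"))
```

Even this crude retriever surfaces the “picarones” caption for the obscure prompt; the real system uses learned multimodal embeddings and retrieves image-text pairs from web-scale data.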