Why Is Generating Infographics So Hard for GenAI?
I encountered some surprising behavior from DALLE while tinkering with the prompts proposed by Ethan Mollick in his post about constructing expert prompts that get the most out of chatGPT. He humorously calls a repository of these prompts a grimoire. The idea of using expert prompts to create expert GPTs is brilliant: with some prompting, one can turn the default, boring chatGPT into a professional expert in some field. Then I realized that the prompt he created for a tutorGPT is HUGE. It’s almost 400 words long. No one could have come up with it without a lot of tinkering and hacking one’s way through many experiments. Seeing how long this prompt is, I started to wonder whether I have been prompting the right way. Most of my prompts are relatively short: tell me this, summarize that, how do I do this? Sometimes I get tasks done right; sometimes GPT sounds so dumb it takes Google to solve the problem. The observation that a detailed prompt is more effective than a short one is also shared by Jaime Teevan, Chief Scientist at Microsoft. She argues that we humans carry a lot of context in our heads, and we have to spell it out for chatGPT explicitly; otherwise it doesn’t know what we’re trying to do.
Writing effective prompts is an art that grade school, or my doctoral program for that matter, hasn’t prepared me for. Effective prompt writing is a completely new form of writing that takes “the AI imagination” to do right. I define the AI imagination as a person’s ability to understand the limits of AI and thus to construct appropriate prompts that harness the power of genAI through words alone. This imagination is a skill, not an innate ability. It is a disciplinary skill akin to what an undergraduate student might acquire in methods classes such as “Historical Imagination,” “Sociological Imagination,” or “The Art of Mathematical Reasoning.”
What does it take to write prompts effectively, then? It’s about giving context. It’s about describing the role the GPT should play: it has to assume a character and know that character’s boundaries, what it can and cannot do. The more detail we provide, the better the character behaves.
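To make this concrete, here is a minimal sketch of what packing context into a prompt can look like through the OpenAI Python library. The tutor persona below is my own invented stand-in, not Mollick’s actual grimoire, and the model name is just an example:

```python
# A minimal sketch of role-rich prompting with the OpenAI Python
# library (v1+). The persona text is an invented stand-in for a
# real ~400-word "grimoire" prompt like Mollick's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The kind of prompt I had been writing: short, context-free.
lazy = "Explain matrix decomposition."

# A context-packed alternative: role, audience, boundaries, behavior.
detailed = (
    "You are a patient university tutor. Your student is a PhD-trained "
    "social scientist who is new to linear algebra. Explain matrix "
    "decomposition step by step, define every symbol you use, give one "
    "small worked example, and end by quizzing the student with a "
    "single question. Do not move on until the student answers."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any chat model works here
    messages=[{"role": "user", "content": detailed}],
)
print(response.choices[0].message.content)
```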
So I have been doing things pretty much the wrong way. I haven’t packed in enough details. I have been a pretty lazy prompt writer: I just want an answer, rather than taking pleasure in writing a long prompt with a lot of imagination packed into it. The more you put into an AI, the more you get out of it. AI-engineering is hard work!
Long story short, while playing with the tutorGPT that Mollick designed, I found that it doesn’t do a great job of teaching me a new concept. I asked it to teach me about matrix decomposition and the chain rule in probability. It generated text just fine; though I didn’t fact-check the answers with a mathematician, I assume they were correct.
As a visual learner, and as I remember from my teaching days in graduate school, I know that a picture, an infographic, or a diagram is worth a thousand words. The concept lodges in the learner’s mind in a very different way. So after each explanation, I asked chatGPT to generate a figure, a diagram, or an image to encapsulate the idea. Here is what it generated for matrix decomposition:
The image is pretty impressive. I love the idea of representing two types of matrix decomposition (LU decomposition and singular value decomposition) by dividing a stack of pancakes. But on closer examination, something is odd: the words are very, very wrong. “Simigmag” and “Sigmma” are supposed to be sigma. “SMUA” is SVD, I think.
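For reference, and because the image couldn’t spell them, the two decompositions being illustrated are usually written as:

$$ A = LU \qquad\qquad A = U \Sigma V^{\top} $$

In the first, $L$ is lower triangular and $U$ is upper triangular; in the second, $U$ and $V$ are orthogonal matrices and $\Sigma$ is the diagonal matrix of singular values. That $\Sigma$ is the “sigma” the pancakes keep mangling.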
I kept iterating through a few examples. Here is the illustration of the workflow of a UX researcher:
All the words in this image are misspelled: “data colleriction,” “Anaysis,” “IIInterwiees,” and so on. This gave me a good laugh.
Here’s another mathy one, an illustration of the chain rule in probability:
Same problem, with misspellings everywhere: “decommpiision,” “prooibiity,” “proviity,” “probiity,” “proiity.”
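For reference, the rule the figure keeps failing to spell out fits on one line: each event’s probability is multiplied in, conditioned on everything that came before.

$$ P(A \cap B \cap C) = P(A)\,P(B \mid A)\,P(C \mid A \cap B) $$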
I restated the problem, asked chatGPT to fix the chain rule figure, and this is what it produced the second time around:
The text accompanying the regenerated illustration is as follows:
The revised illustration, ensuring correct spelling and accurate depiction of the chain rule in probability decomposition, is now available. This image visually represents the sequence of events A, B, and C, with clear and correct labeling, and demonstrates the multiplication of their probabilities in the correct sequence, adhering to the chain rule.
So instead of revising the same digital illustration, which was pretty cool already, it generated a completely new one. Fine. I get it. I do the same thing: instead of fixing the bugs in my production code, I’d rather generate something completely new. I can sympathize. But upon closer inspection, the words are still misspelled everywhere: “prtitery,” “chaclulabte,” “probility.” Regeneration doesn’t fix the important details.
Overall, the generated infographics are pretty impressive at first glance. The “negative space,” the white space, the area deliberately left empty, is used well. The core concepts may be well illustrated, but the words are SO WRONG!!
Why is getting words right in an infographic so difficult?
This seems to be a hard problem that image generation models are still trying to solve. There is even a Reddit post about the issue. Redditors joked that DALLE doesn’t have a spell check because it uses a diffusion model. I wish spell check and diffusion models could be merged! What seems so easy to human eyes becomes hard when the underlying generative model is a diffusion model. One observation is that all text-to-image models run into the same no-spell-check issue, but models with more parameters seem to do better than smaller ones. Another explanation I found is that DALLE is likely operating at the character level rather than the word level: it considers individual characters within a word rather than the word as a whole, so when generating a picture it renders characters one by one without the context of the entire word. Others have pointed out that DALLE 3 does better than DALLE 2, but a better image model doesn’t mean the problem has disappeared.
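To make the character-versus-word distinction concrete, here is a tiny sketch. The tiktoken library and the specific encoding are my own choices for illustration, not anything documented about how DALLE works internally:

```python
# Contrast a subword (BPE) view of a word with a character view.
# Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE vocabulary used by OpenAI chat models

word = "Sigma"

# Subword view: a common word maps to very few tokens, so a model
# can treat it as (nearly) one contextual unit.
print(enc.encode(word))  # a short list of token ids, often just one or two

# Character view: every piece must be produced correctly and
# independently; one slip and "Sigma" becomes "Simigmag".
print(list(word))  # ['S', 'i', 'g', 'm', 'a']
```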
The next question is more of a hypothesis. In this particular exercise, I am asking chatGPT to generate something akin to an illustration in a textbook or a lecture note, and it fails to produce a complete image with correct words. Does that mean this is where humans should be involved? This reminds me of the research Repairing Innovation by Elizabeth Anne Watkins and Madeleine Clare Elish. They found that when AI algorithms fail in hospitals, human workers such as nurses and doctors are the ones in the trenches who carry out the labor necessary to fix what goes wrong. Are we seeing a rise of “repairing innovation” work done by humans in the age of generative AI, whenever AI fails to deliver? Some DALLE users humorously suggested that we can simply use Photoshop to fix the words DALLE couldn’t get right. That must be fun for Photoshop experts. At least they don’t have to come up with the design from zero; they only have to fix certain aspects of it. If the total hours they spend fixing are fewer than the hours it would take to create something from scratch, maybe it’s a productivity win (loosely defined).
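Even the triage half of that repair work could be scripted. Here is a rough sketch that OCRs a generated infographic and flags words a dictionary doesn’t recognize; the pytesseract and pyspellchecker packages and the file name are my own assumptions, and fixing the pixels themselves would still be manual:

```python
# A rough sketch of automated "repair triage": OCR the generated
# infographic, then flag words a spell checker doesn't recognize.
# Assumes `pip install pytesseract pyspellchecker` plus a local
# Tesseract install; the image path below is hypothetical.
import pytesseract
from PIL import Image
from spellchecker import SpellChecker

def flag_misspellings(image_path: str) -> set[str]:
    # Extract whatever text Tesseract can read from the image.
    text = pytesseract.image_to_string(Image.open(image_path))
    words = [w.strip(".,:;!?()") for w in text.split()]
    # Return the words the dictionary doesn't know, e.g. "prooibiity".
    return SpellChecker().unknown(w for w in words if w)

print(flag_misspellings("chain_rule_infographic.png"))
```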
The final set of questions: When can this problem be fixed? How would it be fixed? What would it take? It seems to be a hard problem: generating text, but in the form of pixels. I bet someone is working on a dissertation that tries to solve this. If anyone knows of such research, please let me know.