What the new GPT-4 AI can do
- Science
- March 16, 2023
Technology research firm OpenAI has just released an updated version of its text-generating artificial intelligence program, GPT-4, and demonstrated some of the language model’s new capabilities. Not only can GPT-4 produce more natural-sounding text and solve problems more accurately than its predecessor, it can also process images in addition to text. But the AI is still prone to some of the same problems that plagued earlier GPT models: displaying biases, slipping past the guardrails designed to prevent it from saying offensive or dangerous things, and “hallucinating,” or confidently making up untruths that are not found in its training data.
On Twitter, Sam Altman, CEO of OpenAI, described the model as the company’s “most powerful and aligned” yet. (“Aligned” means it follows human ethics.) But “it’s still flawed, still limited, and it still seems more impressive the first time you use it than after more time with it,” he wrote in the tweet.
Perhaps the most significant change is that GPT-4 is “multimodal,” meaning it works with both text and images. Although it cannot output images (as generative AI models such as DALL-E and Stable Diffusion can), it can process and respond to the visual inputs it receives. Annette Vee, an associate professor of English at the University of Pittsburgh who studies the intersection of computation and writing, watched a demonstration in which the new model was challenged to figure out what’s funny about a humorous image. Being able to do this means “understanding the context in the picture. It’s about understanding how an image is composed and why, and connecting it to society’s understanding of language,” she says. “ChatGPT couldn’t do that.”
A device with the ability to analyze and then describe images could be of tremendous value to people who are visually impaired or blind. For example, a mobile app called Be My Eyes can describe the objects around a user and help those with limited or no vision interpret their surroundings. The app recently integrated GPT-4 into a “virtual volunteer” that, according to a statement on OpenAI’s website, “can generate the same level of context and understanding as a human volunteer.”
But GPT-4’s image analysis goes beyond describing the picture. In the same demonstration Vee observed, an OpenAI representative sketched an image of a simple website and fed the drawing to GPT-4. The model was then asked to write the code needed to produce such a website, and it did. “It basically looked like the picture. It was very, very simple, but it worked pretty well,” says Jonathan May, a research associate at the University of Southern California. “Well, that was cool.”
Even without its multimodal capability, the new program outperforms its predecessors at tasks that require logical reasoning and problem-solving. OpenAI says it ran both GPT-3.5 and GPT-4 through a variety of tests designed for humans, including a simulation of the bar exam, the SAT and Advanced Placement tests for high school students, the GRE for college graduates, and even a few sommelier exams. GPT-4 achieved human-level results on many of these benchmarks and consistently outperformed its predecessor, although it didn’t pass everything with flying colors: it performed poorly on English language and literature exams, for example. Still, its broad problem-solving ability could be applied to any number of real-world uses, such as managing a complex schedule, spotting errors in a block of code, explaining grammatical nuances to foreign-language learners, or identifying security vulnerabilities.
In addition, OpenAI claims that the new model can interpret and output longer blocks of text: more than 25,000 words at a time. Although previous models were also used for long-form applications, they often lost track of things. And the company is touting the new model’s “creativity,” which it describes as its ability to produce different types of artistic content in specific styles. In a demonstration comparing how GPT-3.5 and GPT-4 mimicked Argentine author Jorge Luis Borges’ style in English translation, Vee noted that the newer model provided a more accurate attempt. “You have to know enough about the context to be able to judge it,” she says. “A student might not understand why it’s better, but I’m an English professor… If you understand it from your own field of knowledge and it’s impressive in your own field of knowledge, then that’s impressive.”
May also tested the model’s creativity himself. He tried the playful task of ordering it to create a “backronym” (an acronym reached by starting with the abbreviated version and working backward). In this case, May asked for a cute name for his lab that would spell out “CUTE LAB NAME” and that would also accurately describe his field of research. GPT-3.5 failed to generate a relevant label, but GPT-4 succeeded. “The result was ‘Computational Understanding and Transformation of Expressive Language Analysis, Bridging NLP, Artificial Intelligence And Machine Education,’” he says. “‘Machine Education’ isn’t great; the ‘intelligence’ part means there’s an extra letter in there. But honestly, I’ve seen a lot worse.” (For context, his lab’s actual name is CUTE LAB NAME, or the Center for Useful Techniques Enhancing Language Applications Based on Natural And Meaningful Evidence.) In another test, the model showed the limits of its creativity. When May asked it to write a specific kind of sonnet (he requested a form used by the Italian poet Petrarch), the model, unfamiliar with that poetic form, defaulted to the sonnet form preferred by Shakespeare.
Of course, it would be relatively easy to fix this particular problem: GPT-4 merely needs to learn an additional poetic form. In fact, when people expose the model’s failures in this way, it helps the program develop: it can learn from whatever unofficial testers feed into the system. Like its less fluent predecessors, GPT-4 was originally trained on large amounts of data, and that training was then refined by human testers. (GPT stands for Generative Pretrained Transformer.) But OpenAI has kept secret just how it made GPT-4 better than GPT-3.5, the model that powers the company’s popular ChatGPT chatbot. According to the paper published along with the new model’s release, “Given the competitive landscape and the safety implications of large-scale models such as GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” OpenAI’s lack of transparency reflects the newly competitive generative AI environment, in which GPT-4 must vie with programs such as Google’s Bard and Meta’s LLaMA. The paper goes on to suggest, however, that the company plans to eventually share such details with third parties “who can advise us on how to balance the competitive and safety considerations … against the scientific value of further transparency.”
These safety considerations matter because smarter chatbots have the ability to cause harm: without guardrails, they could provide a would-be terrorist with instructions for building a bomb, churn out threatening messages for a harassment campaign, or supply misinformation to a foreign agent trying to influence an election. Although OpenAI has placed limits on what its GPT models are allowed to say in order to avoid such scenarios, determined testers have found ways around them. “These things are like bulls in a china shop: they’re powerful, but they’re reckless,” scientist and author Gary Marcus told Scientific American shortly before GPT-4’s release. “I don’t believe [version] four will change that.”
And the more humanlike these bots become, the better they are at tricking people into believing there is a sentient agent behind the computer screen. “Because it mimics [human reasoning] so well through language, we believe that, but under the hood it is in no way similar to humans,” Vee warns. If that illusion leads people to believe an AI agent is carrying out humanlike reasoning, they are more likely to trust its answers. That is a significant problem, because there is still no guarantee that those answers are correct. “Just because these models say something doesn’t mean that what they say is [true],” May says. “There is no database of answers that these models draw from.” Instead, systems like GPT-4 generate an answer one word at a time, with the most plausible next word informed by their training data, and that training data can become outdated. “I don’t think GPT-4 even knows it’s GPT-4,” he says. “I asked it, and it said, ‘No, no, there is no such thing as GPT-4. I’m GPT-3.’”
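May’s description of how such systems produce text, choosing a plausible next word again and again rather than looking anything up in a database, can be illustrated with a toy sketch. Everything below (the vocabulary, the probabilities, the function names) is invented purely for illustration and is not OpenAI’s code or data.

```python
import random

def next_word_probabilities(context):
    # A real language model uses a neural network to score every word in its
    # vocabulary given the context so far; these hard-coded numbers merely
    # stand in for that step. (Invented values, purely illustrative.)
    if context.endswith("large language"):
        return {"model": 0.90, "corpus": 0.07, "kettle": 0.03}
    return {"that": 0.40, "which": 0.35, "and": 0.25}

def generate(prompt, steps=3):
    # Generate text one word at a time: at each step, sample the next word
    # in proportion to how plausible the (toy) model thinks it is, append it,
    # and repeat. No database of answers is consulted at any point.
    text = prompt
    for _ in range(steps):
        probs = next_word_probabilities(text)
        words, weights = zip(*probs.items())
        text += " " + random.choices(words, weights=weights)[0]
    return text

print(generate("GPT-4 is a large language"))
```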
Now that the model has been released, many researchers and AI enthusiasts have the opportunity to explore GPT-4’s strengths and weaknesses. Developers who want to use it in other applications can request access, and anyone who wants to “talk” to the program must subscribe to ChatGPT Plus. For $20 per month, this paid program allows users to choose whether they want to talk to a chatbot running on GPT-3.5 or one on GPT-4.
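For developers who are granted API access, querying the model programmatically might look roughly like the sketch below. It assumes the openai Python package as it existed around GPT-4’s launch; the placeholder key and the prompt are made up, and the exact interface may differ in later versions of the library.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; requires approved API access

# Ask the GPT-4 model a question via the chat completions endpoint.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize what a backronym is in one sentence."}],
)

print(response["choices"][0]["message"]["content"])
```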
Such exploration will no doubt uncover other potential uses, and other bugs, in GPT-4. “The real question should be, ‘How are people going to feel about this two months after the initial shock?’” says Marcus. “Part of my advice is: let’s dampen our initial excitement by recognizing that we’ve seen this film before. It’s always easy to make a demo of something; it’s difficult to turn it into a real product. And if it still has these problems (around hallucinations, not really understanding the physical world, the medical world, and so on), that will still somewhat limit its usefulness. And it will still mean you have to be careful about how it’s used and what it’s used for.”