What the New GPT-4 AI Can Do

2023-03-19

Tech research company OpenAI has just released an updated version of its text-generating artificial intelligence program, called GPT-4, and demonstrated some of the language model’s new abilities. Not only can GPT-4 produce more natural-sounding text and solve problems more accurately than its predecessor; it can also process images in addition to text. But the AI is still vulnerable to some of the same problems that plagued earlier GPT models: displaying bias, overstepping the guardrails intended to prevent it from saying offensive or dangerous things, and “hallucinating,” or confidently making up falsehoods not found in its training data.

On Twitter, OpenAI CEO Sam Altman described the model as the company’s “most capable and aligned” to date. (“Aligned” means it is designed to follow human ethics.) But “it is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it,” he wrote in the tweet.

Perhaps the most significant change is that GPT-4 is “multimodal,” meaning it works with both text and images. Although it cannot output pictures (as do generative AI models such as DALL-E and Stable Diffusion), it can process and respond to the visual inputs it receives. Annette Vee, an associate professor of English at the University of Pittsburgh who studies the intersection of computation and writing, watched a demonstration in which the new model was told to identify what was funny about a humorous image. Being able to do so means “understanding context in the image. It’s understanding how an image is composed and why and connecting it to social understandings of language,” she says. “ChatGPT wasn’t able to do that.”

A device with the ability to analyze and then describe images could be enormously valuable for people who are visually impaired or blind. For instance, a mobile app called Be My Eyes can describe the objects around a user, helping those with low or no vision interpret their surroundings. The app recently incorporated GPT-4 into a “virtual volunteer” that, according to a statement on OpenAI’s website, “can generate the same level of context and understanding as a human volunteer.”

But GPT-4’s image analysis goes beyond describing the picture. In the same demonstration Vee watched, an OpenAI representative sketched an image of a simple website and fed the drawing to GPT-4. Next the model was asked to write the code required to produce such a website—and it did. “It looked basically like what the image is. It was very, very simple, but it worked pretty well,” says Jonathan May, a research associate professor at the University of Southern California. “So that was cool.”

Even without its multimodal capability, the new program outperforms its predecessors at tasks that require reasoning and problem-solving. OpenAI says it has run both GPT-3.5 and GPT-4 through a variety of tests designed for humans, including a simulation of a lawyer’s bar exam, the SAT and Advanced Placement tests for high schoolers, the GRE for college graduates and even a couple of sommelier exams. GPT-4 achieved human-level scores on many of these benchmarks and consistently outperformed its predecessor, although it did not ace everything: it performed poorly on English language and literature exams, for example. Still, its extensive problem-solving ability could be applied to any number of real-world applications—such as managing a complex schedule, finding errors in a block of code, explaining grammatical nuances to foreign-language learners or identifying security vulnerabilities.

Additionally, OpenAI claims the new model can interpret and output longer blocks of text: more than 25,000 words at once. Although previous models were also used for long-form applications, they often lost track of what they were talking about. And the company touts the new model’s “creativity,” described as its ability to produce different kinds of artistic content in specific styles. In a demonstration comparing how GPT-3.5 and GPT-4 imitated the style of Argentine author Jorge Luis Borges in English translation, Vee noted that the more recent model produced a more accurate attempt. “You have to know enough about the context in order to judge it,” she says. “An undergraduate may not understand why it’s better, but I’m an English professor.... If you understand it from your own knowledge domain, and it’s impressive in your own knowledge domain, then that’s impressive.”

May has also tested the model’s creativity himself. He tried the playful task of ordering it to create a “backronym” (an acronym reached by starting with the abbreviated version and working backward). In this case, May asked for a cute name for his lab that would spell out “CUTE LAB NAME” and that would also accurately describe his field of research. GPT-3.5 failed to generate a relevant label, but GPT-4 succeeded. “It came up with ‘Computational Understanding and Transformation of Expressive Language Analysis, Bridging NLP, Artificial intelligence And Machine Education,’” he says. “‘Machine Education’ is not great; the ‘intelligence’ part means there’s an extra letter in there. But honestly, I’ve seen way worse.” (For context, his lab’s actual name is CUTE LAB NAME, or the Center for Useful Techniques Enhancing Language Applications Based on Natural And Meaningful Evidence). In another test, the model showed the limits of its creativity. When May asked it to write a specific kind of sonnet—he requested a form used by Italian poet Petrarch—the model, unfamiliar with that poetic setup, defaulted to the sonnet form preferred by Shakespeare.

Of course, fixing this particular issue would be relatively simple. GPT-4 merely needs to learn an additional poetic form. In fact, when humans goad the model into failing in this way, this helps the program develop: it can learn from everything that unofficial testers enter into the system. Like its less fluent predecessors, GPT-4 was originally trained on large swaths of data, and this training was then refined by human testers. (GPT stands for generative pretrained transformer.) But OpenAI has been secretive about just how it made GPT-4 better than GPT-3.5, the model that powers the company’s popular ChatGPT chatbot. According to the paper published alongside the release of the new model, “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” OpenAI’s lack of transparency reflects this newly competitive generative AI environment, where GPT-4 must vie with programs such as Google’s Bard and Meta’s LLaMA. The paper does go on to suggest, however, that the company plans to eventually share such details with third parties “who can advise us on how to weigh the competitive and safety considerations ... against the scientific value of further transparency.”

Those safety considerations are important because smarter chatbots have the ability to cause harm: without guardrails, they might provide a terrorist with instructions on how to build a bomb, churn out threatening messages for a harassment campaign or supply misinformation to a foreign agent attempting to sway an election. Although OpenAI has placed limits on what its GPT models are allowed to say in order to avoid such scenarios, determined testers have found ways around them. “These things are like bulls in a china shop—they’re powerful, but they’re reckless,” scientist and author Gary Marcus told Scientific American shortly before GPT-4’s release. “I don’t think [version] four is going to change that.”

And the more humanlike these bots become, the better they are at fooling people into thinking there is a sentient agent behind the computer screen. “Because it mimics [human reasoning] so well, through language, we believe that—but underneath the hood, it’s not reasoning in any way similar to the way that humans do,” Vee cautions. If this illusion fools people into believing an AI agent is performing humanlike reasoning, they may trust its answers more readily. This is a significant problem because there is still no guarantee that those responses are accurate. “Just because these models say anything, that doesn’t mean that what they’re saying is [true],” May says. “There isn’t a database of answers that these models are pulling from.” Instead, systems like GPT-4 generate an answer one word at a time, with the most plausible next word informed by their training data—and that training data can become outdated. “I believe GPT-4 doesn’t even know that it’s GPT-4,” he says. “I asked it, and it said, ‘No, no, there’s no such thing as GPT-4. I’m GPT-3.’”
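
To make the "one word at a time" idea concrete, here is a minimal sketch in Python. It is a toy, not how GPT-4 actually works: simple word-pair counts stand in for the neural network, and the tiny `corpus` string is invented for illustration. The helper names `train_bigrams` and `generate` are likewise hypothetical. The point is only that the program has no database of answers to look up; it extends the text with whatever word most often followed the previous one in its training data, so outdated data yields an outdated answer.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    followers = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def generate(followers, start, max_words=8):
    """Build a reply one word at a time, always taking the most frequent follower."""
    output = [start]
    for _ in range(max_words - 1):
        options = followers.get(output[-1])
        if not options:
            break  # the toy model never saw this word followed by anything
        # Ties are broken alphabetically so the output is deterministic.
        next_word = min(options, key=lambda w: (-options[w], w))
        output.append(next_word)
    return " ".join(output)


# Hypothetical, tiny "training data" written before any GPT-4 existed.
corpus = "i am gpt-3 . i am a language model . i am gpt-3 ."
model = train_bigrams(corpus)
print(generate(model, "i", max_words=3))  # prints: i am gpt-3
```

Because the toy model's training text never mentions GPT-4, it cannot produce that name, which loosely mirrors the exchange May describes. Real systems predict from far richer statistics and usually sample rather than always picking the top word, so this is an analogy under stated assumptions and not a description of OpenAI's method.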

Now that the model has been released, many researchers and AI enthusiasts have an opportunity to probe GPT-4’s strengths and weaknesses. Developers who want to use it in other applications can apply for access, and anyone who wants to “talk” with the program will have to subscribe to ChatGPT Plus. For $20 per month, this paid program lets users choose between talking with a chatbot that runs on GPT-3.5 and one that runs on GPT-4.

Such explorations will undoubtedly uncover more potential applications—and flaws—in GPT-4. “The real question should be ‘How are people going to feel about it two months from now, after the initial shock?’” Marcus says. “Part of my advice is: let’s temper our initial enthusiasm by realizing we have seen this movie before. It’s always easy to make a demo of something; making it into a real product is hard. And if it still has these problems—around hallucination, not really understanding the physical world, the medical world, etcetera—that’s still going to limit its utility somewhat. And it’s still going to mean you have to pay careful attention to how it’s used and what it’s used for.”

