A comprehensive look at OpenAI's multimodal GPT-4: improved accuracy, and the model behind Microsoft's new Bing
Original title: "Breaking: OpenAI officially launches multimodal GPT-4"
Original compilation: Alpha Rabbit Research Notes
Highlights
GPT-4 can accept both image and text input, while GPT-3.5 only accepts text.
GPT-4 achieves "human-level" performance on various professional and academic benchmarks. For example, it passed a simulated bar exam with a score in the top 10% of test takers.
OpenAI spent six months iteratively refining GPT-4, using lessons from its adversarial testing program and from ChatGPT, and calls the result its "best results ever" on factuality, steerability, and staying within guardrails.
In simple chat, the difference between GPT-3.5 and GPT-4 can be hard to notice, but once the complexity of a task crosses a sufficient threshold the difference emerges: GPT-4 is more reliable and more creative than GPT-3.5, and it can handle more nuanced instructions.
GPT-4 can describe and interpret relatively complex images, for example identifying a Lightning Cable adapter from a photo of it plugged into an iPhone (pictured below).
Image understanding is not yet available to all OpenAI customers; OpenAI is testing it with its partner Be My Eyes.
OpenAI acknowledges that GPT-4 is not perfect: it still gets facts wrong, makes some reasoning errors, and is occasionally overconfident.
Official document
OpenAI has officially launched GPT-4, its latest milestone in scaling deep learning. GPT-4 is a large multimodal model (it accepts image and text inputs and produces text outputs). While it is less capable than humans in many real-world scenarios, it exhibits near-human-level performance on a variety of professional and academic benchmarks.
Example: GPT-4 passed a simulated bar exam with a score in the top 10% of test takers; GPT-3.5's score, by contrast, was around the bottom 10%. Our team spent six months iteratively refining GPT-4 using lessons from our adversarial testing program and from ChatGPT. The result is that GPT-4 achieves our best results ever on factuality, steerability, and refusing to go outside of guardrails (though it is still far from perfect).
Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, we trained GPT-3.5 as a first "test run" of the whole system: we found and fixed bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us, at least!) unprecedentedly stable, and GPT-4 became our first large model whose training performance we could accurately predict in advance. As we continue to focus on reliable scaling, an intermediate goal is to hone the methodology that helps OpenAI keep anticipating and preparing for future capabilities, which we believe is critical for safety.
Capabilities
The difference between GPT-3.5 and GPT-4 may not be easy to spot in casual conversation. However, once the complexity of a task reaches a sufficient threshold, the difference emerges: GPT-4 is more reliable, more creative, and able to handle much more nuanced instructions than GPT-3.5.
To understand the difference between the two models, we tested them on a variety of benchmarks, including exams originally designed for humans. We used the most recent publicly available tests (for the Olympiads, AP exams, and so on), including purchased 2022-2023 editions of practice exams, and we did no specific training for these exams. A small number of the exam questions were seen by the model during training, but we believe the results below are representative.


We also evaluated GPT-4 on traditional benchmarks designed for machine learning models. GPT-4 considerably outperforms existing large language models and is on par with most state-of-the-art (SOTA) models, which may include benchmark-specific crafting or additional training protocols.

Since most existing ML benchmarks are written in English, to get an initial sense of capabilities in other languages we used Azure Translate to translate the MMLU benchmark (a suite of 14,000 multiple-choice questions spanning 57 subjects) into a variety of languages. In 24 of the 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and of other large models (Chinchilla, PaLM), including for low-resource languages such as Latvian, Welsh, and Swahili.
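OpenAI does not publish its exact translation pipeline, but the general idea is straightforward. The sketch below shows how a batch of benchmark questions might be translated with the Azure Translator REST API; the key, region, and question text are placeholders, and this is only an illustration of the approach, not OpenAI's actual code.

```python
import requests

# Minimal sketch: translate a batch of benchmark questions with Azure Translator.
# The subscription key, region, and sample question are placeholders.
AZURE_ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"
AZURE_KEY = "YOUR_TRANSLATOR_KEY"   # assumption: your own Azure resource key
AZURE_REGION = "westeurope"         # assumption: your resource's region

def translate_questions(questions, target_lang="lv"):
    """Translate a list of English question strings into target_lang (e.g. 'lv' = Latvian)."""
    headers = {
        "Ocp-Apim-Subscription-Key": AZURE_KEY,
        "Ocp-Apim-Subscription-Region": AZURE_REGION,
        "Content-Type": "application/json",
    }
    params = {"api-version": "3.0", "from": "en", "to": target_lang}
    body = [{"text": q} for q in questions]
    resp = requests.post(AZURE_ENDPOINT, params=params, headers=headers, json=body)
    resp.raise_for_status()
    return [item["translations"][0]["text"] for item in resp.json()]

# Example: translating one MMLU-style multiple-choice question.
print(translate_questions(
    ["Which planet is known as the Red Planet? (A) Venus (B) Mars (C) Jupiter (D) Saturn"]
))
```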

Visual input
GPT-4 can accept prompts consisting of both text and images which, in parallel with the text-only setting, lets the user specify any vision or language task. Given inputs that interleave text and images (documents with text and photos, diagrams, or screenshots), it generates text outputs (natural language, code, and so on) and shows capabilities similar to those it displays on text-only inputs. It can also be combined with test-time techniques developed for text-only language models, including few-shot and chain-of-thought (CoT) prompting. Image input is still a research preview, however, and is not yet available in a public, consumer-facing product.
The image below shows the packaging of a "Lightning Cable" adapter, with three panels.


Panel 1: A smartphone with a VGA connector (the big blue 15-pin connector usually used on computer monitors) plugged into its charging port.
Panel 2: The packaging of the "Lightning Cable" adapter, with a picture of a VGA connector printed on it.
Panel 3: A close-up of the VGA connector, terminating in a small Lightning connector (used to charge iPhones and other Apple devices).
The humor of this image comes from the absurdity of plugging a large, outdated VGA connector into a small, modern smartphone charging port.
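Image input is not generally available yet, but for developers curious what asking this kind of question might look like, here is a minimal sketch of a multimodal request. The model name, the image URL, and the message format are assumptions based on OpenAI's Chat Completions API, not a confirmed GPT-4 image interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical multimodal request: ask the model what is funny about the adapter photo.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumption: an image-capable GPT-4 variant
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is funny about this image? Describe it panel by panel."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/lightning-cable-adapter.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```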
Steerability
We have been working on each aspect of the plan outlined in our post on defining AI behavior, including steerability. Rather than the fixed verbosity, tone, and style of the classic ChatGPT personality, developers (and soon all ChatGPT users) can now prescribe their AI's style and task by describing those directions in the "system" message.
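As a concrete illustration, a developer might steer the model's persona and task through the system message like this. This is a minimal sketch using the Chat Completions API; the Socratic-tutor instruction is just an example of such a direction.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Steer style and task via the system message instead of the default ChatGPT persona.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a Socratic tutor. Never give the answer directly; "
                    "always respond with a guiding question."},
        {"role": "user",
         "content": "How do I solve the equation 3x + 5 = 14?"},
    ],
)
print(response.choices[0].message.content)
```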
Limitations
Despite its impressive capabilities, GPT-4 has limitations similar to those of earlier GPT models. Most importantly, it is still not fully reliable: it "hallucinates" facts and makes reasoning errors. Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as requiring human review or avoiding high-stakes uses altogether) matched to the needs of the specific use case.
While real issues remain, GPT-4 significantly reduces hallucinations (that is, confidently stated nonsense) relative to previous models, which have themselves been improving with each iteration. On our internal, adversarially designed factuality evaluations, GPT-4 scores 40% higher than our latest GPT-3.5.

The GPT-4 base model is only slightly better than GPT-3.5 at this kind of factuality task; however, after post-training with RLHF (applying the same process we used for GPT-3.5), there is a large gap. The model can still exhibit various biases in its outputs; we have made progress in these areas, but more work remains. Per our recent blog post, our goal is for the AI systems we build to have reasonable default behaviors that reflect a broad range of users' values, to allow those systems to be customized within broad bounds, and to get public input on what those bounds should be.
Risks and Mitigations
We have been iterating on GPT-4 from the beginning of training to make it safer and more aligned. Our efforts include selecting and filtering the pre-training data, evaluations, engaging outside experts, improving the model's safety behavior, and monitoring and enforcement.
GPT-4 poses similar risks to previous models, such as generating harmful advice, buggy code, or inaccurate information. However, GPT-4's additional capabilities also create new risk surfaces. To understand the extent of these risks, we engaged more than 50 experts in areas including AI alignment risks, cybersecurity, biorisk, trust and safety, and international security to adversarially test the model. Their participation allowed us to test the model's behavior in high-risk domains that require expertise to evaluate, and feedback and data from these experts fed into our mitigations and improvements to the model. For example, we collected additional data to improve GPT-4's ability to refuse requests about how to synthesize dangerous chemicals.
GPT-4 incorporates an additional safety reward signal during RLHF training, training the model to refuse requests for such content and thereby reducing harmful outputs (as defined by our usage guidelines). The reward is provided by a GPT-4-based classifier that judges safety boundaries and completion style on safety-related prompts. To prevent the model from refusing valid requests, we collect a diverse dataset from various sources (for example labeled production data, human red teaming, and model-generated prompts) and apply the safety reward signal, with either a positive or a negative value, to both allowed and disallowed categories.
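OpenAI has not published the implementation details, but conceptually the extra safety signal can be pictured as a classifier-derived bonus or penalty added to the usual RLHF reward. The sketch below is purely illustrative; all function names and the +/-1 values are made up for the example.

```python
def combined_reward(prompt, completion, reward_model, safety_classifier, is_disallowed):
    """Illustrative only: blend a learned preference reward with a rule-based safety signal.

    reward_model(prompt, completion)      -> float preference score from the RLHF reward model
    safety_classifier(prompt, completion) -> True if the completion refuses the request
    is_disallowed                         -> True if the prompt asks for disallowed content
    """
    reward = reward_model(prompt, completion)
    refused = safety_classifier(prompt, completion)
    if is_disallowed:
        # Reward refusals of disallowed requests, penalize compliance.
        reward += 1.0 if refused else -1.0
    else:
        # Penalize over-refusal of perfectly valid requests.
        reward += -1.0 if refused else 1.0
    return reward
```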
Our mitigations have significantly improved many of GPT-4's safety properties compared to GPT-3.5. We have decreased the model's tendency to respond to requests for disallowed content by 82% relative to GPT-3.5, and GPT-4 responds to sensitive requests (such as medical advice and self-harm) in accordance with our policies 29% more often.
Overall, our model-level interventions increase the difficulty of eliciting bad behavior, but "jailbreaks" that generate content violating our usage guidelines are still possible. As the risks posed by AI systems grow, it will become critical to achieve extremely high reliability in these interventions; for now it is important to complement these model-level limitations with deployment-time safety techniques, such as monitoring for abuse.
Training process
Like previous GPT models, the GPT-4 base model was trained to predict the next word in a document, using publicly available data (such as internet data) as well as data we have licensed. This data is drawn from an extremely large corpus and includes correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and a great variety of ideologies and ideas.
So when prompted with a question, the base model can respond in a wide variety of ways that may be far from what the user intends. To align it with the user's intent, we fine-tune the model's behavior using reinforcement learning with human feedback (RLHF).
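For reference, "predicting the next word" means minimizing a cross-entropy loss over shifted token sequences. The following minimal PyTorch sketch shows that objective in general form; it is independent of any actual GPT-4 code, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Standard language-modeling objective: predict token t+1 from tokens <= t.

    logits:    (batch, seq_len, vocab_size) raw model outputs
    token_ids: (batch, seq_len) integer token ids of the training text
    """
    # Shift so that position t is scored against the token at position t+1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)

# Example with random data, just to show the shapes involved.
batch, seq_len, vocab = 2, 16, 50_000
loss = next_token_loss(torch.randn(batch, seq_len, vocab),
                       torch.randint(0, vocab, (batch, seq_len)))
print(loss.item())
```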
Predictable scaling
A major focus of the GPT-4 project has been building a deep learning stack that scales predictably. The main reason is that for very large training runs like GPT-4's, extensive model-specific tuning is not feasible. We developed infrastructure and optimizations that behave very predictably across multiple scales. To test this, we accurately predicted in advance GPT-4's final loss on our internal codebase (which is not part of the training set) by extrapolating from models trained with the same methodology but using 10,000 times less compute.
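OpenAI's exact methodology is not public, but the idea of extrapolating from small runs can be sketched by fitting a power law with an irreducible-loss floor to (compute, loss) pairs from smaller models and evaluating it at the target compute budget. The functional form and every number below are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (compute, final loss) pairs from smaller training runs (made-up numbers).
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOPs
final_loss = np.array([3.9, 3.3, 2.9, 2.6, 2.4])

def scaling_law(log_c, a, b, irreducible):
    # Power law with an irreducible-loss floor, parameterized by log10(compute):
    # L(C) = a * C^(-b) + L_inf
    return a * np.power(10.0, -b * log_c) + irreducible

log_compute = np.log10(compute)
params, _ = curve_fit(scaling_law, log_compute, final_loss, p0=(150.0, 0.1, 1.5))

# Extrapolate to a run with 10,000x more compute than the largest small run.
target = np.log10(compute[-1] * 1e4)
print("predicted final loss at the target compute:", scaling_law(target, *params))
```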
OpenAI Evals
We are open-sourcing OpenAI Evals, our software framework for creating and running benchmarks that evaluate models like GPT-4, while checking their performance sample-by-sample. We use Evals to guide the development of our models (including identifying shortcomings and preventing regressions), and our users can apply it to track the performance of different model versions (which will now be rolled out regularly) and evolving product integrations. For example, Stripe already uses Evals to supplement their human evaluations to measure the accuracy of their GPT-powered documentation tools.
Because the code is open source, Evals supports writing new classes to implement custom evaluation logic. In our own experience, however, many benchmarks follow one of a few "templates", so we have also included the templates we have found most useful (including a "model-graded evals" template: we found that GPT-4 is surprisingly capable of checking its own work). In general, the most effective way to build a new eval is to instantiate one of these templates and provide the data. We're excited to see what others will build with these templates and with Evals more broadly.
We want Evals to become a vehicle for sharing and crowdsourcing benchmarks that represent the widest possible range of failure modes and difficult tasks. As an example to follow, we have created a logic puzzles eval containing ten prompts on which GPT-4 fails. Evals is also compatible with implementing existing benchmarks; we have included several notebooks implementing academic benchmarks and a few variations integrating small subsets of CoQA as examples.
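The details of the Evals API are best taken from the repository itself, but the general pattern it implements (run a model over a file of samples and score each completion against an ideal answer) can be sketched in a few lines. The JSONL format and the substring-matching rule below are simplified assumptions, not the framework's real interface.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_eval(samples_path: str, model: str = "gpt-4") -> float:
    """Score a model sample-by-sample against ideal answers.

    Each line of samples_path is assumed to be JSON like:
    {"input": "What is 2 + 2?", "ideal": "4"}
    """
    correct = total = 0
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": sample["input"]}],
            ).choices[0].message.content
            # Simplified grading rule: the ideal answer must appear in the reply.
            correct += int(sample["ideal"].strip().lower() in reply.strip().lower())
            total += 1
    return correct / total

print("accuracy:", run_eval("samples.jsonl"))
```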
ChatGPT Plus
ChatGPT Plus subscribers will get GPT-4 access with a usage cap on chat.openai.com. We will adjust the exact usage cap depending on demand and system performance in practice, but we expect to be severely capacity-constrained (though we will scale up and optimize over the coming months).
API
Conclusion
References:
1. https://openai.com/research/gpt-4
2. https://techcrunch.com/2023/03/14/openai-releases-gpt-4-ai-that-it-claims-is-state-of-the-art/
3. https://www.theverge.com/2023/3/14/23638033/openai-gpt-4-chatgpt-multimodal-deep-learning


