October 17, 2023 — Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar
We’re open-sourcing Fuyu-8B - a small version of the multimodal model that powers our product.
We’re releasing Fuyu-8B, a small version of the multimodal model that powers our product. The model is available on HuggingFace. We think Fuyu-8B is exciting because of its dramatically simpler architecture and training procedure, which we describe below.
Today, we’re releasing Fuyu-8B with an open license (CC-BY-NC)—we’re excited to see what the community builds on top of it! We also discuss results for Fuyu-Medium (a larger model we’re not releasing) and provide a sneak peek of some capabilities that are exclusive to our internal models.
Because this is a raw model release, we have not added further instruction tuning, postprocessing, or sampling strategies to control for undesirable outputs. You should expect to have to fine-tune the model for your use case.
Adept is building a generally intelligent copilot for knowledge workers. In order to do this, it’s important for us to be able to understand user context and to take actions on behalf of users. Both of those goals rely heavily on image understanding. Users expect what’s visible on their screen to be accessible to the copilot, and important data is often presented most naturally as an image – think charts, slides, PDFs, etc. In order to take actions, we often need to literally click on buttons or scroll through menus. It would be nice if all these actions were doable via API, but much business-relevant software has no API, or only an incomplete one, and controlling software via UIs allows us to keep the user in the loop.
Therefore, we need a model that can understand both images and text. Although a lot of progress is being made on this front, nothing is available that suits our precise needs. Existing multimodal models are complicated, both from an architectural perspective and a training perspective. These complications are a liability when it comes to understanding model behavior, scaling models up, and deploying to users.
On the architecture side, other multimodal models involve a separate image encoder, whose output is connected to an existing LLM either via cross-attention or through some kind of adapter that feeds directly into the LLM’s embedding space. PALM-e, PALI-X, QWEN-VL, LLaVA 1.5, and Flamingo all look more-or-less like this. These models also tend to work at a fixed image resolution: at inference time, any image at a higher resolution must be downsampled, and any image whose aspect ratio doesn’t match must be padded or distorted.
On the training side, other multimodal models tend to have a large number of separate training stages. The image encoder will be trained separately from the LLM on its own tasks, often using a contrastive training objective, which is complicated to implement and reason about. Then, as in e.g. PALI-X, the image encoder and the text decoder (frequently with a bespoke connector network) will be trained together on images at a low resolution for some period of time. At this point, a choice must be made about whether to freeze the weights of each of the components while training. Finally, some models are trained with an extra high-resolution image phase (without which they won’t perform well on high-res images).
When scaling up models, it’s difficult to reason about how to independently scale each of the above components. Should marginal parameters be allocated to the encoder or the decoder? To which of the training steps should we give the next chunk of compute? We’ve instead designed a model without these complications.
Architecturally, Fuyu is a vanilla decoder-only transformer with the same details as Persimmon-8B - there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the normal transformer decoder like an image transformer (albeit with no pooling and causal attention). See the diagram above for more details.
This simplification allows us to support arbitrary image resolutions. To accomplish this, we just treat the sequence of image tokens like the sequence of text tokens. We remove image-specific position embeddings and feed in as many image tokens as necessary in raster-scan order. To tell the model when a line has broken, we simply use a special image-newline character. The model can use its existing position embeddings to reason about different image sizes, and we can use images of arbitrary size at training time, removing the need for separate high and low-resolution training stages.
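To make this concrete, here is a minimal sketch (our own illustration in PyTorch, not Adept’s training code) of the patch-input idea: each image patch is flattened, linearly projected to the model’s hidden size, and a learned image-newline embedding is appended at the end of every patch row before the sequence is handed to the decoder. The patch size and hidden size below are illustrative assumptions.

```python
import torch
import torch.nn as nn

PATCH_SIZE = 30      # pixels per patch side (assumed for illustration)
HIDDEN_SIZE = 4096   # transformer hidden size (assumed for illustration)

patch_proj = nn.Linear(3 * PATCH_SIZE * PATCH_SIZE, HIDDEN_SIZE)  # replaces a separate image encoder
image_newline = nn.Parameter(torch.randn(HIDDEN_SIZE))            # special "image newline" embedding

def image_to_embeddings(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) with H and W divisible by PATCH_SIZE.
    Returns a (num_tokens, HIDDEN_SIZE) sequence in raster-scan order,
    with a newline embedding appended at the end of each patch row."""
    _, h, w = image.shape
    rows, cols = h // PATCH_SIZE, w // PATCH_SIZE
    patches = (
        image.unfold(1, PATCH_SIZE, PATCH_SIZE)   # (3, rows, W, PATCH_SIZE)
             .unfold(2, PATCH_SIZE, PATCH_SIZE)   # (3, rows, cols, PATCH_SIZE, PATCH_SIZE)
             .permute(1, 2, 0, 3, 4)              # (rows, cols, 3, PATCH_SIZE, PATCH_SIZE)
             .reshape(rows, cols, -1)             # flatten each patch
    )
    embeds = patch_proj(patches)                                   # (rows, cols, HIDDEN_SIZE)
    newline = image_newline.expand(rows, 1, HIDDEN_SIZE)
    embeds = torch.cat([embeds, newline], dim=1)                   # one newline token per row
    # These embeddings are simply concatenated with ordinary text-token embeddings
    # and fed into the decoder; no image-specific position embeddings are needed.
    return embeds.reshape(rows * (cols + 1), HIDDEN_SIZE)
```

Because the image tokens go through the same decoder stack as text tokens, images of any resolution or aspect ratio just produce a longer or shorter token sequence.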
Together, these changes have dramatically simplified our training and inference experience.
To sanity-check the architectural changes underlying Fuyu-8B, we chose four of the most commonly-used image-understanding datasets: VQAv2, OKVQA, COCO Captions, and AI2D. VQAv2 and OKVQA are natural image question-answering datasets, COCO is a captioning dataset, and AI2D is a multiple-choice dataset involving scientific diagrams. We compare our models to PALM-e, PALI-X, QWEN-VL, and LLaVA 1.5.
The Fuyu models perform well on these benchmarks, even though the benchmarks are heavily focused on natural images rather than the knowledge-worker data we care about. Fuyu-8B improves over QWEN-VL and PALM-e-12B on 2 out of 3 metrics despite having 2B and 4B fewer parameters, respectively. Fuyu-Medium performs comparably to PALM-e-562B despite having fewer than a tenth as many parameters! PALI-X still performs best on these benchmarks, but it is larger and fine-tuned on a per-task basis. Note that, since these benchmarks are not our main focus, we didn’t perform any of the typical optimizations (e.g. non-greedy sampling, extensive per-dataset fine-tuning, etc.).
Eval Task | Fuyu-8B | Fuyu-Medium | LLaVA 1.5 (13.5B) | QWEN-VL (10B) | PALI-X (55B) | PALM-e-12B | PALM-e-562B |
---|---|---|---|---|---|---|---|
VQAv2 | 74.2 | 77.4 | 80 | 79.5 | 86.1 | 76.2 | 80.0 |
OKVQA | 60.6 | 63.1 | n/a | 58.6 | 66.1 | 55.5 | 66.1 |
COCO Captions | 141 | 138 | n/a | n/a | 149 | 135 | 138 |
AI2D | 64.5 | 73.7 | n/a | 62.3 | 81.2 | n/a | n/a |
While interacting with these benchmarks we also noticed serious issues. We’ve developed an in-house eval suite that corresponds more closely to the capabilities we care about, but we thought it was worth elaborating on some of those issues here, given the ubiquity of these benchmarks.
Question Answering Benchmarks
The question-answering datasets are quite flawed - they use a complicated scoring mechanism, require you to respond in a specific format, and are often annotated incorrectly.
Consider the following two images:
For the image on the left from the OKVQA dataset, when asked the question “What instrument is the toy bear playing?”, the model responds “snare”, which is clearly true! However, it gets a score of 0, because all of the reference answers are simply “drum”. Similarly, for the VQAv2 image on the right, when asked “What type of foods are in the image?”, the model correctly responds “fish, carrots”, but it also gets a score of 0 because the reference answer list doesn’t contain those words.
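For reference, here is a simplified sketch of the exact-match voting rule behind the VQAv2/OKVQA accuracy metric (the official evaluator also normalizes articles, punctuation, and number words, and averages over subsets of the 10 human annotators); it shows why an arguably correct answer like “snare” scores 0 when every reference answer is “drum”.

```python
# Simplified version of the VQA accuracy rule: an answer gets full credit only if
# at least 3 of the human annotators gave exactly the same answer.
def vqa_accuracy(prediction: str, reference_answers: list[str]) -> float:
    matches = sum(prediction.strip().lower() == ref.strip().lower()
                  for ref in reference_answers)
    return min(matches / 3.0, 1.0)

# "snare" never appears among the references, so the (arguably correct) answer scores 0.
print(vqa_accuracy("snare", ["drum"] * 10))  # 0.0
print(vqa_accuracy("drum", ["drum"] * 10))   # 1.0
```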
Captioning Benchmarks
It’s also common to evaluate image models using the COCO Captions benchmark. The score used for this benchmark (CIDEr) is based on n-gram similarity to a group of reference captions, which are often poor. We haven’t found that performance on this benchmark corresponds particularly well to our internal evaluations. In fact, Fuyu-Medium scores slightly worse on this metric than Fuyu-8B!
For the image below, our model gives the caption “A nighttime view of Big Ben and the Houses of Parliament.” This is correct, but it gets a score of 0.4 because it doesn’t match any of the reference captions (a good score is over 100).
The Fuyu models have several cool capabilities that we preview here, including chart, diagram, and document understanding.
Since our product is geared towards assisting knowledge workers, it’s important for our model to be able to understand charts and diagrams. Here are some examples.
Fuyu can understand complex visual relationships, such as in the below chart, where it has to trace connections between actors and shows and count them to answer the question.
It can also answer nontrivial, multi-hop questions given traditional charts.
Fuyu can also understand documents — both complex infographics and old PDFs:
Finally, the model can understand complex relational queries about scientific diagrams:
Our internal models (based on Fuyu) have extra capabilities related to our product. Since these capabilities are built on the Fuyu model class (and underlie our upcoming product release), we thought it would be interesting to preview some of them.
We’ve trained our internal models to perform the following two tasks given an image of a UI:

1) `bbox_to_text`: given the coordinates of a bounding box on the screen, predict the text it contains.
2) `text_to_bbox`: given a piece of text, predict the bounding box where that text appears.

Consider the following 1920x1080 image from one of our validation sets:
The blue boxes represent bounding box coordinates that have been passed to the model for the `bbox_to_text` task.
For this example, the model correctly predicted the text contents of every blue bounding box.
The red boxes represent predicted bounding boxes and green boxes represent target bounding boxes for the `text_to_bbox` task.
The model is good enough at bounding box prediction that the red and green boxes overlap almost completely.
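As a rough way to quantify that overlap, one can compute the intersection-over-union (IoU) between a predicted box and its target; this helper is our own illustration, not a metric from the post, and assumes boxes given as (x1, y1, x2, y2) in pixel coordinates on the 1920x1080 screenshot.

```python
# Intersection-over-union between a predicted and a target bounding box.
def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Predicted (red) and target (green) boxes that "overlap almost completely"
# correspond to IoU values near 1.0.
print(iou((100, 200, 300, 240), (102, 198, 301, 241)))  # ~0.92
```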
The model can also locate things on the screen based on informal text commands, as well as answer detailed factual questions about the contents of UIs:
Or consider the example below, where the model can interact with Google Maps to correctly answer questions.
—
Both the model weights and some example code are on HuggingFace. We look forward to seeing what you build with it, and please reach out if you have any questions. Stay tuned for more on our product alpha, which will incorporate these and other changes and is coming soon!
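If you want to try it, a minimal usage sketch might look like the following. It assumes a recent transformers release with the Fuyu integration (FuyuProcessor / FuyuForCausalLM), a CUDA GPU, and a local image file; prompt wording and generation settings are illustrative.

```python
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0")

image = Image.open("screenshot.png")   # any chart, document, or UI screenshot
prompt = "Generate a coco-style caption.\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=32)

# Keep only the newly generated tokens (generate() returns prompt + completion).
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```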
—
If you use this model in your work, please use the following BibTeX citation:
@misc{fuyu-8b,
author = {Bavishi, Rohan and Elsen, Erich and Hawthorne, Curtis and Nye, Maxwell and Odena, Augustus and Somani, Arushi and Ta\c{s}\i{}rlar, Sa\u{g}nak},
title = {Introducing our Multimodal Models},
url = {https://www.adept.ai/blog/fuyu-8b},
year = {2023}
}
—