Meta Trains AI To Self-Learn By Observing Speech, Vision & Text—Like Humans

Photo 183881120 / Ai © Alexey Novikov


For artificial intelligence (AI) to achieve the feats that it does, hours upon hours of training must happen in the background.

Often, that training depends on manual work. Algorithms learn to recognize objects, say, from millions of examples that come with human-made labels. Or they learn conversation from transcribed text, as explained by TechCrunch.

However, for building next-generation AI capable of things that have never been done before, this way of learning is becoming outdated. Manually producing labeled diagrams, for example, to create huge training datasets isn’t efficient.

So researchers at Meta, previously Facebook, are building something a little more befitting of “next-gen AI.”

It is a model that can learn on its own from a variety of media, including speech, text, and images.

The framework, called data2vec, doesn’t just predict “modality-specific targets” such as words, visuals, or “units of human speech.” Instead, it predicts “representations of the input data, regardless of the modality,” removing the limits of just predicting either a word or an image.
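To make "predicting representations" a little more concrete, here is a toy sketch in Python. It is not Meta's code. It assumes a teacher-student setup in which a "student" sees a partly hidden input and tries to match the representations that a "teacher" computes from the full input, with the teacher's weights slowly following the student's. The three-number filter standing in for a neural network, the function names, the squared-error loss, and the learning rate are all simplifications chosen for illustration. The input is just a list of numbers, so the same code would apply whether those numbers came from audio, pixels, or text.

```python
import random


def encode(weights, signal):
    """Toy encoder (assumption): a 3-tap filter, one representation per position."""
    padded = [0.0] + list(signal) + [0.0]
    return [
        sum(w * padded[i + k] for k, w in enumerate(weights))
        for i in range(len(signal))
    ]


def mask_signal(signal, masked_positions):
    """Hide the chosen positions from the student by zeroing them out."""
    return [0.0 if i in masked_positions else x for i, x in enumerate(signal)]


def regression_loss(student_reps, teacher_reps, masked_positions):
    """Mean squared error between representations, at masked positions only."""
    errors = [(student_reps[i] - teacher_reps[i]) ** 2 for i in masked_positions]
    return sum(errors) / len(errors)


def student_step(student_w, teacher_w, signal, masked_positions, lr=0.05):
    """One gradient step pulling the student's outputs toward the teacher's."""
    masked = mask_signal(signal, masked_positions)
    padded = [0.0] + masked + [0.0]
    targets = encode(teacher_w, signal)  # teacher sees the full input
    preds = encode(student_w, masked)    # student sees the masked input
    grads = [0.0] * len(student_w)
    for i in masked_positions:
        err = preds[i] - targets[i]
        for k in range(len(student_w)):
            grads[k] += 2 * err * padded[i + k] / len(masked_positions)
    new_w = [w - lr * g for w, g in zip(student_w, grads)]
    return new_w, regression_loss(preds, targets, masked_positions)


def ema_update(teacher_w, student_w, decay=0.99):
    """Teacher weights follow the student as an exponential moving average."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher_w, student_w)]


if __name__ == "__main__":
    random.seed(0)
    student = [0.1, 0.5, 0.1]
    teacher = list(student)  # teacher starts as a copy of the student
    signal = [random.random() for _ in range(16)]  # stand-in for any modality
    for step in range(5):
        hidden = set(random.sample(range(len(signal)), 4))
        student, loss = student_step(student, teacher, signal, hidden)
        teacher = ema_update(teacher, student)
        print(f"step {step}: loss on masked positions = {loss:.4f}")
```

Nothing in the loop refers to words, pixels, or sounds: the target is always the teacher's representation, which is what lets one recipe cover several kinds of input.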

For example, the AI might be given some books or images to learn from, and it would be able to learn from any of them instead of being limited to one or the other.

It’s also much closer to the way humans learn something: drawing from different sources to build a bigger, fuller picture of the concept, rather than solely relying on one type of information. 

“The core idea of this approach is to learn more generally: AI should be able to learn to do many different tasks, including those that are entirely unfamiliar,” the developers write in a blog post.


“Self-supervision enables computers to learn about the world just by observing it and then figuring out the structure of images, speech, or text. Having machines that don’t need to be explicitly taught to classify images or understand spoken language is simply much more scalable.”

“People experience the world through a combination of sight, sound and words, and systems like this could one day understand the world the way we do,” Meta CEO Mark Zuckerberg commented in a Facebook post.

Open-source code for data2vec has been made available, along with a few pre-trained models.


We created data2vec, the first general high-performance self-supervised algorithm for speech, vision, and text. When applied to different modalities, it matches or outperforms the best self-supervised algorithms. Read more and get the code:

— Meta AI (@MetaAI) January 20, 2022



[via TechCrunch and Meta, cover image via Alexey Novikov]
