Meta and Google researchers’ new data curation method could transform self-supervised learning

May 31, 2024 | Technology


As AI researchers and companies race to train bigger and better machine learning models, curating suitable datasets is becoming a growing challenge.

To solve this problem, researchers from Meta AI, Google, INRIA, and Université Paris Saclay have introduced a new technique for automatically curating high-quality datasets for self-supervised learning (SSL). 

Their method uses embedding models and clustering algorithms to curate large, diverse, and balanced datasets without the need for manual annotation. 
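While the published pipeline has more moving parts, the core idea can be sketched in a few lines: embed every example with a pretrained encoder, then group the embeddings into clusters that stand in for concepts. The sketch below is a minimal illustration assuming a generic encoder and scikit-learn's k-means; the function name, cluster count, and flat (non-hierarchical) clustering are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of embedding-based curation, assuming embeddings have
# already been computed by some pretrained encoder. Cluster count and the
# use of plain k-means are illustrative choices, not the paper's exact setup.
import numpy as np
from sklearn.cluster import KMeans

def cluster_concepts(embeddings: np.ndarray, n_clusters: int = 1000) -> np.ndarray:
    """Group raw-pool embeddings so each cluster approximates one 'concept'.

    Returns one cluster id per example; these ids can then drive balanced
    sampling downstream, with no manual annotation involved.
    """
    km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
    return km.fit_predict(embeddings)
```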

Balanced datasets in self-supervised learning

Self-supervised learning has become a cornerstone of modern AI, powering large language models, visual encoders, and even domain-specific applications like medical imaging.


Unlike supervised learning, which requires every training example to be annotated, SSL trains models on unlabeled data, making it possible to scale both models and datasets directly on raw data.
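To make the contrast concrete, here is a minimal sketch of one common self-supervised objective, a contrastive loss in PyTorch, where the training signal comes entirely from the data itself (two augmented views of the same unlabeled example) rather than from human labels. The shapes, temperature, and loss variant are illustrative assumptions, not details from the paper.

```python
# A minimal self-supervised (contrastive) objective: no labels anywhere.
# Each view's positive target is the other augmented view of the same
# unlabeled example; the rest of the batch serves as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two augmented views."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)

# Usage: z1 = encoder(augment(x)); z2 = encoder(augment(x)) for an
# unlabeled batch x — the supervision is manufactured from the data itself.
```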

However, data quality is crucial to the performance of SSL models, and datasets assembled randomly from the internet are far from evenly distributed across concepts.

This means that a few dominant concepts take up a large portion of the dataset while others appear less frequently. This skewed distribution can bias the model toward the frequent concepts and prevent it from generalizing to unseen examples.

“Datasets for self-supervised learning should be large, diverse, and balanced,” the researchers write. “Data curation for SSL thus involves building datasets with all these properties. We propose to build such datasets by selecting balanced subsets of large online data repositories.”
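The "balanced subsets" idea can be illustrated with a simple sampling rule: once examples are assigned to clusters, draw roughly the same number from each cluster so that frequent concepts stop dominating. The per-cluster cap below is a hypothetical parameter for illustration; the paper's actual selection procedure may differ.

```python
# A hedged sketch of balanced subset selection over cluster assignments.
# The per-cluster cap is an assumed knob, not the authors' exact procedure.
import numpy as np

def balanced_subset(cluster_ids: np.ndarray, per_cluster: int,
                    seed: int = 0) -> np.ndarray:
    """Return indices sampling at most `per_cluster` items from each cluster."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        take = min(per_cluster, members.size)
        keep.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(keep)
```

Capping each cluster rather than sampling proportionally is what flattens the skewed distribution: rare concepts keep all their examples while dominant ones are downsampled.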

Currently, much manual effort goes into curating balanced datasets for SSL. While not as time-consuming as labeling every training example, manual curation is still a bottleneck …


