DataComp-LM: In search of the next generation of training sets for language models
BibTeX
Copy
@comment{
  Cleaned auto-exported entry: the paper appeared at NeurIPS 2024, so the type is
  @inproceedings with a booktitle, not @article with journal = {ArXiv}. The arXiv
  identifier is stored in the dedicated eprint fields instead of being stuffed into
  journal/volume. Acronyms in the title are brace-protected against style recasing.
}
@inproceedings{Li2024DataCompLMIS,
  author        = {Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and S. Gadre and Hritik Bansal and E. Guha and Sedrick Scott Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean-Pierre Mercat and Mayee Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton and Marianna Nezhurina and Amro Abbas and Cheng-Yu Hsieh and Dhruba Ghosh and Josh Gardner and Maciej Kilian and Hanlin Zhang and Rulin Shao and Sarah Pratt and Sunny Sanyal and Gabriel Ilharco and Giannis Daras and Kalyani Marathe and Aaron Gokaslan and Jieyu Zhang and K. Chandu and Thao Nguyen and Igor Vasiljevic and S. Kakade and Shuran Song and Sujay Sanghavi and Fartash Faghri and Sewoong Oh and Luke S. Zettlemoyer and Kyle Lo and Alaaeldin El-Nouby and Hadi Pouransari and Alexander Toshev and Stephanie Wang and Dirk Groeneveld and Luca Soldaini and Pang Wei Koh and J. Jitsev and Thomas Kollar and Alexandros G. Dimakis and Y. Carmon and Achal Dave and Ludwig Schmidt and Vaishaal Shankar},
  title         = {{DataComp-LM}: In Search of the Next Generation of Training Sets for Language Models},
  booktitle     = {Advances in Neural Information Processing Systems},
  year          = {2024},
  eprint        = {2406.11794},
  archiveprefix = {arXiv},
  primaryclass  = {cs.LG},
  internal-note = {Scrape had "Luca Soldani"; corrected to Soldaini -- verify against the arXiv author list. Several given names remain abbreviated (S. Gadre, E. Guha, ...) as exported; expand if full names are confirmed.},
}
Transform this paper into an audio lecture
Get an engaging lecture in a Q&A format to quickly understand the paper in minutes — perfect for learning on the go.