arXiv — Open Access Scientific Research

Attention Is All You Need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (2017)

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable.

12,847 citations

cs.CL

1706.03762

10.48550/arXiv.1706.03762

Scaling Laws for Neural Language Models

Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei (2020)

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.

4,291 citations

cs.LG

2001.08361

10.48550/arXiv.2001.08361

Constitutional AI: Harmlessness from AI Feedback

Bai, Kadavath, Kundu, Askell, Kernion, Jones, Chen, Goldie, Mirhoseini, McKinnon, et al. (2022)

We experiment with methods for training a harmless AI assistant through a process we call Constitutional AI. The method involves both a supervised learning phase and a reinforcement learning from AI feedback (RLAIF) phase, using a set of principles to guide the model.

2,103 citations

cs.AI

2212.08073

10.48550/arXiv.2212.08073

Denoising Diffusion Probabilistic Models

Ho, Jain, Abbeel (2020)

We present high-quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion models and denoising score matching.

8,412 citations

cs.LG

2006.11239

10.48550/arXiv.2006.11239

Language Models are Few-Shot Learners

Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, et al. (2020)

We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. We train GPT-3, an autoregressive language model with 175 billion parameters.

15,230 citations

cs.CL

2005.14165

10.48550/arXiv.2005.14165

Deep Residual Learning for Image Recognition

He, Zhang, Ren, Sun (2015)

We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

21,547 citations

cs.CV

1512.03385

10.48550/arXiv.1512.03385