Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned, in which Anthropic finds that RLHF-trained models get harder to red team as they scale, and releases a red teaming dataset. A little more clarity from Richard about why and how to worry about alignment from a technical perspective. Anthropic’s SoLU Interpretability Paper, with more intense math than last time but I’m managing. Set Sail for Fail, a really long nintil post about worrying about AI risk, with some hands tied behind his back. Jack Clark’s tweet thread on AI Policy. A Mechanistic Interpretability Analysis of Grokking from Neel Nanda, who I made explain Fourier bases to me.
LLM.int8() and Emergent Features, basically: we should open the box (weights) first and then think about solutions. Efficient Training of Language Models to Fill in the Middle, a teaser of how we can get more out of our data. Language Model Cascades, a framework for thinking about and building inference(ish)-time model-thought-paths. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, in which models generalize after overfitting (phase changes?). Language models show human-like content effects on reasoning, a really nice DeepMind paper that captures how language models are flawed in the same ways humans are, referencing 1970s psych papers. RHO-LOSS for better data selection. Salesforce Code RL paper.
José Luis Ricón Fernández de la Puente’s O-1 visas post, at last. Someone explained Jane Street in a way that didn’t make me sad. wtf someone dislikes food. Aella on Learning the Elite Class. Scratch is a Big Deal. Someone captured a lot of the important aspects of Miyazaki films, time for me to start rewatching.
A Vox article about EA, don’t think I’ve ever related so much to a journalist. The Times profile on MacAskill was such a throwback. Kelsey Piper on the divides in AI Safety: epistemics and moral philosophy are usually stronger dividers than cause area. The Elon <> EA story finally came out and it’s sad but dramatic? The CHIPS and Science Act, a roughly $280B plan for semiconductor manufacturing and scientific research.