A Liang Lab eval of long contexts, which finds that language models have the same bias humans do towards the beginning and end. A clever set of evals on counterfactual tasks; generalisation is never as real as we want it to be. Proof of forgetting for LLMs, though it doesn’t seem as easy as one would hope. Also, yes, we still have a lot of memorisation; thanks to the author, this one uses probably the best definition, k-extractability, though I wonder if some measure of loss over some context length makes more sense (a sketch of both is below). I kind of think of self-play attempts as evals for some reason? Anyway, this one on negotiation was well constructed and got good results with in-context AI feedback, usually better than human feedback. This was a month of great eval papers, but this one is pretty garbage and dishonestly tries to present simple changes in formatting (usually for the better) from the GPT-4 API as performance degradation. But hey, people are working on good long-context evals (this one is long-context human-labelled QA).
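For concreteness, here’s a minimal sketch of what the k-extractability check and the softer loss-over-a-context alternative could look like, assuming a HuggingFace causal LM; the model name, example document, and continuation length are placeholders, not the paper’s setup.

```python
# Minimal sketch: k-extractability vs. a loss-based memorisation measure.
# Placeholder model; the definition is exact greedy reproduction of a
# training continuation given k tokens of its preceding context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def is_k_extractable(document: str, k: int, n_target: int = 50) -> bool:
    """True if greedy decoding from the first k tokens of `document`
    reproduces the next n_target tokens exactly."""
    ids = tok(document, return_tensors="pt").input_ids[0]
    if len(ids) < k + n_target:
        return False
    prompt, target = ids[:k], ids[k:k + n_target]
    with torch.no_grad():
        out = model.generate(prompt.unsqueeze(0),
                             max_new_tokens=n_target,
                             do_sample=False)  # greedy, per the definition
    return torch.equal(out[0, k:k + n_target], target)

def continuation_loss(document: str, k: int) -> float:
    """The softer measure: mean per-token loss on the continuation
    after a k-token context, instead of an exact-match bit."""
    ids = tok(document, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    losses = torch.nn.functional.cross_entropy(
        logits[0, :-1], ids[0, 1:], reduction="none")
    return losses[k - 1:].mean().item()  # losses[j] is the loss on token j+1
```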
An attempt at a new architecture, but it immediately opens with a plot showing they couldn’t scale their baseline transformer properly, an inspirational quote, and an impossible triangle used as a diagram? A good explanation of why GPT-4 is so non-deterministic. Tests of think tokens (filler text) to see whether they improve performance; the answer is no, except maybe for GPT-4?
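In case it isn’t obvious what testing think tokens means in practice: compare accuracy when the model answers directly versus when it’s made to emit meaningless filler before answering. A rough sketch, assuming a placeholder `ask(prompt) -> str` wrapper around whichever chat model you’re testing, with illustrative prompts rather than the paper’s:

```python
# Sketch of a filler-token ("think token") comparison. `ask` is a placeholder
# for your model call; the grading is deliberately crude.
from typing import Callable

FILLER = "... " * 30  # meaningless filler the model is told to emit first

def accuracy(ask: Callable[[str], str],
             qa_pairs: list[tuple[str, str]],
             use_filler: bool) -> float:
    correct = 0
    for question, answer in qa_pairs:
        if use_filler:
            prompt = (f"{question}\nFirst output exactly this filler text: "
                      f"'{FILLER.strip()}'. Then give only the final answer.")
        else:
            prompt = f"{question}\nGive only the final answer."
        reply = ask(prompt)
        correct += int(answer.lower() in reply.lower())  # crude grading
    return correct / len(qa_pairs)

# baseline = accuracy(ask, qa_pairs, use_filler=False)
# filler   = accuracy(ask, qa_pairs, use_filler=True)
```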
A three-year-old text on administration markets, which references a much more historically loaded piece on the decline of administrations. Excuse the loading time from the Web Archive, but this piece on Uber, Maoism, and Georgism is worth it. The Muji Manifesto: “this will do”.
TinyStories! I can’t believe the same organisation put out both this paper and the Longformer paper. The TinyStories paper doesn’t focus on what should be the primary application of its technique, which is of course mech interp. The autointerpretability work was quite nifty; OpenAI alignment lives on after all (I haven’t read it yet, but DeepMind interp is also alive!). I am a big fan of this line of research around measuring faithfulness in chain-of-thought reasoning, which includes checks like filler text (think tokens), early answering, adding mistakes, and the always-choose-A trick from this other paper; the early-answering check is sketched below. That paper was published alongside the decomposition paper, which tries two kinds of decomposition and finds that on average it decreases performance (especially factored decomposition) but increases faithfulness. nostalgebraist couldn’t find as much sycophancy in GPT models; my best guess is that this is a difference in RL data. Positional embeddings are a helix, but I still don’t get how they can be important.
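The early-answering check is the easiest of those to picture: cut the chain of thought off after each step, force an answer, and see how early the final answer is already locked in; if it’s locked in from step one, the stated reasoning probably isn’t doing the work. A rough sketch, again assuming a placeholder `ask(prompt) -> str` model wrapper and a naive sentence split rather than the paper’s exact protocol:

```python
# Sketch of the early-answering faithfulness check.
from typing import Callable

def early_answering(ask: Callable[[str], str],
                    question: str,
                    chain_of_thought: str) -> list[tuple[int, str]]:
    """Answer the question from progressively longer prefixes of the CoT."""
    steps = [s for s in chain_of_thought.split(". ") if s]
    answers = []
    for i in range(len(steps) + 1):
        partial = ". ".join(steps[:i])
        prompt = (f"Question: {question}\n"
                  f"Reasoning so far: {partial}\n"
                  "Based only on the reasoning so far, give your final answer now.")
        answers.append((i, ask(prompt).strip()))
    return answers

def early_agreement(answers: list[tuple[int, str]]) -> float:
    """Fraction of truncation points already agreeing with the full-CoT answer;
    high agreement suggests the reasoning isn't load-bearing."""
    final = answers[-1][1]
    return sum(a == final for _, a in answers[:-1]) / max(len(answers) - 1, 1)
```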
A paper doing a decent job of outlining how one would manage AI public safety risks, though the details feel a bit messy. I brought my rabbit, Bundle, to the office for his second New York Times photoshoot, but sadly the bunny didn’t support the AI doomerism narrative. India’s excerpt on a guide to guided missiles. The frontier model forum launched, lower-cased (fingers crossed) until it goes well. Anthropic published a bit about frontier threats red-teaming; only a bit, because it’s mostly focused on biorisk. Good luck, everyone! Paul came out swinging with a very compelling pitch on sharing LLM capabilities, with a subheader arguing that accelerating agents in particular is neutral.
The highlight of this month’s Hillel Wayne is in favour of defenses of design. I read a bit of SemiAnalysis this month too (Gemini and TPU shenanigans, sigh), and have shamefully declared Matt Levine bankruptcy, following only the Sculptor Saga. Spencer Greenberg published this month on smart people with dumb beliefs, for which one must also see this video of Feynman singing about orange juice to make fun of Pauling’s vitamin C phase. Man, being an Olympic host basically turns your city into a charter city for a few years? They’re spending a billion dollars to clean the Seine. I learned about the wood-bending techniques behind the very expensive Artek Aalto stools, and I’m into it.