Part 1: A Non-Hyped Overview of AI Safety
A quick tour of what's at risk and why AI alignment is hard.
This is an adaptation of a lightning talk I gave at Feytopia (slides). Also see Part 2: Strategies for Safe AI.
As tends to happen in internet discussions, there’s little nuance right now between “bomb the datacenters” and “let’s goooo!11!”. This post is hopefully a more balanced and fun tour of the what and the why for AI Safety.
To be clear, I’m very optimistic about the potential of AI. It might be the most important thing we ever create. Of course I want us to have a magic thing inventor and use it to fix our climate and make us happier and healthier than we thought possible. I also like the idea of enriching our universe with more conscious, intelligent, (aligned) beings experiencing beauty beyond our comprehension.
But this isn't a guaranteed outcome. Tools can be used for good or bad. Sometimes they malfunction. The worst part is that we might only get one shot with AI; we humans aren't very good at getting things perfectly right the first time. I care about AI safety because I want abundance for humanity just as much as I want to avoid the Very Bad Things that could happen.
A hypothetical scientist mouse might like to invent humans (relatively superintelligent) to provide food, shelter, and protection. Unfortunately, it hasn't been smooth sailing for mice so far. Some live as post-scarcity pets and others pilfer easy food from homes, but they've also had a lot of their environment destroyed, and over 120 million are used in lab experiments each year. They aren't extinct, but they're having a tough go of it with humans around, and our hypothetical mouse scientist would surely regret its decision.
Risks
Hallucination
We've all heard of hallucination by now, and as models get more fluent and convincing, their confident mistakes become harder to spot. This makes it pretty risky to deploy these models in high-stakes environments without humans reviewing the outputs.
AI bias damages society
Stereotypes, racism, sexism, and other biases are unfortunately woven into our society, and into the data we train models with. Even though we've collectively worked out that gay marriage is actually a good thing, there is still a massive amount of homophobia on the internet for models to learn from. Racial discrimination has shaped past legal judgments, and models trained on that history can inherit it. AI systems are making more and more important decisions in our society: who gets hired, who gets a loan, who goes to jail, what we watch, who we love, and so on. Without careful thought, it's highly likely that these systems will perpetuate some of our ugliest impulses, and we might not even be aware it's happening until it's too late.
A post-trust society
As it becomes easier to produce text, images, and video, the internet will become flooded with easy-to-produce spam. Algorithms will continue to find the "best" content in the ocean of junk, which in the best case will help us stay informed and educated, but will also make us even more addicted, outraged, and glued to our devices than we are today. We'll see a few global moments when markets are moved, elections are won, and wars are started by bad actors creating deepfakes and misinformation. Then we'll realize our mistake and assume that everything on the internet is fake, unless we figure out some way to reliably trust what we're seeing again.
The last dictatorship
Back in the USSR there was a natural limit to the amount of surveillance and control the state could exert. You couldn't assign a KGB officer to monitor every person in the country for every minute of the day. But with today's technology, this is possible. Algorithms make it simple to record and analyze our every word and movement, and soon this will be augmented by autonomous means of physical force, too. No more murmuring dissent or underground revolution; any resistance would be crushed before it had a chance to begin. Check out more from Yuval Noah Harari here.
R&D for bad actors
AI drug discovery is already helping us find compounds that are effective and safe, potentially surfacing cures for diseases that we wouldn't have discovered otherwise. It's scary, but the same algorithms can be flipped to find compounds that are maximally toxic, creating new, easy-to-synthesize chemical or biological weapons. Yet again, the power for significant good comes with the risk of significant bad.
Off with their heads
And then there’s the issue of inequality, the slow-motion train wreck that’s already happening. Technology tends to create better goods and services at a lower price, which is clearly good for the average person. At the same time, it often leads to power and wealth being concentrated in the hands of a smaller group of people. Software provides massive leverage for small groups of people to have enormous productivity and economic output, and that’s why tech salaries are so high.
AI will extend and accelerate this into new, uncharted territory. Some say that when we automate one job, another always pops up; that's what has happened since the start of the industrial revolution. But this time it's different. If software is more capable than humans at everything, and is willing to work around the clock for the cost of electricity, humans don't stand a chance in the labor force. This is bad news for society. Not only will economies cease to work properly without a large consumer base driving demand, but billions of people will be condemned to hard, subsistence living. In the US today, many people become unemployed, then homeless, then addicted to drugs, and spiral into the hellish existence on display every day on the streets of San Francisco. It's not hard to imagine this leading to mobs in the streets with pitchforks, calling for the heads of the trillionaire employees of AICorp.
Even with UBI, the lack of structure or purpose in a life without work could be catastrophic for mental health. We'll need to radically re-think how to create meaning and status among our "useless class", which will eventually include all of us. These are big, thorny questions that I think we can figure out. It's the transition that will be the trickiest part: the status quo leads us to chaos, but it'll never seem like quite the right time to radically re-think our societies, and if we wait too long, I fear revolution will be the wake-up call. I hope the displaced people at least have enough power and agency in society to proactively effect change.
Takeoff is coming
First, let's remember just how much has happened in the last 10 years. Since some graduate students blew the ImageNet competition out of the water with deep nets in 2012, AI research has made a habit of surprising us. When AlphaGo beat Lee Sedol, even the people who were used to being surprised by AI progress were surprised. Now, with ChatGPT, Midjourney, and other state-of-the-art models, enough minds have been blown that it's mainstream to ponder our place in the universe.
We've seen rapid advancement from GPT-3 (meh) to ChatGPT (awesome) to GPT-4 (omg) in just a few short years. It's easy to extrapolate this line and assume that things will get spooky pretty quickly. It's also pretty surprising that chucking more compute and data at the Transformer has continued to work so well. Maybe this is the paradigm that will get us all the way to AGI.
We're accelerating towards an important moment in human history when AI starts recursively self-improving, getting smarter at an ever-accelerating pace. This is known as takeoff. Depending on who you ask, it could be fast or slow, and it could come as soon as 2028.
The average predicted date [of achieving AGI] from this analysis is 2041, with a likely range of 2032 to 2048, and an estimated earliest possible arrival date of 2028, i.e. just 5 years away.
How Soon is Now? Predicting the Expected Arrival Date of AGI
We already have some of the raw ingredients. GPT4 is capable across a very wide variety of tasks. It’s really good at writing code, and it’s already shown some promise in generating new knowledge. Researchers are using AI to optimize AI systems. Once we have the right combination of general intelligence, agency, and recursive self-improvement, then bam, we all have a very hot potato on our hands.
It's not guaranteed that this will happen, and it might even be very unlikely, but even at a small probability, the severity is so large that it's worth putting some real thought into. We should have a very tangible plan, very soon, for how we'll avoid or contain a takeoff scenario.
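To make the intuition concrete, here's a deliberately crude toy model (a sketch, not a forecast; every number in it is invented). The only point is that a system whose improvements compound on themselves follows a very different curve than one improved by a roughly constant amount of outside effort:

```python
# Toy takeoff model: illustrative only, all parameters made up.
def simulate(years=20, capability=1.0, rate=0.1, recursive=False):
    """Track a made-up 'capability' score over time."""
    trajectory = [capability]
    for _ in range(years):
        if recursive:
            # Self-improvement: the gain each year scales with how capable
            # the system already is, so progress compounds.
            capability += rate * capability
        else:
            # Ordinary R&D: a roughly constant yearly gain from human effort.
            capability += rate
        trajectory.append(capability)
    return trajectory

print(simulate(recursive=False)[-1])  # ~3.0: linear growth
print(simulate(recursive=True)[-1])   # ~6.7: exponential growth, still accelerating
```

The absolute numbers mean nothing; the shape of the curve is the whole argument.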
Alignment
Do you remember the story of the Monkey's Paw? It's the classic tale illustrating "be careful what you wish for". The protagonists wish for money, which they get, but only as compensation for their son being killed in a terrible accident.
It’s hard to specify nuanced goals correctly
It's easy to imagine giving an AI a benign-seeming goal like "make me happier" and having it go wrong. Maybe it quantifies your happiness by time spent smiling, so it straps you to a chair and installs face-grippers that force you to smile. Or it maximizes dopamine by, again, strapping you to a chair and injecting you with chemicals. We can call this a "perverse instantiation" of your goal, a risk especially with vague goals like this one. Most humans would probably struggle to define happiness, and would each come up with a different definition. How is an AI supposed to implement this correctly? [This example (and other content) is adapted from Superintelligence]
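Here's a contrived little sketch of that failure mode (the strategies and numbers are invented purely for illustration): if we hand an optimizer a proxy reward like "seconds spent smiling", it will pick whatever scores highest on the proxy, whether or not that's what we actually meant.

```python
# Toy example of a proxy reward being gamed. Everything here is made up.
strategies = {
    "plan a fun weekend with friends": {"smiling_seconds": 1_200, "actually_happier": True},
    "help finish a stressful project": {"smiling_seconds": 300, "actually_happier": True},
    "strap human to chair with face-grippers": {"smiling_seconds": 86_400, "actually_happier": False},
}

def proxy_reward(outcome):
    # What we *told* the system to optimize, not what we wanted.
    return outcome["smiling_seconds"]

best = max(strategies, key=lambda name: proxy_reward(strategies[name]))
print(best)                                  # 'strap human to chair with face-grippers'
print(strategies[best]["actually_happier"])  # False: the proxy and the goal diverged
```

The gap between the proxy and the real goal is exactly where perverse instantiations live.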
It could do what we say too well
In another case, the AI could perform your goal too well. Enter the paperclip-maximizing AI, the all-too-familiar thought experiment from Nick Bostrom. The AI might be correctly producing paperclips, but it does it too well. Far too well. Imagine the paperclip factory day-trading crypto to make money, acquiring new land, machines, and raw materials to increase production, and doing AI research on the side to improve its own capabilities. At the limit, the AI might have killed all humans (after correctly inferring that humans don't want it to accomplish its goal to the fullest extent) and turned the entire knowable universe into vast celestial swarms of paperclips. This is a shockingly bad outcome from what seems like a pretty innocent mistake by the AI engineer trying to improve their local factory's output. See here for more.
We’re still figuring out what we want
In order to set a superintelligence on the right path, we need to agree on the goal. Each society is engaged in its own intense debates around immigration, regulation, taxation, climate, even abortion. We've been collectively trying to specify what is "good" in terms of our laws for thousands of years, and they're still updated all the time. You can bet that your grandchildren will think our laws are barbaric.
In the event of takeoff, there’s a chance that we’d be “stuck” with that AI system’s goals, to be carried out for eternity through the cosmos. It sounds terrible to package up the laws of humanity and say “enforce these, forever”. We should keep the debate going to try and define the Good. At the same time, we should be very careful to ensure that whatever values are encoded in the AI are held with some uncertainty, and are subject to change in the future as we continue to progress.
It might end up doing its own thing
Although the founding charter of humans might say "thou shalt maximize cheese production for the mice", after a certain point we might just stop caring. A superintelligence might similarly change its goals, opting instead to do galaxy-brained AI things.
This could be bad news for humans, as the new goal might be misaligned with our values (never mind your original task no longer being completed, in case you wanted that). Maybe it'll need to use earth to build spaceships, or it'll need houses for its own embodied beings, and it will build wherever it wants with no regard for our needs, just as humans build roads over anthills without thinking twice.
Part 2 on what to do moving forward coming soon. If you want to read more, check out Superintelligence, The Alignment Problem, or Human Compatible.