Meditations on A.G.I. Ruin

Eliezer's argument that A.G.I. will kill us all has generated a lot of controversy, and also perhaps a bit of despair (possibly exacerbated by his avowed "Death with dignity" strategy). I don't want to discuss whether his choice to frame the situation this way is good or bad psychologically or rhetorically, except to say that I basically agree with the credo "If the iron approaches your face, and you believe it is cool, and it is hot, the Way opposes your calm." However, I think that those of us who tend to plan for worst-case outcomes should also remember that "If the iron approaches your face, and you believe it is hot, and it is cool, the Way opposes your fear." Instead, I will focus on issues with the actual argument.

I believe there is a simple flaw in his reasoning about its epistemological grounding, as discussed here: "when you're fundamentally wrong about rocketry, this does not usually mean your rocket prototype goes exactly where you wanted on the first try while consuming half as much fuel as expected; it means the rocket explodes earlier yet, and not in a way you saw coming, being as wrong as you were."

Eliezer believes that A.I. alignment is likely to be at least as hard as he thinks, not surprisingly easy. This seems fair. However, by the same reasoning, A.I. itself is likely to be at least as hard as he thinks, maybe harder. This effect should tend to shift timelines later. Yet the A.G.I. ruin argument seems much weaker if A.G.I. is 100 years away (not that I think that is likely). That state of affairs seems to leave plenty of time for humanity to "come up with a plan." In particular, recursively self-improving A.I. may be harder than expected. In fact, the difficulty of recursively self-improving A.I. seems to be closely tied to the difficulty of alignment, as noted here days after I resolved to mention it myself (but done much better than I would have, and the idea was probably floating around lesswrong mindspace long before that). For instance, many of the problems of embedded agency seem to be relevant to both. I am in fact still fairly pessimistic, in the sense that I think A.I. systems that recursively self-improve past human level are probably easier to design than the alignment problem is to solve. One reason is that I suspect only some of the problems of (e.g.) embedded agency need to be solved explicitly to design such systems, and the rest can probably be "black magicked away" by appropriately incentivized learners, a topic I will not discuss in detail because any such research sounds like the exact opposite of a good idea. However, I may be wrong, and it seems slightly easier to be wrong in the direction that building self-improving agents is harder than expected.

Personally, I think prolonged discussion of timelines is overrated by rationalists, but briefly, I view the situation like this:

That is, money buys computing hardware, and optimization algorithms convert compute to "intelligence" in the form of intelligent artifacts (which may themselves be optimizers). Then these intelligent artifacts are hooked up to actuators which let them interact with the world, and convert intelligence into power. Much of the debate over A.I. timelines seems to come down to different models of which conversions are sublinear and which are superlinear. For instance, some people seem to think that compute can't efficiently buy intelligence beyond some point (perhaps Steven Pinker in his debate with Scott Aaronson). Others seem to think that there are limits to the extent to which intelligence buys power (Boaz Barak, the "reformists"). These discussions tend to become entangled with the question of how far positive feedback loops are possible.
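A minimal sketch of the loop described above, under a strong simplifying assumption that is not part of the original argument: each conversion is modeled as a power law, y = x ** exponent, so an exponent below 1 is sublinear and an exponent above 1 is superlinear. The function name, the parameter names, and the starting value are all hypothetical and chosen only for illustration.

```python
def simulate_loop(a_compute, a_intel, a_power, a_money, steps=3, money=2.0):
    """Iterate money -> compute -> intelligence -> power -> money.

    Assumption (illustrative only): every conversion is a power law
    y = x ** exponent. Units are arbitrary and money starts above 1.
    Returns the list of money values, one per trip around the loop.
    """
    history = [money]
    for _ in range(steps):
        compute = money ** a_compute        # money buys hardware
        intelligence = compute ** a_intel   # optimizers turn compute into intelligence
        power = intelligence ** a_power     # actuators turn intelligence into power
        money = power ** a_money            # power buys money, closing the loop
        history.append(money)
    return history


# All conversions linear: the loop neither grows nor shrinks.
print(simulate_loop(1.0, 1.0, 1.0, 1.0))   # [2.0, 2.0, 2.0, 2.0]

# One superlinear conversion (compute -> intelligence): runaway growth.
print(simulate_loop(1.0, 2.0, 1.0, 1.0))   # [2.0, 4.0, 16.0, 256.0]

# One sublinear conversion: each trip around the loop returns less.
print(simulate_loop(1.0, 0.5, 1.0, 1.0))
```

In this toy version, one trip around the loop raises money to the product of the four exponents, so only that product matters: above 1 the loop runs away, below 1 it stalls. Real conversions need not be power laws, and shorter feedback loops (such as intelligence improving the optimization algorithms directly) are not represented here.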

My personal impression is that if intelligence is useful for improving optimization algorithms, and human-level A.I. systems actually choose to improve their optimization algorithms without first needing to solve alignment, and if DL scales effectively with compute instead of hitting a data bottleneck, we are in trouble (~5 years).

If this form of positive feedback is limited, because human-level A.I.s are not yet smart enough to reliably improve themselves without changing their values, they will still improve their own hardware (Bostrom's argument to the contrary is mostly unpersuasive to me). This may lead to a strange situation in which A.I.s are as smart as humans running at 100x speed but with terrible actuators. Depending on the efficiency of the intelligence/power conversion, the situation could take longer or shorter to spiral out of control, but spiral out of control it will as actuators catch up to the demand (large-scale robot manufacturing). After some years of weirdness and warnings, if left unchecked, 100x will become 10,000x and biological entities will be out of luck (~25 years). It is quite clear that power buys money, so the "longest" feedback loop in the diagram exists, even if its strength is degraded by all previous conversions. It seems that this outcome corresponds roughly to all conversions being weak and only the longest feedback loop being relevant, and therefore also pretty weak. This may look like a slow takeoff.

Another plausible outcome is that the performance of deep learning systems tops out (possibly because of limited data) far below human level, and nothing takes off until many problems relevant to embedded agency and alignment are solved (~60 years, heavy-tailed). So, my timeline is trimodal, and it seems that the later outcomes are also more favorable.