AI, Machine Learning and the Pandemic | In the Pipeline – Science Magazine

It's not surprising that there have been many intersections of artificial intelligence and machine learning with the current coronavirus pandemic. AI and ML are very hot topics indeed, not least because they hold out the promise of sudden insights that would be hard to obtain by normal means. Sounds like something we're in need of in the current situation, doesn't it? So there have been reports of using these techniques to repurpose known drugs, to sort through virtual compound libraries and to generate new structures, to try to optimize treatment regimes, to recommend antigen types for vaccine development, and no doubt many more.

I've been asked many times over the last few months what I think about all this, and I've written about some of it. I've also written about AI and machine learning in general quite a few times. But let me summarize and add a few more thoughts here.

The biggest point to remember, when talking about AI/ML and drug discovery, is that these techniques will not help you if your big problem is insufficient information. They don't make something from nothing. Instead, they sort through huge piles of Somethings in ways that you don't have the resources or patience to do yourself. That means (first) that you must be very careful about what you feed these computational techniques at the start, because garbage in, garbage out has never been more true than it is with machine learning. Indeed, data curation is a big part of every successful ML effort, for much the same reason that surface preparation is a big part of every successful paint job.
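To make the curation point concrete, here's a minimal sketch of the kind of cleanup pass that precedes any serious modeling run. The file name and columns are hypothetical, purely for illustration:

    import pandas as pd

    # Hypothetical assay export; file and column names are illustrative only.
    df = pd.read_csv("assay_results.csv")  # columns: compound_id, smiles, ic50_nm

    # Drop duplicate structures and rows missing a structure or a measurement.
    df = df.drop_duplicates(subset=["smiles"])
    df = df.dropna(subset=["smiles", "ic50_nm"])

    # Discard physically implausible potencies before a model ever sees them.
    df = df[(df["ic50_nm"] > 0) & (df["ic50_nm"] < 1e6)]

    print(f"{len(df)} curated records remain for modeling")

Real curation goes much further than this (standardizing structures, reconciling assay conditions between labs), but even this much keeps a fair amount of garbage from going in.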

And second, it means that there is a limit on what you can squeeze out of the information you have. What if you've curated everything carefully, and the pile of reliable data still isn't big enough? That's our constant problem in drug research. There are just a lot of things that we don't know, and sometimes we are destined to find out about them very painfully and expensively. Look at that oft-quoted 90% failure rate across clinical trials: is that happening because people are lazy and stupid and enjoy shoveling cash into piles and lighting it on fire? Not quite: it's generally because we keep running into things that we didn't know about. Whoops, turns out Protein XYZ is not as important as we thought in Disease ABC, and the patients don't really get much better. Or whoops, turns out that drugs that target the Protein XYZ pathway also target other things that we had never seen before and that cause toxic effects, and the patients actually get worse. No one would stumble into things like that on purpose. Sometimes, in hindsight, we can see how such things might have been avoided, but often enough it's just One of Those Things, and we add a bit more knowledge to the pile, at great expense.

So when I get asked about things like GPT-3, which has been getting an awful lot of press in recent months, that's my first thought. GPT-3 handles textual information and looks for patterns and fill-in-the-blank opportunities, and for human language applications we have the advantage of being able to feed gigantic amounts of such text into it. Now, not all of that text might be full of accurate information, but it was all written with human purpose and some level of intelligence, and with intent to convey information to its readers, and man, does that ever count for a lot. Compare that to the data we get from scientific observation, which comes straight from the source, as it were, without the benefit of having been run through human brains first. As I've pointed out before, for example, a processing chip or a huge pile of software code may appear dauntingly complex, but they were both designed by humans, and other humans therefore have a huge advantage when it comes to understanding them. Now look at the physical wiring of neurons in a human brain. Hell, look at the wiring in the brain of a fruit fly, or the biochemical pathways involved in gene transcription, or the cellular landscape of the human immune system. They're different, fundamentally different, because a billion years of evolutionary tinkering will give you wondrously strange things that are under no constraints to be understandable to anything.
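For anyone who hasn't seen that fill-in-the-blank machinery up close, here's a minimal sketch of it using GPT-2, the openly released predecessor (GPT-3 itself is only reachable through an API), via the Hugging Face transformers library. The prompt is just an illustration:

    from transformers import pipeline

    # GPT-2 is the smaller, openly available predecessor of GPT-3.
    generator = pipeline("text-generation", model="gpt2")

    # Give it the start of a sentence and let it predict plausible continuations.
    prompt = "The biggest obstacle in drug discovery is"
    for out in generator(prompt, max_length=40, num_return_sequences=3, do_sample=True):
        print(out["generated_text"])

All the model does is predict likely next tokens, over and over. Everything else it appears to do is built on that.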

GPT-3 can be made to do all sorts of fascinating things, if you can find a way to translate your data into something like text. It's the same way that we turn text into vector representations for other computational purposes: you transform your material (if you can) into something that's best suited for the tools you have at hand. A surprising number of things can be text-ified, and we have the further advantage that this process has already been useful for other purposes besides modern-day machine learning. Here, for example, is an earlier version of the program (GPT-2) being used on text representations of folk songs, in order to rearrange them into new folk songs (I suspect that it would be even easier to generate college football fight songs, but perhaps there's not as much demand for those). You can turn images into long text strings, too, and turn the framework loose on them, with interesting results.
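The folk-song and image examples both rest on the same trick: flattening the data into a string of tokens. As a toy illustration (nothing like a production encoding), here's a small random "image" serialized into text that a language model could, in principle, be trained on:

    import numpy as np

    # Map pixel intensities (0-255) onto a 16-letter alphabet.
    alphabet = "abcdefghijklmnop"
    image = np.random.randint(0, 256, size=(8, 8))  # stand-in for real pixel data

    text = "".join(alphabet[p // 16] for p in image.flatten())
    print(text)  # a 64-character "sentence" over a 16-token vocabulary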

But what happens if you feed a pile of (say) DNA sequence information into GPT-3? Will it spit out plausible gene sequences for interesting new kinase enzymes or microtubule-associated proteins? I doubt it. In fact, I doubt it a lot, but I would be very happy to hear about anyone who's tried it. Human writing, images that humans find useful or interesting, and human music already have our fingerprints all over them, but genomic sequences, well . . . they have a funkiness that is all their own. There are things that I'm sure the program could pick out, but I'd like to know how far that extends.
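For anyone who does try it, the usual first move is to chop a sequence into overlapping k-mer "words" so that it resembles a sentence; that's the tokenization used by DNA language models such as DNABERT. A quick sketch:

    def kmers(seq: str, k: int = 6) -> list[str]:
        """Split a DNA sequence into overlapping k-mer 'words'."""
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    # Toy sequence; a real input would be a full coding region.
    seq = "ATGGCGTACGTTAGC"
    print(" ".join(kmers(seq)))
    # -> "ATGGCG TGGCGT GGCGTA ..." a whitespace-separated "sentence"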

And even if it really gets into sequences, it'll hit a wall pretty fast. There's a lot more to a single living cell than its gene sequence; that's one lesson that we should have had beaten into our heads over and over by now. Now consider how much more there is to an entire living organism. I'm all for shoveling in DNA sequences, RNA sequences, protein sequences, three-dimensional protein structures, everything else that we can push in through the textual formatting slot, to see what the technology can make of it. But again, that's only going to take you so far. There are feedback loops, networks of signaling, constantly shifting concentrations and constantly shifting spatial arrangements inside every cell, every tissue, every creature, all interconnected in ways that, let's state again, we have not figured out. There are no doubt important things that can be wrung out of the (still massive) amount of information that we have, and I'm all for finding them. But if you revved up the time machine and sent a bunch of GPT-running hardware (or any other such system) back to 1975 (or 2005, for that matter), it would not have predicted the things about cell biology and disease that we've discovered since then. Those things, with few exceptions, weren't latent in the data we had then. We needed more. We still do.

Apply this to the coronavirus pandemic, and the problems become obvious. We don't know what levels of antibodies (or T cells) are protective, how long such protection might last, and how it might vary among cohorts and individuals. We have been discovering major things about transmissibility by painful experience. We have no good idea about why some people become much sicker than others (once you get past a few major risk factors, age being the main one), or why some organ systems get hit in some patients and not in others. And so very much on. These are the limits of our knowledge, and no AI platform will fill them in for us.

From what I understand, the GPT-3 architecture might already be near its limits, anyway. But there will be more ML programs and better ones, that's for sure. Google, for example, has just published a very interesting paper about using machine learning to improve machine learning algorithms. I suspect that I am not the only old science-fiction fan who thought of this passage from William Gibson's Neuromancer on reading this:

"Autonomy, that's the bugaboo, where your AIs are concerned. My guess, Case, you're going in there to cut the hard-wired shackles that keep this baby from getting any smarter. And I can't see how you'd distinguish, say, between a move the parent company makes, and some move the AI makes on its own, so that's maybe where the confusion comes in." Again the non-laugh. "See, those things, they can work real hard, buy themselves time to write cookbooks or whatever, but the minute, I mean the nanosecond, that one starts figuring out ways to make itself smarter, Turing'll wipe it. . . Every AI ever built has an electromagnetic shotgun wired to its forehead."

We're a long way from the world of Neuromancer, probably a good thing, too, considering how the AIs behave in it. The best programs that we are going to be making might be able to discern shapes and open patches in the data we give them, and infer that there must be something important there that is worth investigating, or be able to say "If there were a connection between X and Y here, everything would make a lot more sense; maybe see if there's one we don't know about." I'll be very happy if we can get that far. We aren't there now.
