Toward a theory of evolution as multilevel learning – pnas.org

Posted: February 5, 2022 at 4:56 am

Significance

Modern evolutionary theory gives a detailed quantitative description of microevolutionary processes that occur within evolving populations of organisms, but evolutionary transitions and emergence of multiple levels of complexity remain poorly understood. Here, we establish the correspondence among the key features of evolution, learning dynamics, and renormalizability of physical theories to outline a theory of evolution that strives to incorporate all evolutionary processes within a unified mathematical framework of the theory of learning. According to this theory, for example, replication of genetic material and natural selection readily emerge from the learning dynamics, and in sufficiently complex systems, the same learning phenomena occur on multiple levels or on different scales, similar to the case of renormalizable physical theories.

We apply the theory of learning to physically renormalizable systems in an attempt to outline a theory of biological evolution, including the origin of life, as multilevel learning. We formulate seven fundamental principles of evolution that appear to be necessary and sufficient to render a universe observable and show that they entail the major features of biological evolution, including replication and natural selection. It is shown that these cornerstone phenomena of biology emerge from the fundamental features of learning dynamics such as the existence of a loss function, which is minimized during learning. We then sketch the theory of evolution using the mathematical framework of neural networks, which provides for detailed analysis of evolutionary phenomena. To demonstrate the potential of the proposed theoretical framework, we derive a generalized version of the Central Dogma of molecular biology by analyzing the flow of information during learning (back propagation) and predicting (forward propagation) the environment by evolving organisms. The more complex evolutionary phenomena, such as major transitions in evolution (in particular, the origin of life), have to be analyzed in the thermodynamic limit, which is described in detail in the paper by Vanchurin et al. [V. Vanchurin, Y. I. Wolf, E. V. Koonin, M. I. Katsnelson, Proc. Natl. Acad. Sci. U.S.A. 119, 10.1073/pnas.2120042119 (2022)].

What is life? If this question is asked in the scientific rather than in the philosophical context, a satisfactory answer should assume the form of a theoretical model of the origin and evolution of complex systems that are identified with life (1). NASA has operationally defined life as follows: Life is a self-sustaining chemical system capable of Darwinian evolution (2, 3). Apart from the insistence on chemistry, long-term evolution that involves (random) mutation, diversification, and adaptation is, indeed, an intrinsic, essential feature of life that is not apparent in any other natural phenomena. The problem with this definition, however, is that natural (Darwinian) selection itself appears to be a complex rather than an elementary phenomenon (4). In all evolving organisms we are aware of, for natural selection to kick off and to sustain long-term evolution, an essential condition is replication of a complex digital information carrier (a DNA or RNA molecule). The replication fidelity must be sufficiently high to provide for the differential replication of emerging mutants and survival of the fittest ones (this replication fidelity level is often referred to as the Eigen threshold) (5). In modern organisms, accurate replication is ensured by elaborate molecular machineries that include not only replication and repair enzymes but also, the entire metabolic network of the cell, which supplies energy and building blocks for replication. Thus, the origin of life is a typical chicken-and-egg problem (or catch-22); accurate replication is essential for evolution, but the mechanisms ensuring replication fidelity are themselves products of complex evolutionary processes (6, 7).

Because genome replication that underlies natural selection is itself a product of evolution, the origin of life has to be explained outside of the traditional framework of evolutionary biology. Modern evolutionary theory, steeped in population genetics, gives a detailed and, arguably, largely satisfactory account of microevolutionary processes: that is, evolution of allele frequencies in a population of organisms under selection and random genetic drift (8, 9). However, this theory has little to say about the actual history of life, especially the emergence of new levels of biological complexity, and nothing at all about the origin of life.

The crucial feature of biological complexity is its hierarchical organization. Indeed, multilevel hierarchies permeate biology: from small molecules to macromolecules; from macromolecules to functional complexes, subcellular compartments, and cells; from unicellular organisms to communities, consortia, and multicellularity; from simple multicellular organisms to highly complex forms with differentiated tissues; and from organisms to communities and eventually, to eusociality and to complex biocenoses involved in biogeochemical processes on the planetary scale. All these distinct levels jointly constitute the hierarchical organization of the biosphere. Understanding the origin and evolution of this hierarchical complexity, arguably, is one of the principal goals of biology.

In large part, evolution of the multilevel organization of biological systems appears to be driven by solving optimization problems, which entails conflicts or trade-offs between optimization criteria at different levels or scales, leading to frustrated states, in the language of physics (10–12). Two notable cases in point are parasite–host arms races, which permeate biological evolution and make major contributions to the diversity and complexity of life-forms (13–16), and the multicellular organization of complex organisms, where the tendency of individual cells to reproduce at the highest possible rate is countered by the control of cell division imposed at the organismal level (17, 18).

Two tightly linked but distinct fundamental concepts that lie effectively outside the canonical narrative of evolutionary biology address evolution of biological complexity: major transitions in evolution (MTEs) (19–21) and multilevel selection (MLS) (22–27). Each MTE involves the emergence of a new level of organization, often described as an evolutionary transition in individuality. A clear-cut example is the evolution of multicellularity, whereby a new level of selection emerges, namely selection among ensembles of cells rather than among individual cells. Multicellular life-forms (even counting only complex organisms with multiple cell types) evolved on many independent occasions during the evolution of life (28, 29), implying that emergence of new levels of complexity is a major evolutionary trend rather than a rare, chance event.

The MLS remains a controversial concept, presumably because of the link to the long-debated subject of group selection (27, 30). However, as a defining component of MTE, MLS appears to be indispensable. A proposed general mechanism behind the MTE, formulated by analogy with the physical theory of the origin of patterns (for example, in glass-like systems), involves competing interactions at different levels and the frustrated states that such interactions cause (12). In the physical theory of spin glasses, frustrations result in nonergodicity and enable formation and persistence of long-term memory: that is, history (31, 32). By contrast, ergodic systems have no true history because they reach all possible states during their evolution (at least in the large time limit), and thus, the only content of quasihistory of such systems is the transition from less probable to more probable states for purely combinatorial reasons: that is, entropy increase (33). As emphasized in Schroedinger's seminal book (34), even if only in general terms, life is based on negentropic processes, and frustrations at different levels are necessary for these processes to take off and persist (12).

The origin of cells, which can and probably should be equated with the origin of life, was the first and most momentous transition at the onset of biological evolution, and as such, it is outside the purview of evolutionary biology sensu stricto. Arguably, the theoretical investigation of the origin of life can be feasible only within the framework of an envelope theory that would incorporate biological evolution as a special case. It is natural to envisage such a theory as encompassing all nonergodic processes occurring in the universe, of which life is a special case, emerging under conditions that remain to be investigated and defined.

Here, in pursuit of a maximally general theory of evolution, we adopt the formalism of the theory of machine learning (35). Importantly, learning here is perceived in the maximally general sense as an objective process that occurs in all evolving systems, including but not limited to biological ones (36). As such, the analogy between learning and selection appears obvious. Both types of processes involve trial and error and acceptance or rejection of the results based on some formal criteria; in other words, both are optimization processes (22, 37, 38). Here, we assess how far this analogy extends by establishing the correspondence between key features of biological evolution and concepts as well as the mathematical formalism of learning theory. We make the case that loss function, which is central to the learning theory, can be usefully and generally employed as the equivalent of the fitness function in the context of evolution. Our original motivation was to explain major features of biological evolution from more general principles of physics. However, after formulating such principles and embedding them within the mathematical framework of learning, we find that the theory can potentially apply to the entire history of the evolving universe (36), including physical processes that have been taking place since the big bang and chemical processes that directly antedated and set the stage for the origin of life. The central propositions of the evolution theory outlined here include both key physical principles (namely, hierarchy of scale, frequency gaps, and renormalizability) (39, 40) and major features of life (such as MLS, persistence of genetic parasites, and programmed cell death).

We show that learning in a complex environment leads to separation of scales, with trainable variables splitting into at least two classes: faster- and slower-changing ones. Such separation of scales underlies all processes that involve the formation of complex structure in the universe from the scale of an atom to that of clusters of galaxies. We argue that, for the emergence of life, at least three temporal scales, which correspond to environmental, phenotypic, and genotypic variables, are essential. In evolving learning systems, the slowest-changing variables are digitized and acquire the replication capacity, resulting in differential reproduction depending on the loss (fitness) function value, which is necessary and sufficient for the onset of evolution by natural selection. Subsequent evolution of life involves emergence of many additional scales, which correspond to MTE. Hereafter, we use the term evolution to describe temporal changes of living, lifelike, and prebiotic systems (organisms), whereas the more general term dynamics refers to temporal processes in other physical systems.

At least since the publication of Schroedinger's book, the possibility has been discussed that, although life certainly obeys the laws of physics, a different class of laws unique to biology could exist. Often, this putative physics of life is associated with emergence (41–43), but the nature of the involved emergent phenomena, to our knowledge, has not been clarified until very recently (36). Here, we outline a general approach to modeling and studying evolution as multilevel learning, supporting the view that a distinct type of physical theory, namely the theory of learning (35, 36), is necessary to investigate the evolution of complex objects in the universe, of which evolution of life is a specific, even if highly remarkable, form.

In this section, we attempt to formulate the minimal universal principles that define an observable universe, in which evolution is possible and perhaps, inevitable. Our analysis started from the major features of biological evolution discussed in the next section and proceeded toward the general principles. However, we begin the discussion with the latter for the sake of transparency and generality.

What are the requirements for a universe to be observable? The possibility to make meaningful observations implies a degree of order and complexity in the observed universe emerging from evolutionary processes, and such evolvability itself seems to be predicated on several fundamental principles. It has to be emphasized that observation and learning here by no means imply mind or consciousness but a far more basic requirement. To learn and survive in an environment, a system (or observer) must predict, with some minimal but sufficient degree of accuracy, the response of that environment to various actions and be able to choose such actions that are compatible with the observer's continued existence in that environment. In this sense, any life-form is an observer, and so are even inanimate systems endowed with the ability of feedback reaction. In this most general sense, observation is a prerequisite for evolution. We first formulate the basic principles underlying observability and evolvability and then give the pertinent comments and explanations.

P1. Loss function. In any evolving system, there exists a loss function of time-dependent variables that is minimized during evolution.

P2. Hierarchy of scales. Evolving systems encompass multiple dynamical variables that change on different temporal scales (with different characteristic frequencies).

P3. Frequency gaps. Dynamical variables are split among distinct levels of organization separated by sufficiently wide frequency gaps.

P4. Renormalizability. Across the entire range of organization of evolving systems, a statistical description of faster-changing (higher-frequency) variables is feasible through the slower-changing (lower-frequency) variables.

P5. Extension. Evolving systems have the capacity to recruit additional variables that can be utilized to sustain the system and the ability to exclude variables that could destabilize the system.

P6. Replication. In evolving systems, replication and elimination of the corresponding information-processing units (IPUs) can take place on every level of organization.

P7. Information flow. In evolving systems, slower-changing levels absorb information from faster-changing levels during learning and pass information down to the faster levels for prediction of the state of the environment and the system itself.

The first principle (P1) is of special importance as the starting point for a formal description of evolution as a learning process. The very existence of a loss function implies that the dynamical system of the universe or, more simply, the universe itself is a learning (evolving) system (36). Effectively, here we assume that stability or survival of any subsystem of the universe is equivalent to solving an optimization or learning problem in the mathematical sense and that there is always something to learn. Crucially, for solving complex optimization problems dependent on many variables, the best and, in fact, the only efficient method is selection implemented in various stochastic algorithms (Markov chain Monte Carlo, stochastic gradient descent, genetic algorithms, and more). All evolution can be perceived as an implementation of a stochastic learning algorithm as well. Put another way, learning is optimization by trial and error, and so is evolution.
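The "learning is optimization by trial and error" point can be made concrete with a minimal (1+1) evolutionary strategy, one of the stochastic algorithms mentioned above. The loss function, step size, and all names below are illustrative choices of ours, not part of the original argument:

```python
import random

def loss(x):
    """Toy loss function: squared distance from a fixed optimum."""
    return sum((xi - 1.0) ** 2 for xi in x)

def evolve(x, steps=2000, step_size=0.1, seed=0):
    """Minimal (1+1) evolutionary strategy: propose a random 'mutation'
    and accept it only if the loss decreases, i.e., selection as
    trial-and-error optimization."""
    rng = random.Random(seed)
    best = loss(x)
    for _ in range(steps):
        trial = [xi + rng.gauss(0.0, step_size) for xi in x]
        trial_loss = loss(trial)
        if trial_loss < best:  # selection step: keep improvements only
            x, best = trial, trial_loss
    return x, best

x0 = [5.0, -3.0, 2.0]
x_final, final_loss = evolve(x0)
assert final_loss < loss(x0)  # the loss is minimized over time
```

The same accept-if-better skeleton underlies Markov chain Monte Carlo (with probabilistic acceptance) and genetic algorithms (with a population of candidates).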

The remaining principles P2 to P7 provide sufficient conditions for observers of our type (that is, complex life-forms) to evolve within a learning system. In particular, P2, P3, and P4 comprise the necessary conditions for observability of a universe by any observer, whereas P5, P6, and P7 represent the defining conditions for the origin of life of our type (hereafter, we omit the qualification for brevity). More precisely, P2 and P3 provide for the possibility of at least a simple form of learning of the environment (fast-changing variables) by an observer (slow-changing variables) and hence, the emergence of complex organization of the slow-changing variables. P4 corresponds to the physical concept of renormalizability, or renormalization group (39, 40), whereby the same macroscopic equations, albeit with different parameters, govern processes at different levels or scales, thus limiting the number of relevant variables, constraining the complexity, and allowing for a coarse-grained description. This principle ensures a renormalizable universe capable of evolution and amenable to observation. Together, P2 to P4 define a universe, in which partial or approximate knowledge of the environment (in other words, coarse graining) is both attainable and useful for the survival of evolving systems (observers). In a universe where P4 does not apply (that is, one with nonrenormalizable physical laws), what happens at the macroscopic level will critically depend on the details of the processes at the microlevel. In a universe where P2 and P3 do not apply, the separation of the micro- and macrolevels itself would not be apparent. In such a universe, it would be impossible to survive without first discovering fundamental physical laws, whereas living organisms on our planet have evolved for billions of years before starting to study quantum physics.
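A toy two-timescale simulation can illustrate what P2 to P4 buy an observer: the fast variable below admits a statistical description (its mean) in terms of the slow variable alone. The dynamics, rates, and noise level are arbitrary assumptions made for illustration:

```python
import random

def simulate(steps=20000, dt=0.01, eps=0.02, seed=1):
    """Two-timescale toy system: a fast variable f relaxes quickly
    toward the current value of a slow variable s, so on the slow
    timescale only the statistics (here, the mean) of f matter."""
    rng = random.Random(seed)
    s, f = 0.5, 0.0
    fast_history = []
    for _ in range(steps):
        f += dt * (-10.0 * (f - s)) + 0.05 * rng.gauss(0.0, dt ** 0.5)  # fast, noisy
        s += dt * eps * (1.0 - s)                                       # slow drift
        fast_history.append(f)
    return s, sum(fast_history[-1000:]) / 1000.0

s_final, f_mean = simulate()
# Coarse graining in the spirit of P4: the fast variable's average is
# predicted by the slow variable alone, without microscopic detail.
assert abs(f_mean - s_final) < 0.05
```

An observer tracking only `s` can predict the average behavior of `f` without resolving its fluctuations, which is the sense in which partial, coarse-grained knowledge is attainable and useful.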

Principles P5, P6, and P7 endow evolving systems with the access to more advanced algorithms for learning and predicting the environment, paving the way for the evolution of complex systems, including, eventually, life. These principles jointly underlie the emergence of the crucial phenomenon of selection (44, 45). In its simplest form, selection is for stability and persistence of evolving, learning systems (46). Learning and survival are tightly linked because survival is predicated on the system's ability to extract information from the environment, and this ability depends on the stability of the system on timescales required for learning. Roughly, a system cannot survive in a world where the properties of the environment change faster than the evolving system can learn them. According to P5, evolving systems consume resources (such as food), which themselves could be produced by other evolving systems, to be utilized as building blocks and energy sources, which are required for learning. This principle embodies Schroedinger's vision that organisms feed on negentropy (34). Under P6, replication of the carriers of slowly changing variables becomes the basis of long-term persistence and memory in evolving systems. This principle can be viewed as a learning algorithm built on P3, whereby the timescales characteristic of an individual organism and of consecutive generations are separated. This principle excludes from consideration certain imaginary forms of life: for example, Stanislaw Lem's famous Solaris (47). Finally, P7 describes how information flows between different levels in the multilevel learning, giving rise to a generalized Central Dogma of molecular biology, which is discussed in Generalized Central Dogma of Molecular Biology.

In this section, we link the fundamental principles of evolution P1 to P7 formulated above to the basic phenomenological features of life (E1 to E10) and seek equivalencies in the theory of learning. The list below is organized by first formulating a biological feature and then 1) tracing its connections to the fundamental principles and 2) adding more general comments.

E1. Discrete IPUs (that is, self- vs. nonself-differentiation and discrimination) exist at all levels of organization. All biological systems at all levels of organization, such as genes, cells, organisms, populations, and so on up to the level of the entire biosphere, possess some degree of self-coherence that separates them, first and foremost, from the environment at large and from other similar-level IPUs.

1) The existence of IPUs is predicated on the fundamental principles P1 to P4. The wide range of temporal scales (P2) in dynamical systems and gaps between the scales (P3) naturally enable the separation of slower- and faster-changing components. In particular, renormalizability (P4) applies to the hierarchy of IPUs. The statistical predictability of the higher frequencies allows the IPUs to decrease the loss function of the lower frequencies, despite the much slower reaction times.

2) Separation of (relatively) slow-changing prebiological IPUs from the (typically) fast-changing environment kicked off the most primitive form of prebiological selection: selection for stability and persistence (survivor bias). More stable, slower-changing IPUs win in the competition and accumulate over time, increasing the separation along the temporal axis as the boundary between the IPUs and the environment grows sharper. Additional key phenomena, such as utilization of available environmental resources (P5) and the stimulus–response mode of information exchange (P7), stem from the flow of matter and information across this boundary and the ensuing separation of internal and external physicochemical processes. Increasing self- vs. nonself-differentiation, combined with replication of the carriers of slow-changing variables (P6), sets the stage for competition between evolving entities and for the onset of the ultimate evolutionary phenomenon, natural selection (E6).

E2. All complex, dynamical systems face multidimensional and multiscale optimization problems, which generically lead to frustration resulting from conflicting objectives at different scales. This is a key, intrinsic feature of all such systems and a major force driving the advent of increasing multilevel complexity (12). Frustration is an extremely general physical phenomenon that is by no means limited to biology but rather, occurs already in much simpler physical systems, such as spin and structural glasses, the behavior of which is determined by competing interactions so that a degree of complexity is attained (31, 32).

1) The multiscale organization of the universe (P2) provides the physical foundation for the ubiquity of frustrated states, which typically arise whenever there is a conflict (trade-off) between short- and long-range optimization problems. Frustrated interactions yield multiwell potential landscapes, in which no single state is substantially fitter than numerous other local optima. Multiparameter and multiscale optimization of the loss function on such a landscape involves nonergodic (history-dependent) dynamics, which is characteristic of complex systems.

2) IPUs face conflicting interactions starting from the most primitive prebiological state (12). Indeed, the separation of any system from the environment immediately results in the conflict of permeability; a stronger separation enhances the self- vs. nonself-differentiation and thus, increases the stability of the system, but it compromises information and matter exchange with the environment, limiting the potential for growth. In biology, virtually all aspects of the organismal architecture and operation are subject to such frustrations or trade-offs: the conflict between the fidelity and speed of information transmission at all levels, between specialization and generalism, between the individual- and population-level benefits, and more. The ubiquity of frustrations and the fundamental impossibility of their resolution in a universally optimal manner are perpetual drivers of evolution and give rise to evolutionary transitions, attaining otherwise unreachable levels of complexity.

There are two distinct types of frustrations, spatial and temporal. Spatial frustration is similar to the frustration that is commonly analyzed in condensed matter systems, such as spin glasses (31, 32). In this case, the spatially local and nonlocal interacting terms have opposite signs so that the equilibrium state is determined by the balance between the terms. In neural networks, a neuron (like a single spin) might have a local objective (such as binary classification of incoming signals) but is also a part of a neural network (like a spin network), which has its own global objective (such as predicting its boundary conditions). For a particular neuron, optimization of the local objective can conflict with the global objective, causing spatial frustration. Temporal frustration emerges because in the context of multilevel learning, the same neuron becomes a part of higher-level IPUs that operate at different temporal scales (frequencies). Then, the optimal state of the neuron with respect to an IPU operating at a given timescale can differ from the optimal state of the same neuron with respect to another IPU operating at a different timescale (36). Similarly to the spatial frustrations, temporal frustrations cannot be completely resolved, but an optimal balance between different spatial and temporal scales is achievable and represents a local equilibrium of the learning system.
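The notion of frustration from competing interactions can be made concrete with the textbook minimal case: three Ising spins on a triangle with pairwise antiferromagnetic couplings, where no configuration can satisfy all three pairs at once. This enumeration is our own toy illustration, not a model from the text:

```python
from itertools import product

def energy(s1, s2, s3):
    """Energy of three Ising spins with antiferromagnetic couplings:
    each pair contributes +1 if aligned, -1 if anti-aligned, so every
    pair 'wants' opposite signs."""
    return s1 * s2 + s2 * s3 + s1 * s3

states = list(product([-1, 1], repeat=3))
energies = {s: energy(*s) for s in states}
e_min = min(energies.values())
ground_states = [s for s, e in energies.items() if e == e_min]

# Frustration: the unfrustrated energy -3 (all three pairs satisfied)
# is unreachable, and the true minimum is highly degenerate.
assert e_min == -1
assert len(ground_states) == 6
```

The six degenerate ground states (out of eight configurations) are a miniature version of the multiwell landscapes and nonergodic dynamics discussed above: which optimum the system settles into depends on its history.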

E3. The hierarchy of multiple levels of organization is an intrinsic, essential feature of evolving biological systems in terms of both the structure of these systems (genes, genomes, cells, organisms, kin groups, populations, species, communities, and more) and the substrate the evolutionary forces act upon.

1) Renormalizability of the universe (P4) implies that there is no inherently preferred level of organization, for which everything above and below would behave as a homogeneous ensemble. Even if some levels of organization come into existence before others (for example, organisms before genes or unicellular organisms before multicellular ones), the other levels will necessarily emerge and consolidate subsequently.

2) The hierarchy of the structural organization of biological systems was apparent to scholars from the earliest days of science. However, MLS was and remains a controversial subject in evolutionary biology (23, 26, 27). Intuitively and as implied by the Price equation (48), MLS should emerge in all evolving systems as long as the higher-level agency of selection possesses a sufficient degree of self- vs. nonself-differentiation. In particular, if organisms of a given species form populations that are sufficiently distinct genetically and interact competitively, population-level selection will ensue. Evolution of biological systems is driven by conflicting interactions (E2) that tend to lead to ever-increasing complexity (12). This trend further feeds the propensity of these systems to form new levels of organization and is associated with evolutionary transitions that involve the advent of new units of selection at multiple levels of complexity. Thus, E3 can be considered a major consequence of E2.
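Since the argument leans on the Price equation (48), it may help to state it concretely: w̄Δz̄ = Cov(w, z) + E(wΔz), an exact identity for any fitnesses w and trait values z, with no biological assumptions. The sketch below verifies it numerically on arbitrary toy numbers of our choosing:

```python
# Price equation check: w_bar * (zp_bar - z_bar) = Cov(w, z) + E(w * dz),
# where zp_bar is the fitness-weighted mean offspring trait.
w = [1.0, 2.0, 3.0]    # fitness (e.g., offspring counts); toy numbers
z = [0.1, 0.4, 0.9]    # parental trait values
zp = [0.2, 0.35, 1.0]  # offspring trait values (z_i plus transmission bias)

n = len(w)
w_bar = sum(w) / n
z_bar = sum(z) / n
zp_bar = sum(wi * zpi for wi, zpi in zip(w, zp)) / (n * w_bar)

cov_wz = sum(wi * zi for wi, zi in zip(w, z)) / n - w_bar * z_bar       # selection term
e_w_dz = sum(wi * (zpi - zi) for wi, zpi, zi in zip(w, zp, z)) / n      # transmission term

lhs = w_bar * (zp_bar - z_bar)
rhs = cov_wz + e_w_dz
assert abs(lhs - rhs) < 1e-12  # exact identity up to floating-point error
```

Because the identity is scale free, the covariance (selection) term can be evaluated at any level of the hierarchy, with the transmission term absorbing all lower-level change, which is the formal sense in which MLS falls out of the Price equation.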

E4. Stochastic optimization is the only feasible approach to complex optimization, but it guarantees neither finding the globally optimal solution nor retention of the optimal configuration when and if it is found. Rather, stochastic optimization tends to rapidly find local optima and keeps the system in their vicinity, sustaining the value of the loss function at a near-optimal level.

1) According to P1, the dynamics of a learning (that is, self-optimizing) system is defined by a loss function (35, 36). When there is a steep gradient in the loss function, a system undergoing stochastic optimization rapidly descends in the right direction. However, because of frustrations that inevitably arise from interactions in a complex system, actual local peaks on the landscape are rarely reached, and the global peak is effectively unreachable. Learning systems tend to get stalled near local saddle points where changes along most of the dimensions either lead up or are flat in terms of the loss function, with only a small minority of the available moves decreasing the loss function (49).
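A minimal illustration of optimization stalling short of the global optimum: plain gradient descent on a tilted double-well loss settles into the shallower local well and never finds the deeper one. The function, starting point, and learning rate below are arbitrary choices of ours:

```python
def f(x):
    """Tilted double-well loss: deeper global minimum near x = -1,
    shallower local minimum near x = +1."""
    return (x * x - 1.0) ** 2 + 0.2 * x

def grad(x):
    """Exact derivative of f."""
    return 4.0 * x * (x * x - 1.0) + 0.2

# Plain gradient descent started at x = 2 slides steeply downhill into
# the nearest (local) well and stays there; the deeper global well is
# never found.
x = 2.0
for _ in range(500):
    x -= 0.01 * grad(x)

assert 0.9 < x < 1.0   # trapped near the local minimum
assert f(x) > f(-1.0)  # the global well is strictly better
```

In realistic high-dimensional landscapes the picture is dominated by saddle points rather than isolated wells, but the qualitative outcome is the same: the system lingers near a suboptimal configuration with a near-optimal loss value.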

2) The extant biological systems (cells, multicellular organisms, and higher-level entities, such as populations and communities) are products of about 4 billion years of the evolution of life, and as such, they are highly, albeit not completely, optimized. As a consequence, the typical distribution of the effects of heritable changes in biological evolution comprises numerous deleterious changes, comparatively rare beneficial changes, and common neutral changes, that is, those with fitness effects below the noise level (50). The preponderance of neutral and slightly deleterious changes provides for evolution by genetic drift, whereby a population moves on the same level or even slightly downward on the fitness landscape, potentially reaching another region of the landscape where beneficial mutations are available (51, 52).

E5. Solutions on the loss function landscapes that arise in complex optimization problems span numerous local peaks of comparable heights.

1) The existence of multiple peaks of comparable heights in the loss function landscapes is a fundamental physical property of frustrated systems (E2), whereas the pervasiveness of frustration itself is a consequence of the multiscale and multilevel organization of the universe (P2). Frustrated dynamical systems are nonergodic, which from the biological perspective, means that, once separated, evolutionary trajectories diverge rather than converge. Because most of these trajectories traverse parts of the genotype space with comparable fitness values, competition rarely results in complete dominance of one lineage over the others but rather, generates rich diversity.

2) In terms of evolutionary biology, fitness landscapes are rugged, with multiple adaptive peaks of comparable fitness (53, 54), and a salient trend during evolution is the spread of life-forms across multiple peaks as opposed to concentrating on one or a few. Evolution pushes evolving organisms to explore and occupy all available niches and try all possible strategies. In the context of machine learning, identical neural networks can start from the same initial state but, for example, under the stochastic gradient descent algorithm, will generically evolve toward different local minima. Thus, the diversity of solutions is a generic property of learning systems. More technically, the diversification is due to the entropy production through the dynamics of the neutral trainable variables (see the next section).
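The divergence of identical learners under stochastic optimization can be sketched as follows: noisy gradient descent on a symmetric double well, started from the same state every time, ends up in different minima depending only on the noise realization. The noise term (a stand-in for minibatch sampling) and all parameters are illustrative assumptions:

```python
import random

def noisy_descent(seed, steps=500, lr=0.05, noise=0.1):
    """Noisy gradient descent on f(x) = (x^2 - 1)^2, always started
    from the same state x = 0; the injected Gaussian noise plays the
    role of minibatch sampling in stochastic gradient descent."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        g = 4.0 * x * (x * x - 1.0)            # exact gradient of f
        x -= lr * (g + rng.gauss(0.0, noise))  # noisy update
    return x

# Identical systems, identical initial state, different stochastic
# histories: across many runs, both equally good minima are reached.
minima = {round(noisy_descent(seed)) for seed in range(50)}
assert minima == {-1, 1}
```

Which well a given run ends in is decided by the early noise kicks, after which the gradient locks the trajectory in, a one-dimensional caricature of lineages diverging across a rugged landscape and never reconverging.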

E6. Separation between genotype and phenotype. This quintessential feature of life embodies two distinct (albeit inseparable in known organisms) symmetry-breaking phenomena: 1) separation between dedicated digital information storage media (stable, rarely updatable, tending to distributions with discrete values) and mostly analog processing devices and 2) asymmetry of the information flow within the IPUs whereby the genotype provides instructions for the phenotype, whereas the phenotype largely loses the ability to update the genotype directly. The separation between the information storage and processing subsystems is a prerequisite for efficient evolution that probably emerged early on the path from prebiotic entities to the emergence of life.

1) The separation between phenotype and genotype extends the scale separation on the intra-IPU level as follows from the fundamental principles P1 to P4. Intermediate-frequency components of an IPU (phenotype) buffer the slowest components from direct interaction with the environment (the highest-frequency variables), further increasing the stability of the slowest components and making them suitable for long-term information storage. As the temporal scales separate further, the interactions between them change. Asymmetric information flow (P7) stabilizes the system, enabling long-term preservation of information (genotype) while retaining the reactive flexibility of the faster-changing components (phenotype).

2) The emergence of the separation between phenotype and genotype is a crucial event in prebiotic evolution. This separation is prominent in all known as well as hypothetical life-forms. Even when the phenotype and genotype roles are fulfilled by chemically identical molecules, as in the RNA world scenario of primordial evolution (55, 56), their roles as effectors and information storage devices are sharply distinct. In biological terms, the split is between replicators, that is, the digital information carriers (genomes), and reproducers (57-59), the analog devices (cells, organisms) that host the replicators, supply them with building blocks (P5), and themselves reproduce (P6) under the replicators' instruction (P7). Although the genotype/phenotype separation is a major staple of life, it is in itself insufficient to qualify an IPU as a life-form (computers and record players, in which the separation between information storage and operational parts is prominent and essential, clearly are not life, even though invented by advanced organisms). The asymmetry of the information flow between genotype and phenotype (P7) is the most general form of the phenomenon known as the Central Dogma of molecular biology: the unidirectional flow of information from nucleic acids to proteins, as originally formulated by Crick (60). This asymmetry is also prominent in other information-processing systems, in particular computers. Indeed, von Neumann architecture computers have inherently distinct memory and processing units, with the instruction flow from the former to the latter (61, 62). It appears that any advanced information-processing system is endowed with this property.

Emergence of long-term digital storage devices, that is, genomes consisting of RNA or DNA (E6), provides for long-term information preservation, facilitates adaptive reactions to changes in the environment, and promotes the stability of IPUs to the point where (at least in chemical systems) it is limited by the energy of chemical bonds rather than the energy of thermal fluctuations. Obviously, however, as long as this information is confined to a single IPU, it will disappear with the inevitable eventual destruction of that IPU. Should this be the case, other IPUs of similar architecture would need to accumulate a comparable amount of information from scratch to reach the same level of stability. Thus, copying and sharing information are essential for the long-term (effectively, indefinite) persistence of IPUs.

1) The fundamental principle P6 postulates the existence of mechanisms for information copying and elimination. If genomic information can be replicated, even the most primitive sharing mechanisms (such as physical splitting of an IPU under forces of surface tension) would result (even if not reliably) in the emergence of distinct IPUs preloaded with information that was amassed by their progenitor(s). This process short-circuits learning and allows information to accumulate on timescales far exceeding the characteristic lifetimes of individual IPUs.

2) Information copying and sharing are beneficial only if the fidelity exceeds a certain threshold, sometimes called the Eigen limit in evolutionary biology (57). Nevertheless, in primitive prebiotic systems, the required fidelity level could have been quite low (63). For instance, even a biased chemical composition of a hydrophobic droplet could enhance the stability of the descendant droplets and thus endow them with an advantage in the selection for persistence. However, once relatively sophisticated mechanisms of information copying and sharing emerge, or more precisely, when replicators become information storage devices, the overall stability of the system can increase by orders of magnitude. To wit, and astonishingly, the only biosphere known to us represents an unbroken chain of genetic information transmission that spans about 4 billion y, commensurate with the timescale of stellar evolution.

Evolution by natural selection (Darwinian evolution) arises from the combination of all the principles and phenomena described above. The necessary and sufficient conditions for Darwinian evolution to operate are 1) the existence of IPUs that are distinct from the environment and from each other (E1), 2) the dependence of the stability of an IPU on the information it contains (that is, the phenotype-genotype feedback; E6), and 3) the ability of IPUs to make copies of the embedded information and share it with other IPUs (E7). When these three conditions are met, the relative frequencies of the more stable IPUs will increase with time via attrition of the less stable ones (survival of the fittest) and transfer of information among IPUs, both vertically (to progeny) and horizontally. This process engenders the key feature of Darwinian evolution, differential reproduction of genotypes, based on the feedback from the environment transmitted through the phenotype.

1) All seven fundamental principles of life-compatible universes (P1 to P7) are involved in enabling evolution by natural selection. The very existence of units, on which selection can operate, hinges on self- vs. nonself-discrimination of prebiotic IPUs (E1) and the emergence of shareable information storage (E6 and E7). The crucial step to biology is the emergence of the link between the loss function (P1), on the one hand, and the existence of the IPUs (P2, P3, P4, and E1), on the other hand. Consumption of (limited) external resources (P5) entails competition between IPUs that share the same environment and turns mere shifts of the relative frequencies into true survival of the fittest. The ability of the IPUs to replicate (P6) and expand their memory storage (genotype; P7, E6, and E7) provides them with access to hitherto unavailable degrees of freedom, making evolution an open-ended process rather than a quick, limited search for a local optimum.

2) Evolution by natural selection is the central tenet of evolutionary biology and a key part of the NASA definition of life. An important note on definitions is due. We already referred to selection when discussing prebiotic evolution (E1); however, the term natural (Darwinian) selection is here reserved for the efficient form of selection that emerges with the replication of dedicated information storage devices (P6 and E6). Differential reproduction, whereby the environment provides feedback on the fitness of genotypes while acting on phenotypes, turns into Darwinian survival of the fittest in the presence of competition. When IPUs depend on environmental resources, such competition inevitably arises, except in the unrealistic case of unlimited supply (44). With the onset of Darwinian evolution, the system can be considered to cross the threshold from prelife to life (64, 65). The evolutionary process is naturally represented by the movement of an evolving IPU in a genotype space, where proximity is defined by similarity between distinct genotypes and transitions correspond to elementary evolutionary events: that is, mutations in the most general sense (66). For any given environment, fitness, that is, a measure of the ability of a genotype to produce viable offspring, can be defined for each point in the genotype space, forming a multidimensional fitness landscape (53, 54). Selection creates a bias for preferential fixation of mutations that increase fitness, even if the mutations themselves occur completely randomly.

Parasites and host-parasite coevolution are ubiquitous across biological systems at multiple levels of organization and are both intrinsic to and indispensable for the evolution of life.

1) Due to the flexibility of life-compatible systems (P5 and P6) and the symmetry breaking in the information flow (P7) combined with the inherent tendency of life to diversify (E5), parts of the system inevitably settle on a parasitic state: that is, scavenging information and matter from the host without making a positive contribution to its fitness.

2) From the biological perspective, parasites evolve to minimize their direct interface with the environment and, conversely, to maximize their interaction with the host; in other words, the host replaces most of the environment for the parasite. Parasites inevitably emerge and persist in biological systems for two reasons: 1) the parasitic state is reachable via an entropy-increasing step and therefore is highly probable (16), and 2) highly efficient antiparasite immunity is costly (67). The cost of immunity reflects another universal trade-off, analogous to the trade-off between information transfer fidelity and energy expenditure; in both cases, an infinite amount of energy is required to reach a zero error rate or a parasite-free state. From a complementary standpoint, parasites inevitably evolve as cheaters in the game of life that exploit the host as a resource without expending energy on resource production. In the short term, parasites reduce the host fitness by both a direct drain on its resources and various indirect effects, including the cost of defense. However, in a longer-term perspective, parasites make up a reservoir for the recruitment of new functions (especially, but far from exclusively, for defense) by the hosts (14, 15). The host-parasite relationship can evolve toward a mutually beneficial, symbiotic lifestyle that can further progress to mutualism and, in some cases, complete integration, as exemplified by the origin of the essential endosymbiotic organelles of eukaryotes, mitochondria and chloroplasts (68, 69). Parasites emerge at similar levels of biological organization (organisms parasitizing other organisms) or across levels (genetic elements parasitizing organismal genomes or cell clones parasitizing multicellular organisms).

Programmed (to various degrees) death is an intrinsic feature of life.

1) Replication and elimination of IPUs (P6) and utilization of additional degrees of freedom (P5) form the foundation for the phenomenon of programmed death. At some levels of organization (for example, intragenomic), the ability to add and eliminate units (such as genes) for the benefit of the higher-level systems (such as organisms) provides an obvious path of optimization. Elimination of units could be, in principle, completely random, but selection (E8) generates a sufficiently strong feedback to facilitate and structure the loss process (for example, purging low-fitness genes via homologous recombination or altruistic suicide of infected or otherwise impaired cells). The same forces operate at least at the cell level and conceivably, at all levels of organization and selection (P4). In particular, if population-level or kin-level selection is sufficiently strong, mechanisms for altruistic death of individual organisms apparently can be fixed in evolution (70, 71).

2) Programmed death is a prominent case of minimization of the higher-level (for example, organismal) loss function at the cost of increasing the lower-level loss function (such as that of individual cells). Although (tightly controlled) programmed cell death was originally discovered in multicellular organisms and has been thought to be limited to these complex life-forms, altruistic cell suicide now appears to be a universal biological phenomenon (71-73).

To conclude this section, which we titled "fundamental evolutionary phenomena," deliberately omitting "biological," it seems important to note that phenomena E1 to E7 are generic, applying to all learning systems, including purely physical and prebiotic ones. However, the onset of natural selection (E8) marks the origin of life, so that phenomena E8 to E10 belong in the realm of biology.

In the previous sections, we formulated the seven fundamental principles of evolution P1 to P7 and then, argued that the key evolutionary phenomena E1 to E10 can be interpreted and analyzed in the context of these principles and apparently, derived from the latter. The next step is to formulate a mathematical framework that would be consistent with the fundamental principles and thus, would allow us to model evolutionary phenomena analytically or numerically. For concreteness, the proposed framework is based on a mathematical model of artificial neural networks (74, 75), but we first outline a general optimization approach in a form suitable for modeling biological evolution.

We are interested in the broadest class of optimization problems, where the loss (or cost) function H(x, q) is minimized with respect to some trainable variables,

q = (q(c), q(a), q(n)),  [3.1]

for a given training set of nontrainable variables,

x = (x(o), x(e)).  [3.2]

Near a local minimum, the first derivatives of the average loss function with respect to the trainable variables q are small, and the depth of the minimum usually depends on the second derivatives. In particular, the second derivative can be large for the effectively constant degrees of freedom, q(c); small for adaptable degrees of freedom, q(a); or near zero for symmetries or neutral directions, q(n). The separation of the neutral directions q(n) into a special class of variables simply means that some of the trainable variables can be changed without affecting the learning outcome: that is, the value of the loss function. Put another way, neutral changes are always possible. The neutral directions q(n) are the fastest changing among the trainable variables because fluctuations resulting in their change are, in general, fully stochastic. At the other end of the spectrum, even minor changes to the effectively constant variables q(c) compromise the entire learning (evolution) process: that is, result in a substantial increase of the loss function value; these variables correspond to deep minima of the loss function. When the basin of attraction of a minimum is deep and narrow, the system stays at its bottom for a long time, and to describe such a state, it is sufficient to use discrete information (that is, to indicate that the system stays in the given minimum) rather than to list all the specific values of the coordinates in a multidimensional space.
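The curvature-based classification above can be made concrete with a toy sketch (the curvature values, thresholds, and labels are our illustrative assumptions): directions with a large second derivative of the loss behave as effectively constant, q(c); those with a small but nonzero second derivative as adaptable, q(a); and those with a near-zero second derivative as neutral, q(n).

```python
# Second derivatives of a toy loss at a local minimum, one per trainable direction.
curvatures = {"q0": 100.0, "q1": 1.0, "q2": 0.0}

def classify(c, stiff=10.0, flat=1e-8):
    """Assign a trainable direction to a class by its curvature c = d2H/dq2."""
    if c >= stiff:
        return "q(c): effectively constant"
    if c <= flat:
        return "q(n): neutral direction"
    return "q(a): adaptable"

for name, c in curvatures.items():
    print(name, "->", classify(c))
```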

In a generic optimization problem, the dynamics of both trainable and nontrainable variables involves a broad distribution of characteristic timescales τ, and switching between scales is equivalent to switching between different frequencies or, in the context of biological evolution, between different levels of organization. For any fixed τ, all variables can be partitioned into three classes depending on how fast they change with respect to the specified timescale:

fast-changing nontrainable variables that characterize an organism x(o) and its environment x(e) and change on timescales ≪ τ;

intermediate-changing adaptable variables q(a) or neutral directions q(n) that change on timescales ~τ; and

slow-changing variables, which are the degrees of freedom q(c) that have already been well trained and are effectively constant (at or near equilibrium), only changing on timescales ≫ τ.

As will become evident shortly, the separation of these three classes of variables and interactions between them are central to the evolution and selection on all levels of organization, resulting in pervasive multilevel learning and selection.

Depending on the considered timescale τ (or as a result of environmental changes), the same dynamical degree of freedom can be assigned to different classes of variables: that is, x(o), x(e), q(c), q(a), or q(n). For example, on the shortest timescale, which corresponds to the lifetime of an individual organism (one generation), the adaptable variables are the phenotypic traits that quickly respond to environmental changes, whereas the slowest, near-constant variables are the genomic sequences (genotype) that change minimally if at all. On longer timescales, corresponding to thousands or millions of generations, fast-evolving portions of the genome become adaptable variables, whereas the conserved core of the genome remains in the near-constant class (50). Analogously, the neutral directions correspond either to nonconsequential phenotypic changes or to neutral genomic mutations, depending on the timescale. It is well known that the overwhelming majority of mutations are either deleterious and therefore eliminated by purifying selection or (nearly) neutral and thus can be either lost or fixed via drift (76, 77). However, when the environment changes or under the influence of other mutations, some of the neutral mutations can become beneficial [a genetic phenomenon known as epistasis, which is pervasive in evolution (78, 79)], and in their entirety, neutral mutations form the essential reservoir of variation available for adaptive evolution (80). Even which variables are classified as nontrainable (x) depends on the timescale τ. For example, if a learning system has been trained for a sufficiently long time, some of the trainable variables q(a) or q(n) might have already equilibrated and become nontrainable.
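This timescale-dependent assignment can be sketched in a few lines (the characteristic times and the gap factor are invented for illustration): the same variable lands in different classes as the observation timescale τ changes.

```python
def partition(times, tau, gap=10.0):
    """Partition variables by characteristic time relative to the timescale tau."""
    out = {}
    for name, t in times.items():
        if t < tau / gap:
            out[name] = "fast (nontrainable x)"
        elif t > tau * gap:
            out[name] = "slow (effectively constant q(c))"
        else:
            out[name] = "intermediate (adaptable q(a) / neutral q(n))"
    return out

# Arbitrary characteristic times, in units of one generation.
times = {"environment": 1e-3, "phenotype": 1.0, "genome": 1e6}
print(partition(times, tau=1.0))   # one generation: the genome is effectively constant
print(partition(times, tau=1e6))   # many generations: the genome becomes adaptable
```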

Now that we have described an optimization problem that is suitable for modeling the evolution of organisms (or populations of organisms), we can construct a mathematical framework to solve such optimization problems. For this purpose, we employ the mathematical theory of artificial neural networks (74, 75), which is simple enough to perform calculations while being consistent with all of the fundamental principles (P1 to P7) and thus can be used for modeling the evolutionary phenomena (E1 to E10). We first recall the general framework of neural network theory.

Consider a learning system represented as a neural network, with the state vector described by trainable variables q (a collective notation for the weight matrix ŵ and the bias vector b) and nontrainable variables x (the current state vector of individual neurons). In the biological context, x collectively represent the current state of the organism, x(o), and of its environment, x(e), and q determines how x changes with time, in particular, how the organism reacts to environmental challenges. The nontrainable variables are modeled as changing in discrete time steps

x_i(t + 1) = f_i(Σ_j w_ij x_j(t) + b_i),  [4.1]

where the f_i(y) are some nonlinear activation functions (for example, hyperbolic tangent or rectifier activation functions). The trainable variables are modeled as changing according to the gradient descent (or stochastic gradient descent) algorithm

q_i(t + 1) = q_i(t) − γ ∂H(x(t), q(t))/∂q_i,  [4.2]

where γ is the learning rate parameter and H(x, q) is a suitably defined loss function (Eqs. 4.3 and 4.4). In other words, q are the gross, or main, variables, which determine the rules of the dynamics, and the dynamics of all other variables x is governed by these rules, per Eq. 4.1. In the biological context, Eq. 4.1 represents fast, often stochastic environmental changes and the correspondingly fast reaction of organisms at the phenotype level, whereas Eq. 4.2 reflects the slower learning dynamics of evolutionary adaptation via changes in the intermediate, adaptable variables: that is, the variable portion of the genome. The main learning objective is to adjust the trainable variables such that the average loss function is minimized subject to boundary conditions (also known as the training dataset), which in our case is modeled as a time sequence of the environmental variables.
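Eqs. 4.1 and 4.2 can be turned into a minimal simulation (the network size, tanh activation, quadratic loss, target "environment," and learning rate are our assumptions, not prescriptions of the paper): the nontrainable neuron states x are updated by forward propagation, while the trainable weights and biases follow gradient descent on the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
w = rng.standard_normal((n, n)) * 0.1   # trainable weight matrix
b = np.zeros(n)                         # trainable bias vector
x = rng.standard_normal(n)              # nontrainable neuron states
target = np.full(n, 0.5)                # stand-in environmental state to predict
gamma = 0.1                             # learning rate

def loss(w, b, x):
    pred = np.tanh(w @ x + b)
    return 0.5 * np.sum((target - pred) ** 2)

for _ in range(300):
    pred = np.tanh(w @ x + b)                # Eq. 4.1: x_i(t+1) = f_i(sum_j w_ij x_j + b_i)
    dz = (pred - target) * (1 - pred ** 2)   # analytic gradient of the loss w.r.t. w @ x + b
    w -= gamma * np.outer(dz, x)             # Eq. 4.2 for the weights
    b -= gamma * dz                          # Eq. 4.2 for the biases
    x = pred                                 # the nontrainable variables take the new state

print(round(float(loss(w, b, x)), 6))        # the loss is driven close to zero
```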

For example, on a single-generation timescale, the fast-changing variables represent the environment x(e) and the nontrainable variables associated with organisms x(o); the intermediate-changing variables represent adaptive q(a) and neutral q(n) phenotype changes; and the slow-changing variables q(c) represent the genotype (SI Appendix, Fig. S1).

The temporal-scale separation in biology is readily apparent in all organisms. Indeed, consequential changes in the environment x(e) often occur on the scale of milliseconds to seconds, triggering physical changes within organisms x(o) on matching timescales. Individual organisms respond with phenotypic changes, both adaptive q(a) and neutral q(n), on the scale of minutes to hours, exploiting their genetically encoded phenotypic plasticity. A paradigmatic example is the induction of bacterial operons in response to a change in the chemical composition of the environment, such as the switch from glucose to galactose as the primary nutrient (81, 82). In contrast, changes in the genome q(c) take much longer. Mutations typically occur at rates of about 1 to 10 per genome replication cycle (83), which for unicellular organisms is the same as a generation, lasting from about an hour to hundreds or even thousands of hours. However, fixation of mutations, which represents an evolutionarily stable change at the genome level, typically takes many generations and thus always occurs orders of magnitude more slowly than phenotype changes. Accordingly, on this timescale, any changes in the genome represent the third layer in the network, the slowly changing variables.

To specify a microscopic loss function that would be appropriate for describing evolution and thus give a specific form to the fundamental principle P1, we first note that adaptation to the environment is more efficient (that is, the loss function value is smaller) for a learning system, such as an organism, that can predict the state of its environment with a smaller error. Then, the relevant quantity is the so-called boundary loss function, defined as the sum of squared errors,

H_e(x, q) ≡ (1/2) Σ_{i∈E} (x_i(e) − f_i(x(o), q))²,  [4.3]

where the summation is taken only over the boundary (or environmental) nontrainable variables. It is helpful to think of the boundary loss function as the mismatch between the actual state of the environment and the state that would be predicted by the neural network if the environmental dynamics were switched off. In neuroscience, boundary loss is closely related to the surprise (or prediction error) associated with predictions of sensations, which depend on an internal model of the environment (84). In machine learning, boundary loss functions are most often used in the context of supervised learning (35), and in biological evolution, the supervision comes from the environment, which the evolving system, such as an organism or a population, is learning to predict.

Another possibility for a learning system is to search for the minimum of the bulk loss function, which is defined as the sum of squared errors over all neurons:

H(x, q) = (1/2) Σ_i (x_i − f_i(x(o), q))².  [4.4]

The bulk loss function assumes an extra cost incurred by changing the states of organismal neurons, x(o): that is, it rewards stationary states. In the limit of a very large number of environmental neurons, the two loss functions are indistinguishable, H(x, q) ≈ H_e(x, q), but the bulk loss is easier to handle mathematically (the details of boundary and bulk loss functions are addressed in ref. 35).
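The statement that the two loss functions become indistinguishable for a large environment can be checked numerically (the random-error model below is a stand-in for actual prediction errors, and the sizes are arbitrary): as the number of environmental neurons grows, the bulk loss is dominated by the boundary term.

```python
import numpy as np

rng = np.random.default_rng(1)

def losses(n_env, n_org=3):
    """Boundary loss (Eq. 4.3, environmental neurons only) vs. bulk loss (Eq. 4.4, all neurons)."""
    err_env = rng.standard_normal(n_env)   # prediction errors on environmental neurons
    err_org = rng.standard_normal(n_org)   # prediction errors on organismal neurons
    boundary = 0.5 * np.sum(err_env ** 2)
    bulk = boundary + 0.5 * np.sum(err_org ** 2)
    return boundary, bulk

for n_env in (10, 1000, 100000):
    boundary, bulk = losses(n_env)
    print(n_env, round(float(bulk / boundary), 4))   # the ratio approaches 1 as n_env grows
```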

More generally, in addition to the kinetic term [4.4], the loss function can include a potential term V(x, q):

H(x, q) = (1/2) Σ_i (x_i − f_i(x(o), q))² + V(x, q).  [4.5]

The kinetic term in [4.5] reflects the ability of organisms x(o) to predict the changes in the state of the given environment x(e) over time, whereas the potential term reflects their compatibility with a given environment and hence, the capacity to choose among different environments.

In the context of biological evolution, Malthusian fitness R(x, q) is defined as the expected reproductive success of a given genotype: that is, the rate of change of the prevalence of the given genotype in an evolving population (85). However, in the context of the theory of learning, the loss function must be identified with additive fitness: that is,

H(x, q) = −T log R(x, q).  [4.6]

For a microscopic description of learning, the proportionality constant is unimportant, but as we argue in detail in the accompanying paper (86), in the description of the evolutionary process from the point of view of thermodynamics, T plays the role of evolutionary temperature.
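A direct consequence of identifying the loss with additive fitness, H = −T log R (with R the Malthusian fitness), is that multiplicative fitness effects across independent loci become additive contributions to the loss; a quick check (the per-locus fitness values are arbitrary illustrative numbers):

```python
import math

T = 1.0
locus_fitness = [1.2, 0.9, 1.05]                      # multiplicative per-locus effects
R = math.prod(locus_fitness)                          # Malthusian fitness of the genotype
H_total = -T * math.log(R)                            # loss of the whole genotype
H_sum = sum(-T * math.log(r) for r in locus_fitness)  # sum of per-locus losses
print(abs(H_total - H_sum) < 1e-12)                   # True: the loss is additive
```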

Given a concrete mathematical model of neural networks, one might wonder whether all the fundamental principles of evolution (P1 to P7) can be derived from this model. Such a derivation would constitute additional evidence supporting the claim that the entire universe can be adequately described as a neural network (36). Clearly, the existence of a loss function (P1) follows automatically because learning of any neural network is always described relative to a specified loss function (Eq. 4.4 or 4.5). The other six principles also seem to naturally emerge from the learning dynamics of neural networks. In particular, the hierarchy of scales (P2) and frequency gaps (P3) are generic consequences of the learning dynamics, whereby a system that involves a wide range of variables changing at different rates is attracted toward a self-organized critical state of slow-changing trainable variables (87). Additional gaps between levels of organization are also expected to appear through phase transitions, as becomes apparent in the thermodynamic description of evolution we develop in the accompanying paper (86). Renormalizability (P4) is a direct consequence of the second law of learning (35), according to which the entropy of a system (and consequently, the complexity of a neural network, or the rank of its weight matrix) decreases with learning. This phenomenon was observed in neural network simulations (35) and is exactly the type of dynamics that can make a system renormalizable even if it started off as a highly entangled (large rank of the weight matrix), nonrenormalizable neural network. The extension (P5) and replication (P6) principles simply indicate that additional variables can lead to either an increase or a decrease in the value of the loss function (35). It is also important to note that in neural networks, an additional computational advantage (quantum advantage) can be achieved if the number of IPUs can vary (88).
Therefore, to achieve such an advantage, a system must learn how to replicate and eliminate its IPUs (P6). Finally, in Generalized Central Dogma of Molecular Biology, we illustrate how Fourier transform (or more generally, wavelet transform) of the environmental degrees of freedom can be used for learning the environment and how the inverse transform can be used for predicting it. Thus, to be able to predict the environment (and hence, to be competitive), any evolving system must learn the mechanism behind such asymmetric information flow (P7).

In the previous sections, we argued that the learning process naturally divides all the dynamical variables into three distinct classes: fast-changing ones, x(o) and x(e); intermediate-changing ones, q(a) and q(n) (q(n) being faster than q(a)); and slow-changing ones, q(c) (SI Appendix, Fig. S1). Evidently, this separation of variables depends on the timescale τ during which the system is observed, and variables migrate between classes when τ is increased or decreased (SI Appendix, Fig. S2). The longer the timescale, the more variables reach equilibrium and therefore can be modeled as nontrainable and fast changing, x(e), and the fewer variables remain slowly varying and can be modeled as effectively constant, q(c). In other words, many variables that are nearly constant on short timescales migrate to the intermediate class on longer timescales, whereas variables from the intermediate class migrate to the fast class.

In biological terms, if we consider learning dynamics on the timescale of one generation, then q(a) and q(n) represent phenotype variables, and q(c) represents genotype variables; however, on the much longer timescales of multiple generations, the learning dynamics of populations (or communities) of organisms becomes relevant. On such timescales, the genotype variables acquire dynamics, with purifying and positive selection coming into play, whereas the phenotype variables progressively equilibrate. There is a clear connection between learning dynamics, including that in biological systems, and the renormalizability of physical theories (P4). Indeed, from the point of view of an efficient learning algorithm, the parameters controlling the learning dynamics, such as the effective learning (or information-processing) rate γ, can vary from one timescale to another (for example, from individual organisms to populations or communities of organisms), but the general principles, as well as the specific dependencies captured in the equations above that govern the learning dynamics on different timescales, remain the same. We refer to this universality of the learning process on different timescales, together with the partitioning of the variables into temporal classes, as multilevel learning.

More precisely, multilevel learning is a property of learning systems that allows the basic equations of learning, such as [4.4], to remain the same on all levels of organization, while the parameters that describe the dynamics, such as γ(τ), depend on the level or on the timescale τ. For example, if the effective learning (or information-processing) rate γ(τ) decreases with the timescale τ, then the local processing time, which depends on γ(τ), runs differently for different trainable variables: slower for the slow-changing variables (larger τ) and faster for the fast-changing ones (smaller τ). For such a system, the concept of global time (that is, the same time for all variables) becomes irrelevant and should be replaced with the proper, or local, time, which is defined for each scale separately:

t(τ) ∝ γ(τ) t.  [5.1]
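The scale-dependent local time can be illustrated with a toy profile (the choice γ(τ) = 1/τ and the unit proportionality constant are our arbitrary assumptions): each timescale accumulates its own local time, so faster scales "tick" more often per unit of global time.

```python
def gamma(tau):
    """Assumed decreasing learning-rate profile gamma(tau) = 1/tau."""
    return 1.0 / tau

def local_time(tau, t_global):
    """Local time t(tau) proportional to gamma(tau) * t, with the constant set to 1."""
    return gamma(tau) * t_global

for tau in (1.0, 10.0, 100.0):
    print(tau, local_time(tau, 1000.0))   # slower scales accumulate less local time
```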

This effect closely resembles time dilation phenomena in physics, except that in special and general relativity, time dilation is linked with the possibility of movement between slow and fast clocks (or variables) (89). To illustrate the role time dilation plays in biology, consider only two types of variables: slow changing and fast changing. Then, the slow variables should be able to outsource certain computational tasks to the faster variables. Because the local clock for the fast-changing variables runs faster, the slow-changing variables can take advantage of the fast-changing ones to accelerate computation, which would be rewarded by evolution. The flow of information between slow-changing and fast-changing variables in the opposite direction is also beneficial because the fast-changing variables can use the slow-changing variables to store useful information for future retrieval: that is, the slow variables function as long-term memory. In the next section, we show that such cooperation between slow- and fast-changing variables, which is a concrete manifestation of principle P7, corresponds to a crucial biological phenomenon, the Central Dogma of molecular biology (60).

In terms of learning theory, the two directions of the asymmetric information flow (P7) represent learning the state of the environment and predicting the state of the environment from the results of learning. For learning, information is passed from faster variables to slower variables, and for predicting, information flows in the opposite direction from slower variables to the faster ones. A more formal analysis of the asymmetric information flows (or a generalized Central Dogma) can be carried out by forward propagation (from slow variables to fast variables) and back propagation (from fast variables to slow variables) of information within the framework of the mathematical model of neural networks developed in the previous sections (SI Appendix, Fig. S3).

Consider nontrainable environmental variables that change continuously with time, x(e)(t), while the learning objective of an organism is to predict x(e)(t) at time t > 0, given that it was observed during the time interval −τ < t < 0. To do so, the organism should be able to store and retrieve the values of the Fourier coefficients

q_k = (1/τ) ∫_{−τ}^{0} x(e)(t) e^{−i2πf_k t} dt,  [6.1]

or more generally, the wavelet coefficients

q_k = (1/τ) ∫_{−τ}^{0} W_k(t) x(e)(t) e^{−i2πf_k t} dt,  [6.2]

for suitably defined window functions W_k. Then, a prediction could be made by extrapolating x(e)(t) using the inverse transformation

x(e)(t + ε) ≈ 2 Σ_{k=k_min}^{k_max} Re(q_k e^{i2πf_k ε}),  [6.3]

for some ε > 0 that is not too large compared with τ. However, in general, the total number of (Fourier or wavelet) coefficients q_k would be countably infinite. Therefore, any finite-size organism has to decide which frequencies to observe (and remember) and which ones to filter out (and forget).

Let us assume that the organism decided to observe/remember only the discrete frequencies

f_min ≤ f_{k_min}, …, f_{k_max} ≤ f_max,  [6.4]

and to forget everything else. Then, to predict the state of the environment [6.3] and, as a result, minimize the loss function [4.5], the organism should be able to store, retrieve, and adjust the information about the coefficients q_k in some adaptable trainable variables q^(a).
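The observation-and-extrapolation scheme of Eqs. 6.1, 6.3, and 6.4 can be sketched numerically. This is a toy illustration under our own assumptions (the signal, the sample count, and all variable names are made up for the example, not taken from the paper): the organism observes a signal on [0, τ], stores a truncated set of Fourier coefficients, and uses them to predict the signal slightly beyond the observation window.

```python
import numpy as np

tau = 1.0                          # observation window length [0, tau]
n = 4096                           # number of observed samples
t = np.linspace(0.0, tau, n, endpoint=False)

# A made-up "environment": a few harmonics the organism could in principle learn.
freqs_true = np.array([2.0, 5.0, 9.0])
amps_true = np.array([1.0, 0.5, 0.25])
x_env = sum(a * np.cos(2 * np.pi * f * t) for a, f in zip(amps_true, freqs_true))

# Eq. 6.1: q_k = (1/tau) * integral_0^tau dt x(t) exp(-i 2 pi f_k t),
# approximated here by a mean over the observed samples.
f_k = np.arange(1.0, 12.0)         # discrete frequencies kept (the Eq. 6.4 truncation)
q = np.array([np.mean(x_env * np.exp(-2j * np.pi * f * t)) for f in f_k])

# Eq. 6.3: extrapolate the signal to t = tau + eps from the stored coefficients.
def predict(t_future):
    return 2.0 * np.sum(np.real(q * np.exp(2j * np.pi * f_k * t_future)))

eps = 0.01                         # small compared with tau, as the text requires
x_true = sum(a * np.cos(2 * np.pi * f * (tau + eps))
             for a, f in zip(amps_true, freqs_true))
print(predict(tau + eps), x_true)  # prediction vs. actual future value
```

Because the example signal contains only frequencies inside the retained band, the truncated coefficients recover it and the extrapolation is accurate; frequencies outside the band would be "forgotten" exactly as the text describes.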

Given this simple model, we can study the flow of information between the different nontrainable variables of the organism, x^(o). To this end, it is convenient to organize the variables as

x = (x^(o), x^(e)) = (x_{k_min}, …, x_{k_max}, x^(e)),  [6.5]

where

x_k(t+ε) ≡ 2 Σ_{l=k_min}^{k} Re(q_l e^{i2πf_l (t+ε)}),  [6.6]

and assume that the relevant information about the q_k's is stored in the adaptable trainable variables

q^(a) = (q_{k_min}, …, q_{k_max}).  [6.7]

In the estimate of x_k(t+ε) in Eq. 6.6, all the higher-frequency modes are assumed to average to zero, as is often the case if we are only interested in the timescale ∼ f_k^{-1}. A better estimate can be obtained using, once again, the ideas of the renormalization group flow, following the fundamental principle P4. To make learning (and thus, survival) efficient, truncation of the set of variables relevant for learning is crucial. The main point is that the higher-frequency modes can still contribute statistically, and then, an improved estimate of x_k(t+ε) would be obtained by appropriately modifying the values of the coefficients q_k. Either way, in order to make an actual prediction, the organism should first calculate x_k(t+ε) for small f_k and then pass the result to the next level to calculate x_{k+1}(t+ε) for the larger f_{k+1}, and so on. Such computations can be described by the simple mapping

x_{k+1}(t+ε) = x_k(t+ε) + 2 Re(q_{k+1} e^{i2πf_{k+1}(t+ε)}),  [6.8]

which can be interpreted as passage of data from one layer to another in a deep, multilayer neural network (SI Appendix, Fig. S2). Eq. 6.8 implies that, during the predicting phase, relevant information flows only from variables encoding low frequencies to variables encoding high frequencies, but not in the reverse direction. In other words, in the process of predicting the environment, information propagates from slower variables to faster variables: that is, from genotype to phenotype, or from nucleic acids to proteins (hence, the Central Dogma). Because only the fast variables change in this process, the prediction of the state of the environment is rapid, as it is indeed required to be for the organism's survival. Conversely, in the process of learning the environment, information is back propagated in the opposite direction: that is, from faster to slower variables.
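The layer-by-layer mapping of Eq. 6.8 can be sketched in a few lines. In this minimal illustration (coefficient values and names are our own, not from the paper), each "layer" adds one higher-frequency mode to the running partial prediction, so information flows only from slower modes to faster ones, and the final layer reproduces the full sum of Eq. 6.3.

```python
import numpy as np

rng = np.random.default_rng(0)
f_k = np.arange(1.0, 8.0)                       # frequencies, ordered slow to fast
# Made-up stored coefficients q_k (complex, as in Eq. 6.1).
q = rng.normal(size=f_k.size) + 1j * rng.normal(size=f_k.size)

def forward_propagate(t_future):
    """Iterate Eq. 6.8: x_{k+1} = x_k + 2 Re(q_{k+1} exp(i 2 pi f_{k+1} t))."""
    x = 0.0
    for fk, qk in zip(f_k, q):                  # information flows slow -> fast only
        x = x + 2.0 * np.real(qk * np.exp(2j * np.pi * fk * t_future))
    return x

# The layered computation agrees with the direct sum of Eq. 6.3.
t_future = 0.37
direct = 2.0 * np.sum(np.real(q * np.exp(2j * np.pi * f_k * t_future)))
print(forward_propagate(t_future), direct)
```

The loop makes the one-way character of the forward pass explicit: each step reads only the already-computed lower-frequency partial sum, never a higher-frequency one, mirroring the genotype-to-phenotype direction of the text.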
However, this back propagation is not a microscopic reversal of the forward propagation but a distinct, much slower process (given that changes in slow variables are required) that involves mutation and selection.

Thus, the meaning of the generalized Central Dogma, from the point of view of the learning theory and our theory of evolution, is that slow dynamics (that is, evolution on a long timescale) should be mostly independent of the fast variables. In less formal terms, slow variables determine the rules of the game, and changing these rules depending on the results of some particular games would be detrimental for the organism. Optimization within the space of opportunities constrained by temporally stable rules is advantageous compared with optimization without such constraints. The trade-off between global and local optimization is a general, intrinsic property of frustrated systems (E2). For the system to function efficiently, the impact of local optimization on the global optimization should be restricted. The separation of the long-term and short-term forms of memory through different elemental bases (nucleic acids vs. proteins) serves this objective.

In this work, we outline a theory of evolution on the basis of the theory of learning. The parallel between learning and biological evolution becomes obvious as soon as the mapping between the loss function and the fitness function is established (Eq. 4.6). Indeed, both processes represent movement of an evolving (learning) system on a fitness (loss function) landscape, where adaptive (learning), upward moves are most consequential, although neutral moves are most common, and downward moves also occur occasionally. However, we go beyond the obvious analogy and trace a detailed correspondence between the essential features of the evolutionary and learning processes. Arguably, the most important fundamental commonality between evolution and learning is the stratification of the trainable variables (degrees of freedom) into classes that differ by the rate of change. At least in complex environments, all learning is multilevel, and so is all selection that is relevant for the evolutionary process. The framework of evolution as learning developed here implies that evolution of biological complexity would be impossible without MLS permeating the entire history of life. Under this perspective, emergence of new levels of organization, in learning and in evolution, and in particular, MTE represent genuine phase transitions as previously suggested (41). Such transitions can be analyzed consistently only in the thermodynamic limit, which is addressed in detail in the accompanying paper (86).

The origin of complexity and long-term memory from simple fundamental physical laws is one of the hardest problems in all of science. One popular approach is synergetics pioneered by Haken (91, 92) and the related nonequilibrium thermodynamics founded by Prigogine and Stengers (93) that employ mathematical tools of the theory of dynamical systems, such as theory of bifurcations and analysis of attractors. However, these concepts appear to be too general and oversimplified to usefully analyze biological phenomena, which are far more complex than dissipative structures that are central to nonequilibrium thermodynamics, such as, for example, autowave chemical reactions.

An alternative is the approach based on the theory of spin glasses (43, 94), which employs the mathematical apparatus of statistical physics and seems to provide a deeper insight into the origin of complexity. However, the energy landscape of spin glasses contains too many minima that are too shallow to account for long-term memory that is central to biology (12, 41). Thus, some generalization of the spin glass concept is likely to be required for productive application in evolutionary biology (95).

A popular and promising approach is self-organized criticality (SOC), a concept developed by Bak et al. (96, 97). Although relevant in biological contexts (12), SOC, by definition, implies self-similarity between different levels of organization, whereas biologically relevant complexity is rather associated with distinct emergent phenomena at different spatiotemporal scales (90).

A fundamental shortcoming of all these approaches is that they do not include, at least not as a major component, evolutionary concepts, such as natural selection. The framework of learning theory used here allows us to naturally unify the descriptions of physical and biological phenomena in terms of optimization by trial and error and loss (fitness) functions. Indeed, a key point of the present analysis is that most of our general principles apply to both living and nonliving systems.

The detailed correspondence between the key features of the processes of learning and biological evolution implies that this is not a simple analogy but rather, a reflection of the deep unity of evolutionary processes occurring in the universe. Indeed, separation of the relevant degrees of freedom into multiple temporal classes is ubiquitous in the universe from composite subatomic particles, such as protons, to atoms, molecules, life-forms, planetary systems, and galaxy clusters. If the entire universe is conceptualized as a neural network (36), all these systems can be considered emerging from the learning dynamics. Furthermore, scale separation and renormalizability appear to be essential conditions for a universe to be observable. According to the evolution theory outlined here, any observable universe consists of systems that undergo learning or synonymously, adaptive evolution, and actually, the universe itself is such a system (36). The famous dictum of Dobzhansky (98), thus, can and arguably should be rephrased as "[n]othing in the world is comprehensible except in the light of learning."

Within the theory of evolution outlined here, the difference between life and nonliving systems, however important, can be considered as one in the type and degree of optimization, so that all evolutionary phenomena can be described within the same formal framework of the theory of learning. Crucially, any complex optimization problem can be addressed only with a stochastic learning algorithm: hence, the ubiquity of selection. Origin of life can then be conceptualized within the framework of multilevel learning as we explicitly show in the accompanying paper (86). The point when life begins can be naturally associated with the emergence of a distinct class of slowly changing variables that are digitized and thus, can be accurately replicated; these digital variables store and supply information for forward propagation to predict the state of the environment. In biological terms, this focal point corresponds to the advent of replicators (genomes) that carry information on the operation of reproducers within which they reside (99). This is also the point when natural (Darwinian) selection takes off (64). Our theory of evolution implies that this pivotal stage was preceded by evolution of prelife, which comprised reproducers that lacked genomes but nevertheless, were learning systems that were subject to selection for persistence. Self-reproducing micelles that harbor autocatalytic protometabolic reaction networks appear to be plausible models of such primordial reproducers (100). The first replicators (RNA molecules) would evolve within these reproducers, perhaps, initially, as molecular parasites (E9) but subsequently, under selection for the ability to store, express, and share information essential for the entire system. This key step greatly increased the efficiency of evolution/learning and provided for long-term memory that persisted throughout the history of life, enabling the onset of natural selection and the unprecedented diversification of life-forms (E5). 
It has to be emphasized that, compared with the existing evolutionary models that explore replicator dynamics, the learning approach described here is more microscopic in that the existence of replicators is not initially assumed but rather, appears as an emergent property of multilevel learning dynamics. For learning to be efficient, the capacity of the system to add new adaptable variables is essential. In biological terms, this implies expandability of the genome (that is, the ability to add new genes), which necessitated the transition from RNA to DNA as the genome substrate given the apparent intrinsic size constraints on replicating RNA molecules. Another essential condition for efficient learning is information sharing, which in the biological context, corresponds to horizontal gene transfer. The essentiality of horizontal gene transfer at the earliest stages of life evolution is perceived as the cause of the universality of the translation machinery and genetic code in all known life-forms (101). The conceptual model of the origin of life implied by our learning-based theoretical framework appears to be fully compatible with Gánti's chemoton, a model of protocell emergence and evolution based on autocatalytic reaction networks (102–104).
