Recurrent quantum embedding neural network and its application in vulnerability detection | Scientific Reports – Nature.com


In this section, we first introduce the composition and principles of the key components of RQENN, including the trainable encoding method based on parameterized binary indexes and the recurrent cell. We then introduce the RQENN-based classification model. Finally, we present the task flow for applying the RQENN classification model to vulnerability detection.

Classical neural network models for NLP tasks first tokenize the text and build a word dictionary, according to which the text is converted into a sequence of numeric word indexes. Each numeric index corresponds to a one-hot vector, and these are transformed into dense vectors by a word embedding method trained jointly with the network to obtain a more accurate vector representation. However, similar methods do not work when migrated to QNNs. Specifically, in the classical model the one-hot vectors are sparse and orthogonal, which means that during word embedding each word receives only some of the weights from the embedding weight matrix $W$ as its representation vector. This process can be viewed as using the one-hot vector as a key to query the corresponding value in the weight map $W$, as shown in Fig. 1a. Thus, when the weights are randomly initialized, the initial representation vectors of all words are uncorrelated, and lexical connections are established as training proceeds. In the quantum model, however, owing to quantum superposition and entanglement, the quantum state obtained from an encoded word (e.g., $|\psi_1\rangle$ in Fig. 1b) is difficult to make as sparse and orthogonal as a classical one-hot vector. This implies that the initially encoded quantum states of different words are already related in some way, and using such a state as the "key" inevitably yields a "value" that depends on all the elements of the unitary weight matrix and therefore contains various non-semantic connections. This prevents a QNN from learning word semantics through the quantum embedding method.

(a) Classical word embedding computation process. The one-hot vector is treated as a key to query a value in the map of weights. (b) The process of the token 'NULL' being encoded into a quantum circuit. The state $|\psi_1\rangle$ obtained after the rotation input layer is treated as a quantum one-hot vector. QEmbedding is then applied to obtain $|\psi_2\rangle$, which can be treated as a quantum dense vector used as the token representation.

To address this problem, we propose a trainable encoding method based on parameterized binary index to encode code tokens as quantum state data and efficiently learn the semantics of the tokens. The specific steps are as follows:

Step I: Tokenize the source code into tokens, build a dictionary, and map the tokens to numeric indexes.

Step II: Convert the numeric indexes from decimal to binary representation. For a dictionary containing $N$ words, each index is represented by an $n=\lceil \log_2 N\rceil$-bit binary number.

Step III: Replace "0" and "1" in the binary indexes with the trainable parameters $\theta_0$ and $\theta_1$, forming parameterized binary indexes.

Step IV: Encode the parameterized binary index using $n=\lceil \log_2 N\rceil$ qubits. Each bit of the input is encoded onto the corresponding qubit through an Ry rotation gate.

Step V: Add a trainable layer containing parameters to the quantum circuit as a quantum embedding (QEmbedding) implementation.

The Ry rotation input layer and the QEmbedding layer in the above steps together form the trainable encoding layer.
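To make Steps I–III concrete, here is a minimal Python sketch (ours, not the paper's code) that builds a token dictionary from a toy corpus and converts each numeric index into a parameterized binary index. The symbolic names `theta_0`/`theta_1` stand for the trainable angles of Step III, and the exact index each token receives depends on the dictionary-building order.

```python
# Minimal sketch of Steps I-III: tokens -> numeric indexes -> parameterized
# binary indexes. Corpus and index assignment are illustrative only.
from math import ceil, log2

corpus = ["VAR1 = NULL", "VAR2 = NULL", "VAR1 = VAR2"]

# Step I: tokenize and build a dictionary mapping tokens to numeric indexes.
vocab = {}
for line in corpus:
    for tok in line.split():
        vocab.setdefault(tok, len(vocab))

# Step II: convert each index to an n-bit binary string, n = ceil(log2(N)).
n = max(1, ceil(log2(len(vocab))))
binary = {tok: format(idx, f"0{n}b") for tok, idx in vocab.items()}

# Step III: replace '0'/'1' with the trainable symbols theta_0 / theta_1,
# which become real-valued Ry angles during training (Step IV).
param_index = {tok: tuple("theta_0" if b == "0" else "theta_1" for b in bits)
               for tok, bits in binary.items()}

print(param_index)
```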

As a simple example, for the following source code training set:

$$\left[\ \text{"VAR1 = NULL"},\ \text{"VAR2 = NULL"},\ \text{"VAR1 = VAR2"}\ \right],$$

we can build such a dictionary to map all code tokens to numeric indexes:

$$\left\{\ \text{'='}: 0,\ \text{'VAR1'}: 1,\ \text{'NULL'}: 2,\ \text{'VAR2'}: 3\ \right\}.$$

These numeric indexes are further converted to parameterized binary indexes:

$$\left\{\ \text{'='}: \theta_0\theta_0,\ \text{'VAR1'}: \theta_0\theta_1,\ \text{'NULL'}: \theta_1\theta_0,\ \text{'VAR2'}: \theta_1\theta_1\ \right\}.$$

Next, we determine the angles of the Ry gates and construct the quantum circuit from the parameterized binary index of the word to be encoded. Taking the token 'NULL' as an example, as shown in Fig. 1b, its index $\theta_1\theta_0$ is encoded bit by bit into a circuit with 2 qubits as the Ry gate angles. A QEmbedding layer is then added to form the complete quantum trainable encoding circuit.

As shown in Eqs. (1)–(3) below, $|\psi_0\rangle$ is the initial state. The quantum circuit encodes the index $\theta_1\theta_0$ into the quantum state $|\psi_1\rangle$ through the rotation input layer. $|\psi_1\rangle$ is a $2^n$-dimensional vector with $N$ different cases, corresponding to the $N$ possible combinations of the input rotation layer parameters. The parameters $\theta_0$ and $\theta_1$ are involved in the training process of the QNN to eliminate possible inherent connections, so that $|\psi_1\rangle$ plays the same role as a classical one-hot vector: the one-hot vector is a transformed form of the symbols that is easy for a classical network to use, and $|\psi_1\rangle$ is a transformed form of the symbols that is easy for a QNN to use. This is the unique aspect of trainable encoding based on parameterized binary indexes and the key to improving model performance. Next, $|\psi_1\rangle$ learns lexical connections between encoded words through a trainable QEmbedding layer $U_{qe}(\boldsymbol{\theta}_{qe})$, similar in principle to classical word embedding. The resulting quantum state $|\psi_2\rangle$ is described in Eq. (3), where $U_{qe}(\boldsymbol{\theta}_{qe})={\left[\begin{array}{cccc}\mathbf{u}_0^{\dagger} & \mathbf{u}_1^{\dagger} & \mathbf{u}_2^{\dagger} & \mathbf{u}_3^{\dagger}\end{array}\right]}^{\dagger}$. At this point, the $2^n$-dimensional dense vector corresponding to $|\psi_2\rangle$ is the representation of the word, except that the word is converted from an index into a quantum-friendly quantum state representation instead of a classical vector representation.

$$|\psi_0\rangle = {\left[\begin{array}{cccc}\varepsilon_0 & \varepsilon_1 & \varepsilon_2 & \varepsilon_3\end{array}\right]}^{\dagger}$$

(1)

$$|\psi_1\rangle = \begin{cases} Ry(\theta_0)\otimes Ry(\theta_0)\,|\psi_0\rangle, & \text{index}=\theta_0\theta_0 \\ Ry(\theta_0)\otimes Ry(\theta_1)\,|\psi_0\rangle, & \text{index}=\theta_0\theta_1 \\ Ry(\theta_1)\otimes Ry(\theta_0)\,|\psi_0\rangle, & \text{index}=\theta_1\theta_0 \\ Ry(\theta_1)\otimes Ry(\theta_1)\,|\psi_0\rangle, & \text{index}=\theta_1\theta_1 \end{cases}$$

(2)

$$|\psi_2\rangle = U_{qe}(\boldsymbol{\theta}_{qe})\,|\psi_1\rangle = {\left[\begin{array}{cccc}\mathbf{u}_0^{\dagger}|\psi_1\rangle & \mathbf{u}_1^{\dagger}|\psi_1\rangle & \mathbf{u}_2^{\dagger}|\psi_1\rangle & \mathbf{u}_3^{\dagger}|\psi_1\rangle\end{array}\right]}^{\dagger}$$

(3)
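The following PennyLane sketch illustrates the 2-qubit trainable encoding circuit of Eqs. (1)–(3): an Ry rotation input layer that selects $\theta_0$ or $\theta_1$ per qubit, followed by a small alternating Ry/CNOT ansatz standing in for the QEmbedding unitary $U_{qe}$. The ansatz depth and the parameter initialization are our assumptions, not the paper's exact circuit.

```python
# Hedged sketch of the trainable encoding layer for the 2-qubit example.
import pennylane as qml
from pennylane import numpy as np

n_qubits = 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def trainable_encoding(index_bits, theta01, qe_weights):
    # Rotation input layer (Eq. 2): bit i selects theta_0 or theta_1.
    for i, bit in enumerate(index_bits):
        qml.RY(theta01[bit], wires=i)                  # prepares |psi_1>
    # QEmbedding stand-in (Eq. 3): alternating Ry layers and CNOT entanglers.
    for layer in range(qe_weights.shape[0]):
        for i in range(n_qubits):
            qml.RY(qe_weights[layer, i], wires=i)
        qml.CNOT(wires=[0, 1])
    return qml.state()                                 # |psi_2>

theta01 = np.array([-np.pi / 2, np.pi / 2])            # [theta_0, theta_1]
qe_weights = np.random.uniform(0, 2 * np.pi, (2, n_qubits))

# Token 'NULL' has parameterized binary index theta_1 theta_0 -> bits (1, 0).
psi_2 = trainable_encoding((1, 0), theta01, qe_weights)
```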

Compared with the trainable encoding method based on parameterized binary indexes, if the binary index obtained in Step II is used directly for encoding, the fixed rotation gate angles (0 or 1) produce constant non-lexical connections between the states $|\psi_1\rangle$ of different words. These connections are carried into the training of the quantum word embedding layer and may interfere with the normal learning of lexical connections between code tokens. In fact, the $N$ quantum states $|\psi_1\rangle$ can also be made mutually orthogonal, like classical one-hot vectors, by choosing suitable fixed rotation angles; we call this the "orthogonal method". It determines the specific angles $\theta_0$ and $\theta_1$ used to replace the binary "0" and "1" before training. By applying $N$ different rotation layers with angles $\theta_0$ and $\theta_1$ to $N$ independent quantum circuits, we obtain $N$ quantum states. We then use gradient descent to minimize the sum of the absolute values of the pairwise inner products of these $N$ states under random initialization of the initial quantum states. This approach borrows the mutual orthogonality of one-hot vectors and ultimately yields $\theta_0=-\frac{\pi}{2}$ and $\theta_1=\frac{\pi}{2}$; encoding with these values makes the states $|\psi_1\rangle$ of different tokens as orthogonal as possible. There are, however, differences between QEmbedding and classical embedding. Every element of the weight matrix in classical embedding is a trainable parameter, whereas QEmbedding controls the matrix only through a small number of parameterized unitary gates. It cannot be proved that keeping $|\psi_1\rangle$ always orthogonal is more helpful for learning $|\psi_2\rangle$. Therefore, in this paper we add $\theta_0$ and $\theta_1$ as trainable parameters to the learning process of RQENN, which is the reason for using parameterized binary indexes in Step III. In the Results section we show the model's performance when using trainable encoding based on the binary index, the orthogonal method, and the parameterized binary index as data inputs, further demonstrating the effectiveness of the proposed method.
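As a quick sanity check of the orthogonal method (our own verification, not from the paper), the following NumPy snippet shows that with $\theta_0=-\frac{\pi}{2}$ and $\theta_1=\frac{\pi}{2}$ the four encoded 2-qubit states are exactly orthogonal when the initial state is $|00\rangle$; for random initial states they are only approximately orthogonal, which is what the optimization described above targets.

```python
# Verify pairwise orthogonality of the four encoded states for |00>.
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

theta_0, theta_1 = -np.pi / 2, np.pi / 2
psi_0 = np.zeros(4)
psi_0[0] = 1.0                                  # initial state |00>

states = [np.kron(ry(a), ry(b)) @ psi_0
          for a in (theta_0, theta_1) for b in (theta_0, theta_1)]

gram = np.abs(np.array([[u @ v for v in states] for u in states]))
print(np.round(gram, 6))                        # identity => mutually orthogonal
```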

The trainable encoding method defined above is a crucial component of our recurrent quantum embedding neural network cell. Much like in classical RNNs, we define a cell that is successively applied to the network input to capture contextual connections in the code. More specifically, the cell comprises a trainable encoding stage and a working stage, which learn the semantics of the input tokens and memorize contextual dependencies, respectively. The cell is applied iteratively in RQENN, and its internal state is passed to the next iteration of the network. RQENN cells at all time steps share the same trainable parameters.

Figure 2 illustrates the RQENN cell, which learns the quantum word embedding of the current time step input $\mathbf{x}_t=(x_{t_0},\dots,x_{t_n})$ in the encoding stage and combines it with the cell's input hidden state $|\psi_{t-1}\rangle$ in the working stage to learn the mapping from this combined state to the cell's output hidden state $|\psi_t\rangle$. The equation for this process is as follows:

$$|\psi_t\rangle = U_{qnn}\,U_{qe}\,U_{in}(\mathbf{x}_t)\,|\psi_{t-1}\rangle$$

(4)

where $U_{in}$, $U_{qe}$ and $U_{qnn}$ denote the unitary matrices of the rotation input layer, the QEmbedding layer and the quantum weight (QWeight) layer, respectively.

Recurrent quantum embedding neural network cell. It consists of a trainable encoding stage and a QNN working stage, where the encoding stage operates as described in the previous section. At each time step the cell transforms the incoming internal state into the outgoing state and iterates this process.

The encoding stage uses the trainable encoding method described above. In the rotation input layer, an Ry gate applied on the $i$th qubit rotates by the $i$th value of the parameterized binary index. In the QEmbedding layer, an $m$-layer ansatz composed of alternating rotation layers and entanglement layers learns the quantum word embedding representation; each ansatz layer consists of 2 rotation layers and 2 entanglement layers built from staggered entangling gates between adjacent qubits. In the working stage, an $n$-layer one-dimensional alternating layered hardware-efficient ansatz [48,49] builds the QWeight layer. This ansatz sequentially applies a two-qubit unitary to adjacent qubits, with each two-qubit unitary entangling the last qubit of the previous unitary with the next one. The unique recurrent circuit cell architecture, with scalable layers at each stage, is key to improving model performance. The two-qubit unitary consists of two Ry gates and a CNOT gate, a construction that has been shown to be effective [50]; its unitary transformation is given by Eq. (5). The specific implementations of the different network layers are given by the following equations: Eq. (6) is the unitary transformation of the rotation input layer, where $t\in\{1,\dots,T\}$ is the time step, one token is input into the network at each time step, and $T$ is the total code length. Equations (7), (8) and (9) are the unitary transformations of the QEmbedding layer and the QWeight layer, respectively.

$$U_{l,i}^{[2]}(\boldsymbol{\theta}_{l,i}) = \mathrm{CNOT}_{i,i+1}\,\bigotimes_{j=0}^{1} Ry_{i+j}(\theta_{l,i,j}),\quad l\in\{0,1\}\ \text{and}\ i\in\{0,\dots,n-2\}$$

(5)

$$U_{in}(\mathbf{x}_t) = \bigotimes_{i=0}^{n}\left(Ry_i(x_{t_i})\right),\quad x_{t_i}\in\{\theta_0,\theta_1\}\ \text{and}\ t\in\{1,\dots,T\}$$

(6)

$$U_{qe_l}(\boldsymbol{\theta}_{qe_l}) = \prod_{i=1}^{\lfloor (n-1)/2\rfloor}\mathrm{CNOT}_{2i-1,2i}\,\bigotimes_{i=0}^{n-1} Ry_i(\theta_{l,n+i})\,\prod_{i=1}^{\lfloor n/2\rfloor}\mathrm{CNOT}_{2i-2,2i-1}\,\bigotimes_{i=0}^{n-1} Ry_i(\theta_{l,i})$$

(7)

$$U_{qe}(\boldsymbol{\theta}_{qe}) = U_{qe_1}(\boldsymbol{\theta}_{qe_1})\,U_{qe_0}(\boldsymbol{\theta}_{qe_0})$$

(8)

$$U_{qnn}(\boldsymbol{\theta}_{qnn}) = U_{1,n-2}^{[2]}(\boldsymbol{\theta}_{1,n-2})\cdots U_{1,0}^{[2]}(\boldsymbol{\theta}_{1,0})\,U_{0,n-2}^{[2]}(\boldsymbol{\theta}_{0,n-2})\cdots U_{0,0}^{[2]}(\boldsymbol{\theta}_{0,0})$$

(9)
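Putting Eqs. (5)–(9) together, the following PennyLane sketch expresses one RQENN cell as three gate-layer functions, to be used inside a QNode as in the classifier sketch further below. The parameter-array shapes follow from the equations; everything else (naming, framework) is our own convention.

```python
# Hedged sketch of one RQENN cell on n qubits, following Eqs. (5)-(9).
import pennylane as qml

def rotation_input_layer(x_t, theta01, n):
    # Eq. (6): bit i of the parameterized binary index sets the Ry angle.
    for i in range(n):
        qml.RY(theta01[x_t[i]], wires=i)

def qembedding_layer(theta_qe, n):
    # Eqs. (7)-(8): two sub-layers, each an Ry rotation layer followed by
    # staggered CNOTs between adjacent qubits. theta_qe has shape (2, 2n).
    for l in range(2):
        for i in range(n):
            qml.RY(theta_qe[l, i], wires=i)
        for i in range(0, n - 1, 2):           # even-pair entanglers
            qml.CNOT(wires=[i, i + 1])
        for i in range(n):
            qml.RY(theta_qe[l, n + i], wires=i)
        for i in range(1, n - 1, 2):           # odd-pair entanglers
            qml.CNOT(wires=[i, i + 1])

def qweight_layer(theta_qnn, n):
    # Eqs. (5) and (9): two sweeps of the two-qubit unitary (two Ry gates
    # plus a CNOT) over adjacent qubit pairs. theta_qnn: shape (2, n-1, 2).
    for l in range(2):
        for i in range(n - 1):
            qml.RY(theta_qnn[l, i, 0], wires=i)
            qml.RY(theta_qnn[l, i, 1], wires=i + 1)
            qml.CNOT(wires=[i, i + 1])

def rqenn_cell(x_t, theta01, theta_qe, theta_qnn, n):
    # Eq. (4): |psi_t> = U_qnn U_qe U_in(x_t) |psi_{t-1}>.
    rotation_input_layer(x_t, theta01, n)
    qembedding_layer(theta_qe, n)
    qweight_layer(theta_qnn, n)
```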

We use the RQENN cell to build classifiers for vulnerability detection. Similar to an RNN, RQENN initializes the hidden state at $t=0$ by first adding a layer of Hadamard gates; the RQENN cell is then iteratively applied to the input source code sequence $\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_T$, as shown in Fig. 3, to capture the contextual connections in the source code. The model then measures a single-qubit expectation value on each of the last two qubits, described by Eq. (10):

$$E_i(\mathbf{X},\boldsymbol{\Theta}) = \langle 0^{\otimes n}|\,(H^{\dagger})^{\otimes n}\,U_{QC}^{\dagger}(\mathbf{X},\boldsymbol{\Theta})\,\widehat{M}_i\,U_{QC}(\mathbf{X},\boldsymbol{\Theta})\,H^{\otimes n}\,|0^{\otimes n}\rangle,\qquad i\in\{n-1,n-2\}$$

(10)

where $U_{QC}(\mathbf{X},\boldsymbol{\Theta})=U_{cell}(\mathbf{x}_1,\boldsymbol{\Theta})\cdots U_{cell}(\mathbf{x}_T,\boldsymbol{\Theta})$ is the quantum circuit composed of all cells, and $U_{cell}(\mathbf{x}_t,\boldsymbol{\Theta})=U_{qnn}(\boldsymbol{\theta}_{qnn})\,U_{qe}(\boldsymbol{\theta}_{qe})\,U_{in}(\mathbf{x}_t,\boldsymbol{\theta}_{in})$. Here $\boldsymbol{\Theta}$ is the parameter set of the cell and $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_T]$ is the input index sequence. $\widehat{M}_i$ is the operator used to calculate the expectation value of the $i$th qubit, i.e.

$$\widehat{M}_i = \begin{cases} I\otimes I\otimes\cdots\otimes\sigma_z\otimes I, & i=n-2 \\ I\otimes I\otimes\cdots\otimes I\otimes\sigma_z, & i=n-1 \end{cases}$$

(11)

RQENN classifier. The model is built by iteratively applying the same RQENN cell to the input code token sequence. Measurements are performed on the last two qubits separately to obtain the expectation values used as classification logits.

The two computed expectation values determine the data category by comparing their magnitudes; we use them as logits in the cross-entropy loss function for classification.
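A hedged end-to-end sketch of the classifier of Fig. 3 and Eqs. (10)–(11), reusing the `rqenn_cell` function sketched above: Hadamard initialization at $t=0$, the shared cell applied once per token, and Pauli-Z expectation values on the last two qubits used as logits. The qubit count, the mapping of labels to qubits, and the softmax form are our assumptions.

```python
# Hedged sketch of the RQENN classifier and its cross-entropy loss.
import pennylane as qml
from pennylane import numpy as np

n = 4                                        # qubits = bits per token index
dev = qml.device("default.qubit", wires=n)

@qml.qnode(dev)
def rqenn_classifier(X, theta01, theta_qe, theta_qnn):
    for i in range(n):                       # Hadamard init of the hidden state
        qml.Hadamard(wires=i)
    for x_t in X:                            # the same cell at every time step
        rqenn_cell(x_t, theta01, theta_qe, theta_qnn, n)
    # Eqs. (10)-(11): single-qubit expectations on the last two qubits.
    return qml.expval(qml.PauliZ(n - 2)), qml.expval(qml.PauliZ(n - 1))

def cross_entropy(logits, label):
    # Softmax over the two expectation values, treated as logits; we assume
    # label 1 corresponds to the second expectation value.
    z0, z1 = logits
    p1 = np.exp(z1) / (np.exp(z0) + np.exp(z1))
    return -np.log(p1 if label == 1 else 1 - p1)
```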

The goal of our vulnerability detection task is to determine, using the RQENN classifier, whether a program's source code may contain vulnerabilities. In this paper, we perform the vulnerability detection task using the pipeline shown in Fig. 4, which consists of the following three steps:

Vulnerability detection task flow. We extract normalized labeled code gadgets from the source code as training data and then generate parameterized binary indexes from them, which are fed into the RQENN classifier. After training, the model can detect the presence of vulnerabilities in the source code.

Step I: Generate normalized code gadgets and labels from source code. First, we extract the data dependency graph (DDG) of the code using the open-source code analysis tool Joern. Next, we extract labeled code gadgets based on manually defined vulnerability features. Specifically, we locate the node containing the vulnerable library function/API call in the extracted DDG, such as the "strcat" function shown on the left side of Fig. 4, and slice the code into small pieces according to the connections to that node. API calls are categorized into forward (e.g., the "recv" function) and backward API calls (e.g., the "strcat" function here) according to whether or not they take external input from a socket, and forward and backward slices are generated accordingly. A forward slice collects the statements of the DDG nodes reachable forward from the API node, and a backward slice collects the statements of the DDG nodes from which the API node is reachable. These slices are the code gadgets, which are labeled '0' or '1' according to whether they contain vulnerabilities. Next, we normalize the code gadgets: we remove comments and strings and normalize user-defined variable names ('VAR1', etc.) and function names ('FUN1', etc.), yielding the normalized labeled code gadgets. A minimal sketch of this normalization is shown below.
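As a rough illustration of the normalization in Step I (not the paper's actual tooling, which relies on Joern), the following sketch strips comments and string literals from a C code gadget and renames user-defined functions to 'FUN1'-style placeholders; variable renaming to 'VAR1' etc. works analogously but requires declaration analysis, omitted here. Real pipelines use a parser rather than regexes, and the API list is illustrative.

```python
# Simplified code-gadget normalization: comments, strings, function renaming.
import re

def normalize_gadget(code, known_apis=("strcat", "recv", "memcpy")):
    keywords = {"if", "for", "while", "switch", "return", "sizeof"}
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)      # block comments
    code = re.sub(r"//[^\n]*", "", code)                   # line comments
    code = re.sub(r'"(?:\\.|[^"\\])*"', '""', code)        # string literals
    # Rename user-defined functions to FUN1, FUN2, ...; library APIs are kept.
    rename = {}
    for name in re.findall(r"\b([A-Za-z_]\w*)\s*\(", code):
        if name not in known_apis and name not in keywords and name not in rename:
            rename[name] = f"FUN{len(rename) + 1}"
    for old, new in rename.items():
        code = re.sub(rf"\b{old}\b", new, code)
    return code
```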

Step II: The normalized labeled code gadgets are treated as text data from which parameterized binary index sequences are generated. First, we preprocess the data set by cleaning the original text and removing punctuation marks, non-ASCII characters, etc. Then we split and pad the preprocessed text and build a dictionary, which is converted into a parameterized binary index dictionary according to the method described above. Finally, the tokenized sequences are converted into parameterized binary index sequences according to the dictionary.

Step III: Train and evaluate the RQENN model. We feed the sequences of parameterized binary indexes into the RQENN model in order, execute the quantum circuit on a simulator or a real machine, and complete the training and validation of the RQENN model following the quantum circuit learning framework [18]. The trained model can then detect the presence of vulnerabilities in source code. A minimal training-loop sketch is shown below.
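Finally, a minimal training-loop sketch for Step III, assuming the classifier and loss sketched above and a toy stand-in for the Step II output; the optimizer, step size, epoch count, and data are illustrative assumptions.

```python
# Hedged training loop for the RQENN classifier (gradient descent).
import pennylane as qml
from pennylane import numpy as np

# Toy stand-in for the Step II output: sequences of n-bit indexes plus labels.
dataset = [([(0, 1, 1, 0), (1, 0, 0, 1)], 1),
           ([(0, 0, 0, 1), (1, 1, 0, 0)], 0)]

theta01 = np.array([-np.pi / 2, np.pi / 2], requires_grad=True)
theta_qe = np.array(np.random.uniform(0, 2 * np.pi, (2, 2 * n)),
                    requires_grad=True)
theta_qnn = np.array(np.random.uniform(0, 2 * np.pi, (2, n - 1, 2)),
                     requires_grad=True)

def cost(theta01, theta_qe, theta_qnn, X, y):
    return cross_entropy(rqenn_classifier(X, theta01, theta_qe, theta_qnn), y)

opt = qml.GradientDescentOptimizer(stepsize=0.1)
for epoch in range(20):
    for X, y in dataset:
        theta01, theta_qe, theta_qnn = opt.step(
            cost, theta01, theta_qe, theta_qnn, X=X, y=y)
```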
