Transformer models as an efficient replacement for statistical test suites to evaluate the quality of random numbers

1st Rishabh Goel
Monta Vista High School
Cupertino, United States
[email protected]

2nd Yizi Xiao
Optum
Eden Prairie, United States
[email protected]

3rd Ramin Ramezani
Department of Computer Science, UCLA
Los Angeles, United States
[email protected]

Abstract

Random numbers are essential in a wide range of fields, and validating their quality is critical wherever security depends on them. A Quantum Random Number Generator (QRNG) can in theory produce truly random numbers, but the quality of its output still needs to be thoroughly validated. This task has generally been delegated to statistical tests such as those in the NIST Statistical Test Suite (STS), which are often slow and run one test at a time. Our work presents a deep learning model based on the Transformer architecture that 1) performs multiple NIST STS tests at once and 2) runs much faster. The model outputs multi-label classification results indicating which of these statistical tests a sequence passes. We performed a thorough hyper-parameter optimization to converge on the best possible model and achieved a Macro F1-score above 0.96. We also compared this model to a conventional deep learning method for quantifying randomness, Long Short Term Memory Recurrent Neural Networks (LSTMs), and showed that our model achieves similar performance while being much more efficient and scalable. The high performance and efficiency of this Transformer-based model show that it can be a viable replacement for the NIST STS for validating random numbers.

Index Terms—Transformers, Deep learning, Random numbers, Multi-label classification, LSTMs

I Introduction

Random numbers serve an important purpose in many fields. In cryptography, they are studied and used extensively to build secure encryption schemes that keep our data safe [1, 2, 3, 4]. In physics, randomness is studied closely where it arises in quantum mechanics and thermodynamics [5, 6]. Given this vast array of applications, the validation of random numbers is also a critical issue [7]. Most encryption schemes today use pseudo-random numbers produced by pseudo-random number generators, and their validation has been delegated to a variety of statistical tests [8].

Along with using statistical tests for validating random numbers, deep learning and its derivatives can be used for determining the randomness of random numbers [9]. The uses of deep learning in this area have branched out to many of its facets such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short Term Memory RNNs (LSTMs) [10, 11, 12].

More recently, the introduction of the Transformer model has allowed further exploration of deep learning in the context of validating random numbers [13]. In a Transformer, the self-attention head is designed to detect sequential patterns [14]. Thus, in applications where sequential data is available and easy to represent, the Transformer has been quite successful, especially in Natural Language Processing (NLP) but also in many other fields [15, 16, 17, 18, 19].

Since a random sequence by definition has minimal sequential patterns while a non-random sequence does not, we reasoned that the self-attention mechanism of the Transformer architecture should be effective at quantifying the randomness of random numbers, which was confirmed by Li et al. [20]. Traditionally, LSTMs were a popular deep learning method for quantifying the randomness of random numbers; they were quite accurate, but their complexity and their inability to be parallelized made them slow and inefficient [21]. As Li et al. described, the Transformer architecture solves these problems. Li et al. showed that the Transformer is well equipped to predict the next token and used that prediction as a metric for validating random numbers. We instead quantify the quality of randomness more directly by encoding various statistical tests, rather than predicting what the next random number will be.

We hypothesized that a deep learning architecture based on Transformers would quantify both the degree and the type of randomness of random numbers efficiently and accurately enough to be considered a potential replacement for the NIST STS. In this paper, we present an encoder-only Transformer model that encodes and replaces statistical tests used to quantify the quality of random numbers. The goal is a single model that maps a binary sequence to the probabilities of passing different statistical tests, thereby quantifying how random the sequence is as well as the type of randomness it exhibits. We experimented with different architectures and hyperparameters and converged on the most optimized architecture for this problem. We show that, by utilizing the Transformer architecture, our model can describe the type and degree of randomness of a binary sequence with performance comparable to an LSTM while improving efficiency over both the original NIST STS and the LSTM, making it a strong potential replacement.

II Materials and Methods

Non-deep learning based methods of validating random numbers include using a variety of statistical test suites such as the DIEHARD and DIEHARDER statistical test suites along with the NIST STS [22, 23, 24]. These suites contain a multitude of tests, each testing for a particular type of randomness. The purpose of this work is to encode a portion of these tests into a single Transformer in order to test random numbers on multiple different tests at the same time.

II-A Dataset

The training data for our model was first generated from truly random numbers produced by a Quantum Random Number Generator (QRNG), then augmented so that the dataset contained non-random as well as random sequences. We ran the NIST STS to generate labels for our data. Running a NIST STS test on a binary sequence produces a p-value for that sequence. We used the alpha values outlined by Rukhin et al. in the definition of the NIST STS as thresholds for deciding whether a sequence passes each test [24]. Running seven tests on each sequence therefore yields a seven-entry binary (multi-hot) label vector for each input sequence, with one pass/fail entry per test. The purpose of the model is to perform multi-label classification so that the user can find out which tests a random number passes and which it fails. Thus, our model outputs a vector in which each entry is the probability of the random number passing the corresponding test. We encoded seven tests from the NIST STS: Frequency, Block Frequency, Runs, Longest Run of Ones, Discrete Fourier Transform, Nonperiodic Template Matching, and Cumulative Sums. We augmented the data to generate non-random results for five of them: Frequency, Block Frequency, Runs, Longest Run of Ones, and Nonperiodic Template Matching. We found that augmenting the data for these five tests also caused binary sequences to fail the Discrete Fourier Transform and Cumulative Sums tests, so no separate augmentation was needed for the model to encode those two. The details of how these tests work are given by Rukhin et al. in their definition of the NIST STS [24]. The data augmentation techniques were adopted from Nagy et al. with minimal changes [10].
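The labeling step can be illustrated with the simplest of the seven tests. The sketch below computes the Frequency (Monobit) p-value as defined in the NIST STS and thresholds a list of p-values into a pass/fail label vector. The significance level of 0.01 is an assumption (it is the NIST STS default; the text only says the alpha values of Rukhin et al. were used), and the function names and the example p-values are hypothetical.

```python
import math

ALPHA = 0.01  # assumed significance level (NIST STS default), not stated in the text


def frequency_test_p_value(bits):
    """NIST STS Frequency (Monobit) test p-value for a list of 0/1 integers."""
    n = len(bits)
    # Map 0 -> -1 and 1 -> +1, then sum the resulting sequence.
    s_n = sum(2 * b - 1 for b in bits)
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def p_values_to_labels(p_values, alpha=ALPHA):
    """Turn one p-value per test into a binary label vector (1 = pass, 0 = fail)."""
    return [1 if p >= alpha else 0 for p in p_values]


balanced = [0, 1] * 256  # 512 bits with equal numbers of zeros and ones
biased = [1] * 512       # 512 bits, all ones

# Hypothetical p-values for the seven encoded tests on a single sequence.
example_p_values = [0.53, 0.002, 0.21, 0.74, 0.009, 0.35, 0.66]
example_labels = p_values_to_labels(example_p_values)
```

A sequence that fails several tests simply has several zeros in its label vector, which is why the task is multi-label rather than multi-class.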

II-B Model Training and Validation

Figure 1: Baseline model architecture. The tokenizer includes positional encoding and embeddings. The flatten layer is swapped with the averaging layer when constructing the final model.

We began by taking the binary sequence and tokenizing it to reduce its dimensionality. We trained many models to experiment with hyper-parameters which included varying the input size. We used inputs of size 512, 1024, and 2048 bits to train our models due to their relevance in cryptography [25] as well as resource constraints barring us from using input sizes past 2048 bits. Our tokenization technique was simply getting the integer representation of every 16 bits. This would create a vocabulary for the Transformer of size 65536 which was sufficiently large for model training. This means that our input would reduce to 32, 64, or 128 tokens, which we would then run through the model whose architecture is shown above in Figure 1.
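The tokenization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tokenize` is hypothetical, and reading each 16-bit chunk most-significant-bit first is an assumption, since the text does not state the bit order.

```python
def tokenize(bits, bits_per_token=16):
    """Group a list of 0/1 integers into fixed-width chunks and return their integer values."""
    if len(bits) % bits_per_token != 0:
        raise ValueError("sequence length must be a multiple of bits_per_token")
    tokens = []
    for start in range(0, len(bits), bits_per_token):
        value = 0
        # Assumed bit order: the first bit of the chunk is the most significant.
        for bit in bits[start:start + bits_per_token]:
            value = (value << 1) | bit
        tokens.append(value)
    return tokens


# 512, 1024, and 2048 bits reduce to 32, 64, and 128 tokens respectively.
token_counts = [len(tokenize([0] * n)) for n in (512, 1024, 2048)]
```

Every token lies in the range 0 to 65535, which matches the vocabulary size of 65536 given above.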

Model training was done with three datasets, one per input size, each containing 100000 labeled binary sequences with a 60-20-20 train-validation-test split. To monitor training, we tracked aggregate F1 per batch along with a loss suited to multi-label classification [26, 27]. We trained the model using sigmoid outputs with binary cross-entropy loss and the Adam optimizer with the default parameters defined by Kingma et al. [28], and validated it with Macro, Micro, Weighted, and Sample F1 scores, whose equations are given in Appendix -B. All training, testing, and analysis was performed on a single computer with an Intel Core i5-9600K CPU and an NVIDIA GeForce RTX 3060 Ti GPU.
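To make the loss concrete, the sketch below evaluates a sigmoid followed by binary cross-entropy for a single sample with several labels, in plain Python. It assumes the loss is averaged over the labels, which the text does not specify, and the function names are hypothetical. It uses the standard numerically stable rewriting of the loss in terms of logits.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over the labels of one sample.

    Each term equals -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z),
    computed in the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


targets = [1, 0, 1, 1, 0, 1, 1]                      # pass/fail labels for seven tests
good_logits = [4.0, -4.0, 4.0, 4.0, -4.0, 4.0, 4.0]  # confident and correct
bad_logits = [-z for z in good_logits]               # confident and wrong
```

Because each label gets its own sigmoid, the seven output probabilities are independent and do not have to sum to one.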

II-C Handling Varying Input Size

One of the great advantages of Transformer models is their ability to process inputs of no fixed size in parallel [29]. However, as seen in Figure 1, the output of the encoder is flattened and fed into a fully connected layer, which cannot handle a varying input size, so the input size must be fixed. Our solution was simple: since the fully connected layer can only take a fixed-length input, we collapsed the only dimension that varies, the sequence length dimension, by averaging the encoder output across it. The drawback is that some information is lost in the averaging; how this affects model performance is discussed later.
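A minimal sketch of the averaging step, assuming the encoder output for one sequence is a list of per-token vectors of equal length (the function name `mean_pool` is hypothetical):

```python
def mean_pool(encoder_output):
    """Average per-token vectors over the sequence dimension.

    encoder_output: list of seq_len vectors, each a list of embed_dim floats.
    Returns one vector of embed_dim floats, whatever seq_len is.
    """
    seq_len = len(encoder_output)
    embed_dim = len(encoder_output[0])
    return [
        sum(token[j] for token in encoder_output) / seq_len
        for j in range(embed_dim)
    ]


# 32 tokens and 128 tokens both collapse to a single 4-dimensional vector here.
short_output = [[1.0, 2.0, 3.0, 4.0]] * 32
long_output = [[1.0, 2.0, 3.0, 4.0]] * 128
pooled_short = mean_pool(short_output)
pooled_long = mean_pool(long_output)
```

Flattening would instead produce seq_len times embed_dim values, which is why it ties the fully connected layer to one input size.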

II-D Experimental Procedure

We varied five hyper-parameters to find the best architecture: the number of encoder layers in the Transformer, the number of heads in the multi-headed attention, the dimensionality (size) of the embeddings, the input size, and whether or not an averaging layer was used. To collect the data points for our analysis and to converge on the best possible model through hyper-parameter optimization, we used the following procedure:

1. For each encoder layer number/embedding size/attention head number:
   (a) Train a model for each input size (512, 1024, 2048) and averaging type (averaging or non-averaging) with the corresponding architecture, using defaults of 3 encoder layers, 8 attention heads, and an embedding size of 240. (When one hyper-parameter is being varied, the other two stay at their defaults; for example, when testing the number of attention heads, every model has 3 encoder layers and an embedding size of 240.)
   (b) Record all aggregate metrics.
2. Determine the optimal architecture, input size, and averaging type.
3. Train the model with the previously found optimal architecture and compare against LSTM for validation, recording all the aggregate metrics including time.
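Step 1 of the procedure amounts to a one-at-a-time sweep, which can be enumerated as below. The defaults, input sizes, and averaging types come from the text; the candidate values in `CANDIDATES` are hypothetical placeholders (the ranges actually swept are in the repository data referenced in Appendix -A), and the function name is an assumption.

```python
from itertools import product

# Defaults, input sizes, and averaging types as stated in the procedure.
DEFAULTS = {"encoder_layers": 3, "attention_heads": 8, "embedding_size": 240}
INPUT_SIZES = (512, 1024, 2048)
AVERAGING = (True, False)

# Hypothetical candidate values, for illustration only.
CANDIDATES = {
    "encoder_layers": [1, 2, 3, 4],
    "attention_heads": [1, 2, 4, 8],
    "embedding_size": [192, 240, 288],
}


def sweep_configs(candidates=CANDIDATES):
    """List every training run: vary one hyper-parameter, keep the others at default."""
    configs = []
    for name, values in candidates.items():
        for value, input_size, averaging in product(values, INPUT_SIZES, AVERAGING):
            config = dict(DEFAULTS)
            config[name] = value
            config.update(input_size=input_size, averaging=averaging, swept=name)
            configs.append(config)
    return configs


configs = sweep_configs()
```

With these placeholder ranges the sweep contains (4 + 4 + 3) x 3 x 2 = 66 training runs, which shows how quickly the cost grows with each added candidate value.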

III Results

Our goal in this study is to create the best possible model that could be used as an alternative to the NIST STS in classifications of random numbers. To do this, we had to first converge on the best possible model utilizing the transformer architecture and then validate it using previously accepted techniques, namely LSTMs. The concrete values of the hyper-parameter optimization are too many to include in tabular form here so they may be found in supplemental form in the Appendix -A. We visualize and discuss the conclusions of the raw data in the Discussion and Analysis portion of this article. We found the optimal model for the task of classifying random numbers to be: one encoder layer, single-headed attention, 192 embedding size with the averaging layer.

The next step was to validate this model with previously accepted techniques of classifying random numbers which are LSTMs and of course, the STS itself. Below are the tables providing the raw metrics of LSTM and Transformer inference performance and time as well as comparisons to the STS.

TABLE I: Performance metrics for input size 512

Technique	Inference Time (s)	Micro F1	Macro F1
LSTM	3.046	0.931	0.932
Transformer	0.965	0.931	0.934
STS	3.82	-	-

TABLE II: Performance metrics for input size 1024

Technique	Inference Time (s)	Micro F1	Macro F1
LSTM	5.172	0.967	0.970
Transformer	1.071	0.961	0.962
STS	4.63	-	-

TABLE III: Performance metrics for input size 2048

Technique	Inference Time (s)	Micro F1	Macro F1
LSTM	8.991	0.965	0.964
Transformer	1.201	0.960	0.964
STS	5.73	-	-

IV Discussion and Analysis

We set out with the goal of creating the best possible model that can be used as an alternative to the statistical test suites. To find the model with the best hyper-parameters, we ran tests with different hyper-parameter settings to converge on the best possible model that we can find with our current architecture. The hyper-parameters that we tested for were the number of encoder layers, the length of the embedding dimension, and the number of attention heads in the multi-headed attention section of the encoder. For brevity’s sake, we are only showing the Macro F1 scores and omitting the other aggregate measures. To validate our model, we also compared it to previous methods of classifying the type of randomness and quantifying the degree of randomness, namely NIST’s STS and LSTMs. Here we take the raw data presented above and build visualizations that contextualize the data in our study.

IV-A Comparative Analysis

As outlined by Nagy et al. and Li et al., traditional RNN-based methods for classifying random numbers have used LSTMs. To compare Transformer performance with LSTM performance, we used the LSTM architecture outlined by Nagy et al. with the addition of embeddings and positional encoding, as that improved performance. As seen above, the Transformer-based approach achieves similar levels of performance to the LSTM, with the LSTM being marginally better at times. This marginal benefit is not justified by the time needed to both train and run the LSTM, which is discussed later. Overall, their performances are similar enough to be deemed comparable and interchangeable. Furthermore, input size made little difference to the relative performance of the two models: both score about 0.93 at 512 bits and between 0.96 and 0.97 at 1024 and 2048 bits (Tables I to III). This can have two main explanations. First, it is likely that with much larger input sizes (perhaps an order of magnitude greater), the weakness of LSTMs in handling long inputs would become more evident. This requires more exploration, however, and such lengthy inputs were not tested because our testing machine was unable to handle them. Second, the task of detecting and quantifying randomness in binary sequences may not require the long-term sequential dependencies that Transformers are better suited than LSTMs to handle.

As discussed further on, we performed a hyper-parameter optimization, exploring a range of values for each hyper-parameter to find trends in model performance, and converged on a model with 1 encoder layer, single-headed attention, and an embedding dimension of 192. To further explore the usability of this model, we compared the time taken to run through our test dataset for both the NIST STS and our model. Since our model can run on the GPU and the Transformer architecture is parallelizable, it runs much faster than the NIST STS, which runs on the CPU [30]. Our model processes the same amount of data as the STS in roughly a quarter of the time (between 21% and 25% across the three input sizes in Tables I to III). As Tables I to III also show, the LSTM is considerably slower because it is not as parallelizable as the Transformer model. As input size grows, the LSTM takes considerably more time and becomes even slower than the STS at 1024 and 2048 bits, while the Transformer's time grows at a much slower rate and never comes close to the STS time. The usability of the LSTM drops dramatically as input size increases, and the LSTM should not be considered as a replacement for the STS given the lack of any consistent efficiency gain.
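The timing comparison can be checked directly against Tables I to III. The short script below divides each model's inference time by the STS time for the same input size; the times are copied from the tables, and the names `INFERENCE_SECONDS` and `time_relative_to_sts` are hypothetical.

```python
# Inference times in seconds, copied from Tables I to III.
INFERENCE_SECONDS = {
    512: {"LSTM": 3.046, "Transformer": 0.965, "STS": 3.82},
    1024: {"LSTM": 5.172, "Transformer": 1.071, "STS": 4.63},
    2048: {"LSTM": 8.991, "Transformer": 1.201, "STS": 5.73},
}


def time_relative_to_sts(times=INFERENCE_SECONDS):
    """Return each model's inference time as a fraction of the STS time."""
    ratios = {}
    for input_size, row in times.items():
        sts_seconds = row["STS"]
        ratios[input_size] = {
            name: seconds / sts_seconds
            for name, seconds in row.items()
            if name != "STS"
        }
    return ratios


RATIOS = time_relative_to_sts()
```

The Transformer ratios come out at about 0.25, 0.23, and 0.21 for 512, 1024, and 2048 bits, while the LSTM ratios are about 0.80, 1.12, and 1.57, so the LSTM is slower than the STS at the two larger input sizes.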

IV-B Encoder Layers

As Huang et al. show, more Transformer encoder layers should improve model accuracy and, in our case, F1 score [31]. While we did observe this to be true, the benefits were marginal and came with important caveats. Namely, the effects were only beneficial up to a point: in our case, once the number of encoder layers exceeded three, the model stopped learning and failed to converge, as seen in Figure 5. One possible conclusion is that the model became too large for the input and added too much noise for any useful information to be perceptible in later layers. However, it may also be that the amount of data we had was not sufficient to train a model of that depth, regardless of whether we averaged across the outputs or not. From this graph, we can say that one encoder layer is sufficient for this task and that additional encoder layers would only increase training and inference time without providing any tangible performance benefit.

IV-C Embedding Size

The length of the embedding of the tokenized input string is crucial to the performance of the model, but again there is not much of a difference until the embedding size reaches around 375 (250 for the non-averaging version) [32]. At that point, the embedding size becomes too large and no model is able to converge to an acceptable F1 score (above 0.7). Input size plays a marginal role in performance, with the model performing slightly better on longer inputs (1024 and 2048 bit sequences). Interestingly, the 512 bit input is more resilient, in that it converges for a greater range of embedding sizes than the other input sizes. Without the averaging layer, the F1 score ceiling is higher, with the 1024 bit input reaching F1 scores above 0.95. This difference is again only marginal, as the top F1 scores of the models with the averaging layer also exceed 0.95, though by a smaller amount. The main observations are that the averaging layer allows the models to converge for larger embedding sizes and that the smaller the input size, the larger the embedding size for which the model converges. Once more, there is no tangible performance increase as the embedding size increases, so an embedding size of 192 (the smallest we tested) is sufficient for the task of classifying random numbers. Since the non-averaging models and the models with larger input sizes stop converging at smaller embedding sizes, it is likely that they require more data to train, but more exploration on this matter is needed.

IV-D Number of Attention Heads

The number of attention heads in the multi-headed attention layer is important to finding the most optimized model, as it can readily affect the performance of a model [33]. Therefore, we also experimented with the number of attention heads. As seen in Figure 7, the number of heads did not play a large role in the performance of the model, with more attention heads being only marginally better while adding considerable overhead. As a common trend, the averaging layer lowered overall Macro F1, yet it was more stable: without it, the 2048 bit model failed to converge beyond 20 attention heads. Inputs of 32, 64, and 128 tokens are relatively small and do not require large models, which is likely why more attention heads did not improve performance. What was unexpected, however, was that without the averaging layer the 2048 bit model failed to converge at larger numbers of attention heads. At this stage, a greater restructuring of the model architecture or a larger dataset may be required. From this graph, we conclude that single-headed attention is sufficient for a model that averages across the output of the encoder and that two attention heads are sufficient for a non-averaging model. As before, there is no tangible performance gain, so more attention heads would only lead to longer training and inference times. Overall, the averaging layer provides stability and resilience to over-training, making it preferable to the non-averaging layer despite being marginally worse in terms of performance. This performance deficit is made up, however, by the fact that the averaging layer allows the model to handle varying input sizes, thus allowing it to be trained on all three datasets at once. Combined with its added resilience to over-training, the performance of the model with the averaging layer actually ended up being better than without.

V Conclusion

In conclusion, deep learning models that utilize the Transformer architecture are adequate alternatives to the tests of the NIST STS and serve as a faster and more scalable option: our highest performing model can classify over 20000 numbers in roughly a quarter of the time it takes the NIST STS to do so (Tables I to III), and perhaps even faster with a more powerful GPU. With the use of the averaging layer, our model can handle inputs of varying size from 512 bits to 2048 bits, showing versatility in its capabilities. Our work also invites further exploration: extending the model to encode the rest of the statistical tests in the NIST STS would further expand its applicability. Overall, our model serves as a promising potential replacement for the NIST STS given its time efficiency and its performance, which is comparable to more traditional deep learning methods such as LSTMs.

References

[1] H. Corrigan-Gibbs, W. Mu, D. Boneh, and B. Ford, “Ensuring high-quality randomness in cryptographic key generation,” in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, 2013, pp. 685–696.
[2] A. Shen, “Randomness tests: theory and practice,” in Fields of Logic and Computation III: Essays Dedicated to Yuri Gurevich on the Occasion of His 80th Birthday. Springer, 2020, pp. 258–290.
[3] R. Gennaro, “Randomness in cryptography,” IEEE security & privacy, vol. 4, no. 2, pp. 64–67, 2006.
[4] Y. Dodis, S. J. Ong, M. Prabhakaran, and A. Sahai, “On the (im) possibility of cryptography with imperfect randomness,” in 45th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 2004, pp. 196–205.
[5] J. Natal, I. Ávila, V. B. Tsukahara, M. Pinheiro, and C. D. Maciel, “Entropy: From thermodynamics to information processing,” Entropy, vol. 23, no. 10, p. 1340, 2021.
[6] M. N. Bera, A. Acín, M. Kuś, M. W. Mitchell, and M. Lewenstein, “Randomness in quantum mechanics: philosophy, physics and technology,” Reports on Progress in Physics, vol. 80, no. 12, p. 124001, 2017.
[7] C. Li, J. Zhang, L. Sang, L. Gong, L. Wang, A. Wang, and Y. Wang, “Deep learning-based security verification for a random number generator using white chaos,” Entropy, vol. 22, no. 10, p. 1134, 2020.
[8] S. Arman, T. Rehnuma, and M. Rahman, “Design and implementation of a modified aes cryptography with fast key generation technique,” in 2020 IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE). IEEE, 2020, pp. 191–195.
[9] Y. Feng and L. Hao, “Testing randomness using artificial neural network,” IEEE Access, vol. 8, pp. 163685–163693, 2020.
[10] I. Nagy and A. Suciu, “Randomness testing with neural networks,” in 2021 IEEE 17th international conference on intelligent computer communication and processing (ICCP). IEEE, 2021, pp. 431–436.
[11] S. Sokorac, “Optimizing random test constraints using machine learning algorithms,” in Proceedings of the design and verification conference and exhibition US (DVCon), 2017.
[12] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[14] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan, “Self-attention with functional time representation learning,” Advances in neural information processing systems, vol. 32, 2019.
[15] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM computing surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022.
[16] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” AI open, vol. 3, pp. 111–132, 2022.
[17] N. Patwardhan, S. Marrone, and C. Sansone, “Transformers in the real world: A survey on nlp applications,” Information, vol. 14, no. 4, p. 242, 2023.
[18] N. Geneva and N. Zabaras, “Transformers for modeling physical systems,” Neural Networks, vol. 146, pp. 272–289, 2022.
[19] A. M. Bran and P. Schwaller, “Transformers and large language models for chemistry and drug discovery,” arXiv preprint arXiv:2310.06083, 2023.
[20] Z. Li, B. Feng, L. Cui, H. Wang, Y. Bian, G. Piao, and X. Zhou, “Quantify randomness of quantum random number with transformer network,” in 2023 3rd International Conference on Intelligent Power and Systems (ICIPS), 2023, pp. 17–22.
[21] A. Zeyer, P. Bahar, K. Irie, R. Schlüter, and H. Ney, “A comparison of transformer and lstm encoder decoder models for asr,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 8–15.
[22] R. G. Brown, D. Eddelbuettel, and D. Bauer, “Dieharder,” Duke University Physics Department Durham, NC, pp. 27 708–0305, 2018.
[23] G. Marsaglia, “The marsaglia random number cdrom including the diehard battery of tests of randomness,” http://www.stat.fsu.edu/pub/diehard/, 2008.
[24] A. Rukhin, J. Soto, J. Nechvatal, M. Smid, E. Barker, S. Leigh, M. Levenson, M. Vangel, D. Banks, A. Heckert et al., A statistical test suite for random and pseudorandom number generators for cryptographic applications. US Department of Commerce, Technology Administration, National Institute of …, 2001, vol. 22.
[25] B. K. Alese, E. Philemon, and S. O. Falaki, “Comparative analysis of public-key encryption schemes,” International Journal of Engineering and Technology, vol. 2, no. 9, pp. 1552–1568, 2012.
[26] G. Bénédict, V. Koops, D. Odijk, and M. de Rijke, “Sigmoidf1: A smooth f1 score surrogate loss for multilabel classification,” arXiv preprint arXiv:2108.10566, 2021.
[27] R. Yacouby and D. Axman, “Probabilistic extension of precision, recall, and f1 score for more thorough evaluation of classification models,” in Proceedings of the first workshop on evaluation and comparison of NLP systems, 2020, pp. 79–91.
[28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[29] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “Vsa: Learning varied-size window attention in vision transformers,” in European conference on computer vision. Springer, 2022, pp. 466–483.
[30] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,” Proceedings of Machine Learning and Systems, vol. 5, 2023.
[31] X. S. Huang, F. Perez, J. Ba, and M. Volkovs, “Improving transformer optimization through better initialization,” in International Conference on Machine Learning. PMLR, 2020, pp. 4475–4483.
[32] Z. Huang, D. Liang, P. Xu, and B. Xiang, “Improve transformer models with better relative position embeddings,” arXiv preprint arXiv:2009.13658, 2020.
[33] S. Merity, “Single headed attention rnn: Stop thinking with your head,” arXiv preprint arXiv:1911.11423, 2019.

-A Concrete data values

All concrete data values used in the visualizations can be found on GitHub: https://github.com/bloodpool7/RandomTransformer

-B F1 Scores

The following equations define the F1 metrics used to validate our model.

Micro F1 Score:

\frac{2\times\text{Micro Precision}\times\text{Micro Recall}}{\text{Micro Precision}+\text{Micro Recall}}

(1)

Macro F1 Score:

\frac{1}{N}\sum_{i=1}^{N}\frac{2\times\text{Precision}_{i}\times\text{Recall}_{i}}{\text{Precision}_{i}+\text{Recall}_{i}}

(2)

Weighted F1 Score:

\frac{1}{\text{Total Support}}\sum_{i=1}^{N}(\text{Support}_{i}\times\text{F1}_{i})

(3)

Sample F1 Score:

\frac{2\times\text{True Positives}}{2\times\text{True Positives}+\text{False Positives}+\text{False Negatives}}

(4)
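As an illustration of how equations (1) to (4) apply to multi-label predictions, the sketch below computes all four aggregates from binary label matrices in plain Python. It rests on two assumptions not stated in the text: that the Sample F1 of equation (4) is evaluated per sequence and then averaged over sequences, and that an undefined F1 (no positives at all) is scored as 0. The function names are hypothetical.

```python
def _f1(tp, fp, fn):
    """F1 from counts; equals 2PR/(P+R). Assumed to be 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Micro, macro, weighted, and sample F1 for lists of binary label vectors."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    sample_scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        tp = fp = fn = 0
        for i, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[i][0] += 1
                tp += 1
            elif p:
                counts[i][1] += 1
                fp += 1
            elif t:
                counts[i][2] += 1
                fn += 1
        sample_scores.append(_f1(tp, fp, fn))  # equation (4) for one sequence
    per_label = [_f1(*c) for c in counts]
    support = [c[0] + c[2] for c in counts]    # number of true positives per label
    total_support = sum(support)
    return {
        "micro": _f1(*(sum(c[k] for c in counts) for k in range(3))),
        "macro": sum(per_label) / n_labels,
        "weighted": (sum(s * f for s, f in zip(support, per_label)) / total_support
                     if total_support else 0.0),
        "samples": sum(sample_scores) / len(sample_scores),
    }


scores = multilabel_f1([[1, 0, 1], [0, 1, 1]], [[1, 0, 0], [0, 1, 1]])
```

In this small example one positive label is missed, so Micro F1 is 6/7, Macro F1 is 8/9, and both Weighted and Sample F1 are 5/6.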