In this study, the data obtained from the smartphone were evaluated with three different classification approaches. Tests were performed with tenfold cross-validation on a computer with an Intel Core i5-7400 3.0 GHz processor running the Windows 10 operating system, using the Java programming language in Apache NetBeans version 11.2. WEKA toolkit version 3.8.5 was used for feature selection and classification.
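For illustration, tenfold cross-validation partitions the dataset into ten folds of near-equal size, each serving once as the test set. The sketch below shows one way such fold assignments can be generated; it is a minimal illustration of the protocol, not the WEKA routine used in the study, and the shuffle seed is arbitrary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TenFoldCV {
    // Assign each of n instances to one of k folds of near-equal size.
    public static int[] foldAssignments(int n, int k, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed)); // randomize order before splitting
        int[] fold = new int[n];
        for (int i = 0; i < n; i++) {
            // round-robin keeps fold sizes within one instance of each other
            fold[idx.get(i)] = i % k;
        }
        return fold;
    }
}
```

With the 2626 patterns mentioned later in the text, this yields folds of 262 or 263 instances each.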
Various heuristic algorithms, such as best-first search, genetic algorithms, greedy search, and particle swarm optimization, are used in filter-based feature selection approaches [28]. However, optimization algorithms (e.g., genetic algorithms, particle swarm optimization) have disadvantages such as parameter selection problems, convergence problems, unbalanced distribution, problem dependency, and high computational cost. In this study, after applying CFS to the dataset, the attributes were ranked by score. Experiments were then carried out with the best N-element subset approach, which produced successful results in Sağbaş et al. [32] and Şen et al. [43]. The flow chart of this approach is given in Fig. 5.
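The best N-element subset search of Fig. 5 amounts to evaluating ever-larger prefixes of the CFS ranking and keeping the subset size that scores best. A minimal sketch is given below; `evaluate` is a hypothetical placeholder for training and cross-validating a classifier on the selected features, not part of the study's code.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

public class BestNSubset {
    /**
     * Features are assumed pre-ranked (best first), e.g., by CFS score.
     * Returns the subset size N (1..ranked.length) whose evaluation score is highest.
     */
    public static int bestSubsetSize(int[] rankedFeatures, ToDoubleFunction<int[]> evaluate) {
        double bestScore = Double.NEGATIVE_INFINITY;
        int bestN = 0;
        for (int n = 1; n <= rankedFeatures.length; n++) {
            int[] subset = Arrays.copyOfRange(rankedFeatures, 0, n); // top-N ranked features
            double score = evaluate.applyAsDouble(subset); // e.g., CV accuracy of a classifier
            if (score > bestScore) {
                bestScore = score;
                bestN = n;
            }
        }
        return bestN;
    }
}
```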
As shown in Fig. 5, the features are sorted according to the score values obtained from the CFS. Afterward, the best features were added one by one and the experiments were repeated. An examination of related studies on authentication shows that random forest [9], kNN [53], support vector machines [2, 19, 36], Bayesian networks [13], artificial neural networks [30], and various deep learning approaches [1, 3, 57] have been used. These are well-known and frequently used classification models. In this study, each feature subset was tested with the random forest (RF), k-nearest neighbor (kNN), and simple logistic regression (SLR) methods, and their performances were compared. Based on preliminary experiments, the k value in the kNN method was set to 1, and LinearNNSearch was used as the nearest-neighbor search algorithm. In the RF method, the number of leaves was set to 200 and the number of trees to 100. In SLR, the default values of heuristicStop (50) and maxBoostingIteration (500) were used. The change in the accuracy rates obtained as a result of the experiments is presented in Fig. 6.
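For intuition, the kNN configuration (k = 1 with a linear nearest-neighbor search) behaves like the sketch below, which compares each query against every training pattern. The study itself used WEKA's implementation, so this plain-Java version is illustrative only.

```java
public class LinearOneNN {
    private final double[][] train;
    private final int[] labels;

    public LinearOneNN(double[][] train, int[] labels) {
        this.train = train;
        this.labels = labels;
    }

    // Predict by scanning every training pattern (linear search), k = 1.
    public int predict(double[] query) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < train.length; i++) {
            double d = 0;
            for (int j = 0; j < query.length; j++) {
                double diff = train[i][j] - query[j];
                d += diff * diff; // squared Euclidean distance preserves the nearest neighbor
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return labels[best];
    }
}
```

Because every query is compared against all stored patterns, the classification cost of this approach grows with the data, which is consistent with the test-time behavior discussed below.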
When the change in classification accuracy is examined, a significant improvement is noticeable after the first 5 features. After the feature subset reaches 35 elements, the accuracy is remarkably close to the best results achieved. The highest classification accuracy, 92.9551%, was obtained with the SLR classifier using a 92-element feature subset. The kNN classifier achieved an accuracy of 89.604% with a 73-element subset, while the RF classifier reached 90.3656% with a 105-element subset. Test times for the experiments presented in Fig. 6 can be found in Fig. 7.
Upon examination of Fig. 7, it becomes evident that the test time of the kNN method, which does not build a model but classifies samples directly, increases linearly. Conversely, the test times of the SLR and RF methods remain nearly constant from the beginning to the end. It is important to note that the times presented in the chart represent the duration required to test the entire dataset of 2626 patterns. The performance metrics for the best results achieved in the conducted tests are provided in Table 2.
Upon examination of the performance measurements, it is evident that the most successful method is SLR. This method achieved a classification accuracy of approximately 93% using 92 features, reducing the feature set by 26%. Individual values for TPR, FPR, precision, and f-score were calculated. The lowest TPR observed was 0.621, while the average TPR was 0.930. The highest FPR value was 0.006, while the average FPR was 0.001. The mean precision and f-score values were 0.954 and 0.929, respectively. Considering test times, kNN was the slowest method, taking 30 ms to classify a pattern, followed by RF with 0.13 ms. In contrast, the SLR method classified a pattern in 0.03 ms. Precision values by participant are presented in Fig. 8.
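The reported metrics follow the standard one-vs-rest confusion-matrix definitions. The sketch below shows how TPR, FPR, precision, and f-score can be computed for a single class; the matrix layout is an illustrative assumption, not the study's data.

```java
public class ClassMetrics {
    // confusion[i][j] = number of patterns of true class i predicted as class j
    // Returns {TPR, FPR, precision, f-score} for class c (one-vs-rest).
    public static double[] metricsForClass(int[][] confusion, int c) {
        int n = confusion.length;
        double tp = confusion[c][c], fn = 0, fp = 0, tn = 0;
        for (int j = 0; j < n; j++) if (j != c) fn += confusion[c][j]; // missed class-c patterns
        for (int i = 0; i < n; i++) if (i != c) fp += confusion[i][c]; // wrongly assigned to c
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != c && j != c) tn += confusion[i][j];           // everything else
        double tpr = tp / (tp + fn);       // true positive rate (recall)
        double fpr = fp / (fp + tn);       // false positive rate
        double precision = tp / (tp + fp);
        double f = 2 * precision * tpr / (precision + tpr); // harmonic mean of precision and TPR
        return new double[]{tpr, fpr, precision, f};
    }
}
```

Averaging these per-class values over all participants yields macro-averaged figures of the kind reported in Table 2.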
When the results are analyzed by participant, it is seen that only a limited number of participants have a precision below 0.85. The performance measurements of participants 4, 24, 25, 26, 36, 39, 48, and 53 were lower than those of the other participants. However, 100% success was achieved in identifying participants 7, 9, 13, 14, 15, 18, 20, 22, 23, 27, 28, 29, 32, 34, 35, 40, 46, 49, 51, 55, 56, 57, and 60.
A detailed comparison of related authentication studies is presented in Table 3, which lists the data types, evaluation metrics, reported evaluation values, and machine learning methods of each study. However, a direct comparison with this study is not possible because the types of data used and the approaches to identifying individuals differ. For authentication, Srikar et al. [41] used sound signals, Lu et al. [24] lip-reading, Wang et al. [47] facial recognition, and Qin et al. [29] biometric gait information. In addition, various studies employed screen-touch and keystroke dynamics: Feng et al. [13], Zhao et al. [55], Tse and Hung [45], Yang et al. [52], Ramadan et al. [30], Lu and Liu [23], Acien et al. [3], and Xu et al. [51] examined tactile approaches. Shen et al. [36], Abuhamad et al. [1], Incel et al. [19], Acien et al. [2], and Yuksel et al. [53] benefited from motion sensors. Regarding evaluation metrics, false acceptance rate, false rejection rate, classification accuracy, average error rate, true acceptance rate, and f-score were used. When the studies are filtered by those that use accuracy as an evaluation metric, the studies suitable for comparison are Ramadan et al. [30], Tse and Hung [45], Yuksel et al. [53], Yang et al. [52], Acien et al. [2], Lu et al. [24], Wang et al. [47], and Zhu et al. [57]. The average accuracy rate of these eight studies is 92.47%. Still, it is worth noting again that the types of data used in these studies differ from each other.