Research Article
Corresponding author: Raju Kamaraj (kamarajr@srmist.edu.in). Academic editor: Milen Dimitrov
© 2024 Navyaja Kota, Raju Kamaraj, S. Murugaanandam, Mohan Bharathi, T. Sudheer Kumar.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Kota N, Kamaraj R, Murugaanandam S, Bharathi M, Kumar TS (2024) A data-driven approach utilizing a raw material database and machine learning tools to predict the disintegration time of orally fast-disintegrating tablet formulations. Pharmacia 71: 1-12. https://doi.org/10.3897/pharmacia.71.e122507
Orally fast-disintegrating tablets (OFDTs) have seen a significant increase in popularity over the past decade, becoming a rapidly expanding sector in the pharmaceutical market. The aim of the current study is to use machine learning (ML) methods to predict the disintegration time (DT) of OFDTs. In this study, we developed seven ML models using the TPOT AutoML platform to predict the DT of OFDTs: the decision tree regressor (DTR), gradient boosting regressor (GBR), random forest regressor (RFR), extra trees regressor (ETR), least absolute shrinkage and selection operator (LASSO), support vector machine (SVM), and deep learning (DL). The results indicate that ML methods are effective in predicting the DT, especially the ETR. However, after fine-tuning the deep neural network with a 10-fold cross-validation scheme, the DL model showed superior performance, with an NRMSE of 6.2% and an R2 of 0.79. The key factors influencing the DT of OFDTs were identified using the SHAP method.
Deep learning, data sets, machine learning, OFDTs, SHAP
Despite significant advancements in drug delivery methods, oral administration remains an ideal approach for administering therapeutic agents due to its precise dosing, cost-effectiveness, suitability for self-medication, non-invasive nature, and convenience, which together lead to a notable level of patient adherence. Tablets are the most widely used dosage form, but a key limitation of these formulations is “dysphagia,” or difficulty in swallowing, which affects over 50% of individuals. Consequently, individuals may fail to adhere to their prescribed drugs, leading to a high occurrence of noncompliance and ineffective treatment (
An established literature-based data model was selected for the development and validation process. The data were refined to include only verified records, focusing on features of OFDTs such as tablet hardness, thickness, friability, and punch size. To expand our database, we conducted a literature review using the Scopus, Web of Science, and Google Scholar databases. A keyword search strategy was employed, including terms like “oral disintegrating,” “fast disintegrating,” “rapidly disintegrating,” and “oral dispersible.” To be included, a formulation had to specify the total quantity of all excipients, along with tablet quality attributes such as hardness, thickness, friability, punch size, and disintegration time (
A total of 248 articles were retrieved through a database search. Out of these, only 185 research articles were selected for data extraction. Upon further manual search, 93 articles did not meet the inclusion criteria and were excluded from the study. After thorough sorting, 92 articles yielded a total of 1076 formulations. The formulation data included the name of the active pharmaceutical ingredient (API), other excipients, and process details, all of which were documented in the dataset. The final dataset consisted of the following parameters for each formulation: API name, dose, excipient name, dose (each excipient displayed in a separate column), hardness, friability, thickness, punch size, and DT. This information was then used for modeling using ML techniques (
According to the European Pharmacopoeia 10th edition, orodispersible tablets should disintegrate within 3 minutes. Therefore, any records from the database exceeding 180 seconds were excluded from further analysis. A correlation study was conducted to investigate the relationship between the dependent variable (DT) and various independent factors such as the API, process parameters, and composition (
After data collection, the data must be processed before building predictive models to ensure their robustness and effectiveness. Several commonly employed methods, such as data cleansing, dimensionality reduction, imbalanced-data solutions, and data splitting strategies, are necessary for data analysis. Data cleaning identifies missing observations and replaces them with median or mean values. However, imputing missing values has limitations, as a reduction in effective data size may impact model accuracy. Dimensionality reduction eliminates the least significant features in the dataset, reducing overfitting and simplifying the model’s complexity. Various approaches, such as principal component analysis (PCA), high-correlation filtering, and random forest feature selection, are commonly used in data processing. Imbalanced-data solutions address the uneven distribution of classes in the database, as training a prediction model on an unbalanced dataset can lead to poor performance. Data splitting is another crucial step, in which the entire dataset is randomized and divided into three subsets: training, validation, and testing. The training set is used to train the models; the validation set is used for tuning hyperparameters and preventing overfitting; and the testing set is used to assess predictive performance on unseen data. The recommended ratio is 70% for training, 20% for validation, and 10% for testing, though the ratios may vary depending on the data size, as sketched below. Therefore, data preprocessing and splitting strategies are essential steps before undertaking the modeling task.
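A minimal sketch of this preprocessing and 70/20/10 split, assuming the curated dataset is loaded from a hypothetical CSV file with the disintegration time in a column named “DT” (all file and column names here are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the curated dataset (hypothetical file name) and impute missing
# numeric values with the column medians, as described above.
df = pd.read_csv("curated_ofdt_dataset.csv")
df = df.fillna(df.median(numeric_only=True))

X = df.drop(columns=["DT"])   # formulation and process features
y = df["DT"]                  # disintegration time [s]

# Split off 30% of the records, then divide that portion into
# validation (20% of the total) and test (10% of the total) subsets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=42)
```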
ML modeling tasks involve various techniques such as classification, regression trees, neural networks, and potentially many other algorithms. These models are trained using prepared databases, and their performance is evaluated using an error metric. Keeping track of different modeling methods and exploring various features can be challenging and computationally expensive. Therefore, AutoML (Automated Machine Learning) is utilized. AutoML approaches often use ensemble learning strategies, which combine several model types to produce predictions that are more reliable. In this case, TPOT AutoML employed the K-fold cross-validation technique to generate a definitive production model by selecting features based on a predefined threshold. Each fold consists of a distinct training-testing pair and a validation set, with 568 records randomly selected for training, 244 records for validation, and 348 records for testing.
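A minimal sketch of such a TPOT run (using the classic TPOTRegressor API; the search-budget settings are illustrative, not the study’s values):

```python
from tpot import TPOTRegressor

# Evolve candidate pipelines with K-fold cross-validation on the
# training set; the best pipeline can be exported as a Python script.
tpot = TPOTRegressor(
    generations=10,                      # illustrative search budget
    population_size=50,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
    verbosity=2,
)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")
```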
After completing the ML modeling process, it is necessary to evaluate the predictive performance of the models to understand how well they generalize to new, unseen data. ML models are often prone to overfitting, which occurs when the model learns not only the underlying patterns in the training data but also the noise and random fluctuations that come with it. To prevent overfitting and ensure model stability, a K-fold approach was used, in which a small portion of the data is held out at each fold during model evaluation. In this research, models were trained and validated using five-fold cross-validation, followed by feature selection using a Python script. The training and validation procedures were repeated five times to thoroughly cover the input database and obtain the optimal model. After selecting the final input feature vector, the model was trained using a 10-fold cross-validation procedure. Root mean square error (RMSE), normalized root mean square error (NRMSE), coefficient of determination (R2), mean absolute error (MAE), and mean square error (MSE) were used to measure the robustness of the models. Seven algorithms from the TPOT AutoML platform were utilized for feature selection and final model development: DTR, GBR, RFR, ETR, LASSO, SVM, and DL.
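The metric definitions referenced below, reconstructed in their standard form (consistent with the variable legend that follows):

```latex
\mathrm{RMSE}  = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(obs_i - pred_i)^2}
\qquad
\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{obs_{max} - obs_{min}} \times 100\%

\mathrm{MAE} = \tfrac{1}{n}\sum_{i=1}^{n}\lvert obs_i - pred_i\rvert
\qquad
\mathrm{MSE} = \tfrac{1}{n}\sum_{i=1}^{n}(obs_i - pred_i)^2

R^2 = 1 - \frac{SS_{res}}{SS_{tot}},
\quad
SS_{res} = \sum_{i=1}^{n}(obs_i - pred_i)^2,
\quad
SS_{tot} = \sum_{i=1}^{n}(obs_i - \overline{obs})^2
```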
The variables in these equations are as follows: obs_i and pred_i represent the observed (experimental) and predicted values, respectively; i is the data record number; n is the total number of records; obs_max is the highest observed value; obs_min is the lowest observed value; R2 is the coefficient of determination; SS_res is the sum of squares of the residual errors; SS_tot is the total sum of squares; and obs̄ is the arithmetic mean of the observed values (
The accuracy of ML results cannot be improved simply by fitting data into models. As data become larger and more complex, more capable techniques such as DTR, GBR, RFR, ETR, LASSO, SVM, and DL become necessary to handle them.
DTR is a versatile algorithm used for both classification and regression tasks. It operates on the concept of breaking a complex problem down into simpler, more manageable subproblems, making it an excellent choice for various applications. Decision trees (DT) have a hierarchical structure, with conditions applied from the tree’s root to its leaves, allowing a step-by-step decision-making process. One of the key strengths of DT is its transparent and interpretable structure: the rules it generates are easy to understand. Once trained on a dataset, a DT can produce logical rules that can be applied to new, unseen data by recursively dividing it into subgroups based on the conditions learned during the training phase.
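A minimal sketch of this interpretability property using scikit-learn (the depth setting is illustrative, not the study’s value):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Fit a shallow tree on the prepared training split.
dtr = DecisionTreeRegressor(max_depth=4, random_state=42)
dtr.fit(X_train, y_train)

# Print the learned tree as human-readable if/else rules -- the
# transparent rule structure described above.
print(export_text(dtr, feature_names=list(X_train.columns)))
```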
GBR generates a series of decision trees, with each tree addressing the errors of the previous one. The model is generated iteratively, with each iteration adding a new decision tree to the ensemble and focusing on the errors or residuals of the combined model from prior iterations. The loss function is a crucial component in GBR as it determines the variation between the predicted and actual values of the desired variables. The algorithm minimizes this loss function during each iteration, ensuring that the model is continually improving. MSE is a commonly used loss function in GBR, where the average squared difference between the expected and actual values is calculated. Overall, GBR is a powerful algorithm for regression tasks and is widely used in practice due to its flexibility, high predictive accuracy, and ability to handle complex relationships in data (Ghazwani et al. 2023).
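A gradient boosting sketch with squared error (MSE) as the loss function minimized at each iteration, as described above; the hyperparameters are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Each new tree is fit to the residuals of the current ensemble,
# minimizing the squared-error loss iteration by iteration.
gbr = GradientBoostingRegressor(
    loss="squared_error",
    n_estimators=300,      # number of sequential trees
    learning_rate=0.05,    # contribution of each new tree
    random_state=42,
)
gbr.fit(X_train, y_train)
```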
RFR is a machine-learning-based regression algorithm built on bagging and random subspace methods. Due to its versatility, its capability to handle uncertain data, and its suitability for high-dimensional feature spaces (with many predictors), RFR is widely respected, and in recent years it has emerged as one of the most advantageous general-purpose algorithms. It is best characterized by a “divide and conquer” approach: bootstrapping data subsets, building a decision tree on each subset, and then aggregating the results.
The RFR takes a vector of input variables x and generates an output by averaging the predictions of C decision trees, where T_i(x) denotes a regression tree built from a bootstrapped sample and a random subset of the input variables (
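Written out in standard form (a reconstruction consistent with the notation above), the ensemble prediction is the average of the individual trees:

```latex
\hat{y}(x) = \frac{1}{C}\sum_{i=1}^{C} T_i(x)
```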
ETR is an enhanced method for addressing the generalization (overfitting) concerns associated with random forest (RF). This approach is a recent advancement in the field of ML and can be viewed as an extension of the widely used RF, with the purpose of minimizing the risk of overfitting. Like RF, ETR trains each base estimator using a random subset of features; unlike RF, however, it does not search for the optimal feature and cut-point when splitting a node but instead draws split thresholds at random (
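A corresponding scikit-learn sketch (the forest size is illustrative):

```python
from sklearn.ensemble import ExtraTreesRegressor

# Extra trees draw split thresholds at random rather than searching
# for the optimal cut-point, which reduces variance and overfitting.
etr = ExtraTreesRegressor(n_estimators=500, random_state=42)
etr.fit(X_train, y_train)
```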
LASSO is a linear regression method that minimizes the sum of squared residuals plus the sum of the absolute values of the regression coefficients (an L1 penalty). The regression coefficients can be obtained using the following equation.
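Reconstructed in standard form (an assumption consistent with the legend below), the objective is:

```latex
\hat{b} = \arg\min_{b}
\left[
\frac{1}{m}\sum_{i=1}^{m}\Bigl(y^{(i)} - \sum_{j=1}^{n} b_j\, x_j^{(i)}\Bigr)^{2}
+ \lambda \sum_{j=1}^{n} \bigl\lvert b_j \bigr\rvert
\right]
```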
Here m represents the number of samples and n the number of x variables; y^(i) and x^(i) are the y value and the vector of x values of the i-th sample, respectively; b_j is the j-th regression coefficient; and λ is the hyperparameter controlling the strength of the penalty. The coefficient vector b is denoted by the following formula:
b = (b_1, b_2, ..., b_n)^T
In LASSO, a regression coefficient b_j can shrink exactly to zero, leading to the removal of the corresponding x variable. The study considered values of λ ranging over 2^-15, 2^-14, ..., 2^-2, and 2^-1 to find the value that maximizes the coefficient of determination, R2, by 5-fold cross-validation. Scikit-learn was employed to estimate the LASSO (
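A minimal scikit-learn sketch of this cross-validated grid search (scikit-learn calls the λ penalty “alpha”):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Lambda grid 2^-15 ... 2^-1, selected by 5-fold cross-validation.
alphas = 2.0 ** np.arange(-15, 0)
lasso = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)
print(lasso.alpha_)        # selected penalty strength
print(lasso.coef_ != 0)    # features whose coefficients survived
```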
SVM is among the most commonly used ML methods for classification, regression, and other tasks. SVM operates in a high- or infinite-dimensional space and constructs one or more hyperplanes. The hyperplane that maximizes the distance to the closest training data points of each class achieves the best separation, and a larger margin typically results in a lower generalization error for the classifier. SVM is effective in high-dimensional spaces and can exhibit different behaviors depending on the kernel function used; common choices include linear, polynomial, radial basis function (RBF), and sigmoid kernels. However, if the dataset contains a high level of noise, such as overlapping target classes, the performance of SVM is compromised (
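A support vector regression sketch with an RBF kernel, one of the kernel choices listed above (C and gamma are illustrative):

```python
from sklearn.svm import SVR

# RBF-kernel support vector regression on the prepared training split.
svm = SVR(kernel="rbf", C=10.0, gamma="scale")
svm.fit(X_train, y_train)
```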
DL is implemented here as a deep neural network, as shown in Fig.
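A sketch of the feed-forward architecture reported in the Results (three hidden layers of 100 tanh neurons, 56 inputs, one output); the optimizer and loss here are assumptions, not stated in the study:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 56 input features -> three tanh hidden layers of 100 neurons each
# -> one linear output neuron predicting the disintegration time [s].
model = keras.Sequential([
    layers.Input(shape=(56,)),
    layers.Dense(100, activation="tanh"),
    layers.Dense(100, activation="tanh"),
    layers.Dense(100, activation="tanh"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")   # assumed training setup
# model.fit(X_train, y_train, epochs=2200, validation_data=(X_val, y_val))
```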
As ML models are inherently black boxes, efforts have been made to shed light on their prediction techniques. In our study, illustrated in Fig.
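The Shapley value underlying the SHAP method, reconstructed in its standard form (consistent with the variable legend that follows):

```latex
\phi_j(val_x) =
\sum_{S \subseteq \{x_1,\ldots,x_p\} \setminus \{x_j\}}
\frac{|S|!\,\bigl(p - |S| - 1\bigr)!}{p!}
\Bigl( val_x\bigl(S \cup \{x_j\}\bigr) - val_x(S) \Bigr)
```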
Here “S” is a subset of the features in the model, “x” is the vector of feature values to be explained, and “p” is the number of features. The prediction “val_x(S)” is obtained by estimating the model output when only the feature values in set “S” are known (
The Shapley value calculation method adheres to the axioms of efficiency, symmetry, dummy, and additivity, thus explaining the process for developing predictions. Random samples are used to replace the values of each attribute to assess their importance and impact. Computing the Shapley value can be computationally intensive due to the numerous potential coalitions of feature values that must be considered. Coalitions are carefully chosen to reduce repetitions, leading to decreased calculation time; however, this also increases the variance of the Shapley value estimate. The k-means method was utilized to reduce the number of repetitions needed to represent each feature’s impact. The k-means algorithm was set up with 12 centroids, each corresponding to a cluster in a feature’s data domain. A comprehensive SHAP matrix can be created by grouping the data domain of each characteristic, and displaying this matrix facilitates the understanding of the model’s predictions (
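A minimal sketch of this procedure with the shap library, assuming `model` is any of the fitted regressors above:

```python
import shap

# Summarize the background data with k-means (12 centroids, as above)
# to cut the number of coalition evaluations the explainer needs.
background = shap.kmeans(X_train, 12)
explainer = shap.KernelExplainer(model.predict, background)

# Shapley values for the test records, then the SHAP summary plot.
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```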
The pre-processed database contained 92 direct compression OFDT entries (new data entries), including 28 unique APIs and 50 variables coding the composition (excipients were topologically coded). Five further variables encoded formulation attributes: thickness [mm], hardness [N], friability [%], punch size [mm], and DT [s]. Descriptive statistics (Table
Box and violin plots of selected features from the database. The boxes show the interquartile range (IQR), including the first quartile (Q1), the median (horizontal line), and the third quartile (Q3). The lower whisker is Q1 − 1.5×IQR and the upper whisker is Q3 + 1.5×IQR. The violins show the dispersion of the numerical data via a kernel density function.
Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max
---|---|---|---|---|---|---|---|---
Thickness (mm) | 690 | 2.9 | 1.19 | 0.82 | 2.4 | 3.14 | 3.6 | 6.5
Hardness (N) | 690 | 3.42 | 0.85 | 0.17 | 3.1 | 3.48 | 4 | 7.98
Friability (%) | 690 | 0.55 | 0.27 | 0.1 | 0.4 | 0.57 | 0.68 | 3.45
Punch size (mm) | 690 | 5.6 | 4.1 | 0 | 0 | 8 | 8 | 16
Disintegration time (s) | 690 | 41.62 | 40.37 | 0.47 | 20 | 32 | 51.9 | 623
Filler Avicel [%] | 690 | 0.86 | 6.48 | 0 | 0 | 0 | 0 | 51.02 |
Filler Mannitol [%] | 690 | 25.26 | 27.5 | 0 | 0 | 15.79 | 48.52 | 93.72 |
Binder Avicel [%] | 690 | 2.71 | 9.93 | 0 | 0 | 0 | 0 | 58.42 |
Binder HPMC [%] | 690 | 0.02 | 0.19 | 0 | 0 | 0 | 0 | 2 |
Binder Microcrystalline cellulose [%] | 690 | 15.36 | 23.94 | 0 | 0 | 0 | 23.65 | 89.29 |
Binder PVP K30 [%] | 690 | 0.09 | 0.85 | 0 | 0 | 0 | 0 | 11.11
Disintegrants Crospovidone [%] | 690 | 5.05 | 11.32 | 0 | 0 | 0 | 5.15 | 78.95 |
Disintegrants Croscarmellose sodium [%] | 690 | 4.44 | 10.26 | 0 | 0 | 0 | 4.1 | 71.43
Disintegrants Indion 414 [%] | 690 | 0.12 | 1.08 | 0 | 0 | 0 | 0 | 13.3 |
Disintegrants Polyplasdone XL [%] | 690 | 0.05 | 0.7 | 0 | 0 | 0 | 0 | 11.11
Disintegrants Pregelatinized starch [%] | 690 | 0.03 | 0.42 | 0 | 0 | 0 | 0 | 8.65 |
Disintegrants Sodium starch glycolate [%] | 690 | 2.52 | 7.24 | 0 | 0 | 0 | 2.22 | 62.5 |
Lubricant Magnesium stearate [%] | 690 | 2.8 | 3.64 | 0 | 1.05 | 1.81 | 2.42 | 28.57 |
Lubricant Sodium stearyl fumarate [%] | 690 | 0.02 | 0.15 | 0 | 0 | 0 | 0 | 1.12 |
The raw and curated data are available at https://doi.org/10.6084/m9.figshare.25880377 (raw database) and https://doi.org/10.6084/m9.figshare.25880560 (curated database), accessed on 22 May 2024.
Feature selection and development of the final model were conducted automatically with the TPOT AutoML method. The dimensions of the TPOT AutoML run are provided in Table
ML Techniques | RMSE (s) | NRMSE (%) | R2 | MAE | MSE |
---|---|---|---|---|---|
DTR | 33.18 | 10.0 | 0.28 | 13.28 | 1156.34 |
GBR | 27.1 | 9.0 | 0.5 | 12.81 | 863.83 |
RFR | 25.98 | 7.0 | 0.57 | 12.39 | 745.91 |
ETR | 23.37 | 7.0 | 0.65 | 9.55 | 608.58 |
LASSO | 26.38 | 8.0 | 0.55 | 13.38 | 761.85 |
SVM | 30.04 | 9.0 | 0.4 | 14.18 | 970.99 |
DL (5-fold CV) | 30.6 | 6.89 | 0.61 | 17.1 | 940.34
DL (10-fold CV) | 27.9 | 6.29 | 0.79 | 14.8 | 782.66
As expected, ML techniques, and the ETR in particular, produced the best pipeline in the TPOT AutoML analysis of the curated dataset. It is also clear, however, that the deep learning model, once hyperparameter-tuned, matched or exceeded its ML counterparts, achieving strong R2 values and NRMSE percentages. After the initial evaluation of the deep learning models trained with a five-fold cross-validation scheme, it was found that when the DL model was trained with three hidden layers of 100 neurons each, 56 input neurons, and one output neuron, using a tanh activation function and 2200 epochs combined with a 10-fold cross-validation scheme, the model accuracy improved significantly. This improvement was reflected in the NRMSE and R2 values, as shown in Fig.
The input variables were categorized into two main groups: composition and manufacturing parameters. Features below the variable-importance threshold were eliminated, except for those in the composition group. The final input vector consisted of 18 inputs (Table
Feature | Feature type | Scaled feature importance |
---|---|---|
Crospovidone [%] | Disintegrant, Composition | 1
Microcrystalline cellulose [%] | Binder, Composition | 0.744682848 |
Sodium starch glycolate [%] | Disintegrant, Composition | 0.488169243 |
Friability | Manufacturing Parameter | 0.251528875
Avicel [%] | Filler, Composition | 0.104084555 |
Indion 414 [%] | Disintegrant, Composition | 0.089409042 |
Thickness (mm) | Manufacturing Parameter | 0.087242712
Pregelatinized starch [%] | Disintegrant, Composition | 0.081137918 |
Polyplasdon XL [%] | Disintegrant, Composition | 0.074485462 |
HPMC [%] | Binder, Composition | 0.045193837 |
Croscarmellose sodium [%] | Disintegrant, Composition | 0.039986272
Avicel [%] | Binder, Composition | 0.016576396 |
Sodium stearyl fumarate [%] | Lubricant, Composition | 0.005441542 |
Mannitol [%] | Filler, Composition | 0.00258793 |
Punch | Manufacturing Parameter | 0.002117971 |
Hardness | Manufacturing Parameter | 0.001023564 |
Magnesium stearate [%] | Lubricant, Composition | 0.000947287 |
PVP K30 [%] | Binder, Composition | 0.000403805 |
Table
The SHAP summary graph in Fig.
A higher disintegration time (DT) is predicted with a greater concentration of disintegrants such as crospovidone and croscarmellose sodium. Conversely, for fillers like mannitol and Avicel, higher quantities lead to a lower DT. For binders like microcrystalline cellulose (MCC), hydroxypropyl methylcellulose (HPMC), and polyvinylpyrrolidone (PVP K30), two distinct effects are observed: higher amounts of HPMC and PVP K30 result in a higher DT, while MCC decreases the DT at higher concentrations. Lubricants like magnesium stearate (MgSt) and sodium stearyl fumarate (SSF) also play a role. SSF tends to increase the DT, likely due to its hydrophilic properties. On the other hand, a higher amount of MgSt, which is more lipophilic, lowers the DT due to the occlusion effect (
OFDTs have experienced a significant increase in demand over the past decade, leading to rapid growth in the pharmaceutical sector. Oral drug delivery remains the preferred method for administering many drugs. Advances in technology have inspired researchers to develop OFDTs that improve patient compliance and convenience. These tablets disintegrate upon administration without the need for water, making them popular and useful for various patient populations, particularly pediatric and geriatric individuals who may have difficulty swallowing traditional tablets and capsules (
Pharmaceutical formulation manufacturing currently relies on the trial-and-error method, which is both inefficient and time-consuming. In recent years, ML has emerged as a solution that can generate data-driven predictions from existing experimental data, opening up significant possibilities for designing optimal formulations. A well-established ML algorithm can greatly speed up the development process, optimize formulations, save costs, and maintain product consistency (
In the present study, as mentioned in the methodology section, the developed models were analyzed to determine which produced better outcomes. The outcomes displayed in Table
OFDTs are a promising way to achieve rapid pharmacological action and offer advantages over traditional dosage forms already on the market. The conventional, trial-and-error approach to formulation development is tedious and demanding. In contrast, an ML-driven development technique accelerates the process by allowing scientists to produce accurate predictions efficiently. In this study, ML and DL models were effectively created to forecast the DT of OFDTs. Although ML models are often deemed inscrutable black boxes, theoretical approaches such as Shapley additive explanations give us an approximate understanding of what is happening inside the black box. The study outcomes showed that the proposed ML models can precisely predict the DT of OFDTs, with DL exhibiting better performance and lower complexity than the other established models. DL could therefore be applied to further areas of pharmaceutical research. Its anticipated benefits include a substantial reduction in the duration of therapeutic product development and a decrease in the quantity of materials required. Furthermore, the interdisciplinary fusion of pharmaceutics and AI has the potential to transform pharmaceutical research from experience-based studies to data-driven approaches. Various ML techniques will be investigated in the future to forecast optimal formulations more effectively.