Using amino acid characteristics to identify influenza B virus pathogenicity | infectious diseases of poverty



To describe the transmission dynamics of IBV, surveillance data from 1997 to 2020 were collected from the United States Centers for Disease Control and Prevention ( Sparse data from the 2020-2021 and 2021-2022 flu seasons have been omitted due to the impact of COVID-19. With regard to pathogenicity, the percentage of IBV in all positive specimens of influenza virus per season was calculated. Because the number of positive tests changes each year, the positive test rate was used to reflect pathogenicity.

To construct a machine learning model, protein data from IBVs isolated from the US were downloaded from the public GISAID database ( [31, 32]. To reduce sequence similarity redundancy and cover the integrity of the viral genome, the raw data were processed prior to modelling [18]. The clustering algorithm was used to reduce redundancy of viral sequences. Only strains with full-length viral proteins were considered. Ambiguous amino acid residues were checked and carefully edited. Strains with inferior sequencing were also removed. The final data set included all 11 influenza virus proteins (PB2, PB1, PA, HA, NP, NA, NB, M1, BM2, NS1 and NEP) from 1724 strains (see Supplementary File 1).

Signature amino acid position

Viral proteins have important biological functions and play key roles in infection and transmission. The total length of the 11 viral proteins was 4708 amino acids. Although rapid mutation rates were observed, most amino acid residues were conserved in the 11 viral proteins. Signature positions have been verified to reduce computational complexity. Entropies in each position of the 11 viral proteins were calculated and measured with ({E}_{i}= -sum _{j=1}^{21}{P}_{i,j}mathrm{log}left({P}_{i,j}right ))Where ({P}_{i,j}) is the frequency of the amino acid (j) instead of (I). Deletion or insertion was also considered. High values ​​reflect frequent mutations at a specific position [33].

Composition of the amino acids

To identify the pathogenicity of IBV using a machine learning technique, the features should be encoded for amino acids in signature positions as input. In this article, six different coding algorithms were used from multiple perspectives, including compositional information, site-specific information, and physicochemical properties. The AAC is a simple descriptor for the IBV viral protein sequence [27]. The AAC method calculates the frequency of an amino acid in signature positions. The gap (deletion or insertion) was also taken into account. A 21-dimensional feature vector was used to represent each strain.


The PC-PseAAC is an updated AAC that calculates the parallel correlation of any two amino acids in a protein or peptide sequence [28]. For each strain used in this article, the PC-PseAAC trait vector is measured as

$$PC-PseAAC={links[{fv}_{1},dots ,{fv}_{21},{fv}_{21+1},dots ,{fv}_{20+uplambda }right]}^{T},$$


$${fv}_{u}=left{begin{array}{c}frac{{f}_{u}}{{sum }_{i=1}^{21}{f }_{i}+w{sum }_{j=1}^{uplambda }{theta }_{j}}, 1le ule 21 frac{w{theta }_ {u-21}}{{sum }_{i=1}^{21}{f}_{i}+w{sum }_{j=1}^{uplambda }{theta }_ {j}}, 21+1le ule 21+lambda end{array}right..$$

Here, (u) is an integer that changes with (uplambda); ({fv}_{u}) ((1le ule 21)) represents the normalized frequency of occurrence of the 20 amino acids and a gap for each strain; λ represents the highest level of correlation along signature positions; ({theta}_{j}) ((j=mathrm{1,2},dots,uplambda)) is the correlation function that measures the (j)-Tier sequence order correlation between all (j)-th most contiguous residues along signature positions [18].

G-Gap Dipeptide Composition

G-gap dipeptide composition (GGAP) measures dipeptide composition coupled with local order information of any two interval residues within protein sequences. GGAP is represented as

$$GGAPleft(gright)=left({fv}_{1}^{g},{fv}_{2}^{g},dots ,{fv}_{441}^{ g}right),$$

Where ({fv}_{i}^{g}) is the frequency of (I)-ten ((I)= 1,2, …, 441) g-gap dipeptide in signature positions [18]. The dimension of the GGAP feature vector is 21 x 21 = 441. Deletion or insertion was also calculated.

20-bit functions

In addition to methods based on the abundance of the amino acid, features about position-specific information and physicochemical properties were also used. The 20 standard amino acids were grouped according to five physicochemical properties: polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [29]. For each physicochemical property, the 20 amino acids were grouped into three groups and deletion/insertion was considered as the fourth group [18]. A total of 20 groups were achieved for each alphabet in the signature positions. Each residue was encoded as a 20-bit vector with 0/1 elements, with the position of the bit set to 1 if the residue belonged to the corresponding group and 0 otherwise. The signature positions in this work were determined using the method scrutinized by entropy. The top k residues with the highest entropy values ​​were selected and the dimension of the feature vector was 20 × k [18].

21-bit functions

For position-specific information of signature positions, each alphabet was encoded into a 21-bit 0/1 vector as in a one-hot encoding, e.g. B. Ala by 1,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0 or delete/insert 0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,1). Therefore, the top k residues were encoded with a 21 × k dimensional feature vector [18].

Overlapping property traits

Each amino acid was classified into 10 groups based on overlapping physicochemical properties [30]. The 10 physicochemical properties and their corresponding amino acid groups were as follows: Aromatic = {F, Y, W, H}, Negative = {D, E}, Positive = {K, H, R}, Polar = {N, Q , S, D, E, C, T, K, R, H, Y, W}, Hydrophobic = {A, G, C, T, I, V, L, K, H, F, Y, W, M }, Aliphatic = {I, V, L}, Tiny = {A, S, G, C}, Supercharged = {K, H, R, D, E}, Small = {P, N, D, T, C , A, G, S, V} and proline = {A, S, G, C}. Deletion/insertion was considered as the 11th group. The alphabet in the signature positions was then encoded by an 11-dimensional 0/1 vector. The position of the vector was set to 1 if the residue belonged to the physicochemical trait and to 0 otherwise. In this article, the top k residues were encoded with an 11xk trait vector [18].

HF predictor

The RF algorithm was used to output the informative features about the class label and the probability prediction [18]. R 3.5.0 (Lucent Technologies, Jasmine Mountain, USA) was used to run the RF algorithm and the tree number was set to 500 by default [34].

Framework for pathogenicity identification

The framework for identifying the pathogenicity of IBV is shown in FIG. Two types of traits have been used hierarchically to represent IBV: amino acid traits and informative traits [27]. Amino acid traits were supplied directly from 67 trait descriptors and input into the RF predictors. Subsequently, the meaningful features for the class label and the probabilistic prediction were generated and further optimized. The optimal subset of informative features to represent each tribe had small dimensions and included knowledge from different perspectives that were expected to improve the performance of the identification model.

Fig. 1

Flow chart of pathogenicity identification of IBV. The 40 entropy-based signature positions were first checked after the data was downloaded and cleaned. Six amino acid encoding methods with tunable parameters were used to extract features. Then 67 descriptors were proposed and two types of informative outputs from the RF method were obtained, which should be further optimized with the mRMR algorithm and the SFS strategy. Finally, each tribe was represented by two optimized informative features with the low dimensions “class” and “probability”. These optimal subsets were used to construct predictive models

The six amino acid coding algorithms were AAC, PC-PseAAC, GGAP, 20-bit feature (BIT20), 21-bit feature (BIT21), and overlapping property feature (OLP). The variable k is the common parameter for BIT20, BIT21 and OLP and controls the dimension of amino acid features. k varied from 4 to 40 in increments of 4. The maximum was set to 40 because there were 40 signature positions. The 67 trait descriptors under different parameters were constructed (Table 1). The class and probability features were then provided by each RF model. The class property is the predicted class designation. The positive samples were marked with 1 and the negative samples with 0. The probabilistic feature is the probability of positive marking. For each type of informative feature, the 67 values ​​were concatenated into a new vector. Each strain was then represented by two informative traits.

Table 1 Summary of feature description and feature number

In this paper, two 67-dimensional features have been further optimized to reduce computational complexity and increase performance. The minimum redundancy maximum relevance (mRMR) algorithm was used to rank informative features by importance values [35]. In addition, the forward sequential search (SFS) strategy was used to sequentially increase the informative features from the rank list. The subset with the best performance was considered to have the optimal characteristics and proposed to construct the final model for pathogenicity identification [27].

performance evaluation

Four popular metrics for performance evaluation, Sensitivity (SN), Specificity (SP), Accuracy (ACC), and Matthews Correlation Coefficient (MCC), were used as follows:

$$SN=frac{TP}{TP+FN}times 100%$$

$$SP=frac{TN}{TN+FP}times 100%$$

$$ACC=frac{TP+TN}{TP+TN+FP+FN}times 100%$$

$$MCC=frac{TPtimes TN+FPtimes FN}{sqrt{left(TP+FNright) left(TP+FPright) left(TN+FNright) left (TN+FPright)}}times 100%$$

where TP indicates the correct number of strains with the high pathogenicity phenotype; TN represents the correct number of strains with the low pathogenicity phenotype; FP gives wrong number of strains with low pathogenicity phenotype; and FN is the wrong number of strains with the high pathogenicity phenotype.

The Receiver Operating Characteristic (ROC) curve was also used to assess overall performance [36]. The curve is constructed by plotting the true positive rate (TPR) versus the false positive rate (FPR) under various classification thresholds. The area under the ROC curve (AUC) was used to assess prediction performance. A larger AUC value indicates that the model is performing better [26].


Comments are closed.