Abstract
In clinical research, DNA microarrays are widely used to identify oncogenes, which are differentially expressed between two clinical states and are regarded as predictors of cancer prognosis. Owing to the heterogeneity of clinical samples, the differentially expressed genes (DEGs) discovered by current statistical methods or machine learning algorithms include a number of genes unrelated to the phenotypic differences between the compared samples, which consequently compromises the reliability of predictive models for cancer prognosis. In this study, we proposed a Bayesian nonparametric variable selection algorithm, a stochastic, hierarchical search method, to separate the cancer-related genes from the DEG lists. The importance of the... genes in the DEG lists can be inferred from the posterior distribution of the predicted clinical endpoints, which can be simulated by the Markov chain Monte Carlo (MCMC) algorithm. The cancer-related genes were identified according to their importance and used to construct models for the prediction of three clinical endpoints, namely the estrogen receptor status (ER status) of breast cancer patients, the preoperative treatment response of breast cancer and the overall survival milestone outcome of acute myeloid leukemia (OS of AML). The prediction accuracies for preoperative treatment response, ER status and OS of AML were 86%, 89% and 58%, respectively, and the corresponding Matthews correlation coefficients were 0.42, 0.77 and 0.33, which were higher than those reported in previous studies. Furthermore, most of the genes identified by our method have been reported as oncogenes in the literature. Our results demonstrate that the Bayesian nonparametric variable selection algorithm proposed in the current study can efficiently identify oncogenes for cancer prognosis and enhance the performance of predictive models.
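
For readers unfamiliar with the two performance measures quoted in the abstract, the following is a minimal sketch of how prediction accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix. The function names and the example counts are illustrative assumptions; they are not taken from the study's data or code.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of samples whose class was predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical, imbalanced confusion matrix (NOT data from the study):
# 15 positive and 85 negative samples.
example_counts = dict(tp=10, tn=80, fp=5, fn=5)
example_accuracy = accuracy(**example_counts)
example_mcc = matthews_corrcoef(**example_counts)
print(f"accuracy = {example_accuracy:.2f}, MCC = {example_mcc:.2f}")
```

With imbalanced classes such as these hypothetical counts, accuracy can be high while the Matthews correlation coefficient remains moderate, which is one reason both measures are usually reported together.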