Probabilistically sampled and spectrally clustered plant species using phenotypic characteristics

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11927
Author(s):  
Aditya A. Shastri ◽  
Kapil Ahuja ◽  
Milind B. Ratnaparkhe ◽  
Yann Busnel

Phenotypic characteristics of a plant species refer to its physical properties as cataloged by plant biologists at different research centers around the world. Clustering species based upon their phenotypic characteristics is used to obtain diverse sets of parents that are useful in breeding programs. The Hierarchical Clustering (HC) algorithm is the current standard in clustering of phenotypic data. This algorithm suffers from low accuracy and high computational complexity. To address the accuracy challenge, we propose the use of the Spectral Clustering (SC) algorithm. To make the algorithm computationally cheap, we propose using sampling, specifically Pivotal Sampling, which is probability based. Since the application of sampling to phenotypic data has not been explored much, for effective comparison, another sampling technique called Vector Quantization (VQ) is adapted for this data as well. VQ has recently generated promising results for genotypic data. The novelty of our SC with Pivotal Sampling algorithm is in constructing the crucial similarity matrix for the clustering algorithm and defining probabilities for the sampling technique. Although our algorithm can be applied to any plant species, we tested it on the phenotypic data obtained from about 2,400 Soybean species. SC with Pivotal Sampling achieves substantially higher accuracy (in terms of Silhouette Values) than all the other proposed competitive clustering with sampling algorithms (i.e., SC with VQ, HC with Pivotal Sampling, and HC with VQ). The complexities of our SC with Pivotal Sampling algorithm and these three variants are almost the same because of the involved sampling. In addition, SC with Pivotal Sampling outperforms the standard HC algorithm in both accuracy and computational complexity. We experimentally show that our algorithm is up to 45% more accurate than HC. The computational complexity of our algorithm is more than an order of magnitude lower than that of HC.
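The abstract does not spell out how Pivotal Sampling works, so the following is only a minimal sketch of the generic sequential pivotal method, not the authors' implementation. The function name `pivotal_sample` and the inclusion probabilities in the usage lines are hypothetical; the paper defines its own probabilities, which are not reproduced here. The sketch assumes the probabilities sum to an integer, which then becomes the fixed sample size.

```python
import random


def pivotal_sample(probs, rng=None, eps=1e-9):
    """Sequential pivotal sampling (generic sketch, not the paper's code).

    probs: inclusion probabilities in [0, 1], assumed to sum to an integer n.
    Returns the indices of the n selected units.
    """
    rng = rng or random.Random()
    p = list(probs)
    i = 0  # index of the unit whose fate is still undecided
    for j in range(1, len(p)):
        a, b = p[i], p[j]
        if a <= eps or a >= 1 - eps:
            i = j  # unit i is already decided, unit j takes its place
            continue
        if b <= eps or b >= 1 - eps:
            continue  # unit j is already decided, keep unit i in play
        s = a + b
        if s < 1:
            # One unit absorbs the combined probability, the other is rejected.
            if rng.random() < b / s:
                p[i], p[j] = 0.0, s
                i = j
            else:
                p[i], p[j] = s, 0.0
        else:
            # One unit is selected, the other keeps the leftover probability.
            if rng.random() < (1 - b) / (2 - s):
                p[i], p[j] = 1.0, s - 1
                i = j
            else:
                p[i], p[j] = s - 1, 1.0
    return [k for k, v in enumerate(p) if v > 0.5]


# Hypothetical inclusion probabilities that sum to 2, so two units are drawn.
example_probs = [0.2, 0.5, 0.3, 0.6, 0.4]
sample = pivotal_sample(example_probs, random.Random(0))
print(sample)
```

Each pairwise "duel" preserves the expected value of both probabilities, which is why every unit is still selected with its prescribed probability while the sample size stays fixed.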

2021 ◽  
Vol 12 ◽  
Author(s):  
Aline Fugeray-Scarbel ◽  
Catherine Bastien ◽  
Mathilde Dupont-Nivet ◽  
Stéphane Lemarié

The present study is a transversal analysis of the interest in genomic selection for plant and animal species. It focuses on the arguments that may convince breeders to switch to genomic selection. The arguments are classified into three different “bricks.” The first brick considers the addition of genotyping to improve the accuracy of the prediction of breeding values. The second consists of saving costs and/or shortening the breeding cycle by replacing all or a portion of the phenotyping effort with genotyping. The third concerns population management to improve the choice of parents to either optimize crossbreeding or maintain genetic diversity. We analyse the relevance of these different bricks for a wide range of animal and plant species and seek to explain the differences between species according to their biological specificities and the organization of breeding programs.


HortScience ◽  
2009 ◽  
Vol 44 (3) ◽  
pp. 582-588 ◽  
Author(s):  
Innocenzo Muzzalupo ◽  
Francesca Stefanizzi ◽  
Enzo Perri

Olive (Olea europaea L.) is a species of great economic importance in the Mediterranean basin. Italy is very important for the olive industry; in fact, its olive genetic patrimony is very rich and characterized by an abundance of cultivars. At present, the majority of ancient landraces are vegetatively propagated on farms. It is likely that the number of cultivars is underestimated because of inadequate information on minor local cultivars that are widespread in different olive-growing areas. The existence of many cultivars reinforces the need for a reliable identification method. It is important to improve the ex situ plant germplasm collection and to characterize all cultivars adequately for future breeding programs. In the present report, we used 11 microsatellite loci to characterize 211 olive cultivars of an olive collection cultivated in six regions of southern Italy. These regions represent the major area for olive cultivation in Italy and have a strategic geographical location in the Mediterranean basin. The dendrogram obtained, using the unweighted pair group method with arithmetic mean clustering algorithm, depicts the pattern of relationships between the studied cultivars. There is a clear structuring of the variability relative to the geographic origin of olive cultivars. This work, owing to the very high number of Italian olive cultivars analyzed, highlights the degree and distribution of genetic diversity of this species for better exploitation of olive resources and for the design of plant breeding programs. In addition, the use of molecular markers, such as simple sequence repeats, is imperative to build a database for cultivar analysis, for traceability of processed food, and for appropriate management of olive germplasm collections.


2000 ◽  
Vol 13 ◽  
pp. 155-188 ◽  
Author(s):  
J. Cheng ◽  
M. J. Druzdzel

Stochastic sampling algorithms, while an attractive alternative to exact algorithms in very large Bayesian network models, have been observed to perform poorly in evidential reasoning with extremely unlikely evidence. To address this problem, we propose an adaptive importance sampling algorithm, AIS-BN, that shows promising convergence rates even under extreme conditions and seems to outperform the existing sampling algorithms consistently. Three sources of this performance improvement are (1) two heuristics for initialization of the importance function that are based on the theoretical properties of importance sampling in finite-dimensional integrals and the structural advantages of Bayesian networks, (2) a smooth learning method for the importance function, and (3) a dynamic weighting function for combining samples from different stages of the algorithm. We tested the performance of the AIS-BN algorithm along with two state-of-the-art general-purpose sampling algorithms, likelihood weighting (Fung & Chang, 1989; Shachter & Peot, 1989) and self-importance sampling (Shachter & Peot, 1989). We used in our tests three large real Bayesian network models available to the scientific community: the CPCS network (Pradhan et al., 1994), the PathFinder network (Heckerman, Horvitz, & Nathwani, 1990), and the ANDES network (Conati, Gertner, VanLehn, & Druzdzel, 1997), with evidence as unlikely as 10^-41. While the AIS-BN algorithm always performed better than the other two algorithms, in the majority of the test cases it achieved orders of magnitude improvement in precision of the results. Improvement in speed given a desired precision is even more dramatic, although we are unable to report numerical results here, as the other algorithms almost never achieved the precision reached even by the first few iterations of the AIS-BN algorithm.
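To make the baseline concrete, here is a minimal sketch of likelihood weighting, one of the two comparison algorithms named in the abstract. It is not AIS-BN and it does not use any of the networks from the paper. The two-node network (Rain -> WetGrass), its probabilities, and the function name `likelihood_weighting` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical two-node network: Rain -> WetGrass (illustrative numbers only).
P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}  # P(WetGrass=True | Rain)


def likelihood_weighting(n_samples, rng):
    """Estimate P(Rain=True | WetGrass=True) by likelihood weighting."""
    weighted_rain = 0.0
    total_weight = 0.0
    for _ in range(n_samples):
        # Sample the non-evidence variable from its prior.
        rain = rng.random() < P_RAIN
        # Weight the sample by the likelihood of the observed evidence.
        weight = P_WET_GIVEN_RAIN[rain]
        total_weight += weight
        if rain:
            weighted_rain += weight
    return weighted_rain / total_weight


# Exact posterior from Bayes' rule, for comparison.
exact = (P_RAIN * P_WET_GIVEN_RAIN[True]) / (
    P_RAIN * P_WET_GIVEN_RAIN[True] + (1 - P_RAIN) * P_WET_GIVEN_RAIN[False]
)
estimate = likelihood_weighting(20000, random.Random(0))
print(round(exact, 4), round(estimate, 4))
```

When the evidence is very unlikely, almost every sample receives a tiny weight and the estimate becomes noisy, which is the weakness that adaptive schemes such as the one described above aim to reduce.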


2011 ◽  
Vol 11 (spe) ◽  
pp. 16-26 ◽  
Author(s):  
Luiz Antônio dos Santos Dias

The paper analyses the puzzle of the food-energy-environmental security interaction, to which biofuels are part of the solution. It presents and discusses the contribution of genetic improvement to biofuels, with regard to the production of raw materials (oil and ethanol-producing plant species) and outlines perspectives, opportunities, risks and challenges, with a special focus on the Brazilian scene. Bioethanol is a consolidated biofuel owing largely to the sugarcane breeding programs. These programs released 111 sugarcane cultivars and were responsible for a 20.8 % gain in productivity of bioethanol (in m³ ha⁻¹) between 2000 and 2009. The program of Brazilian biodiesel production, initiated in 2005, had an annual growth rate of 10 % and the country is already the world's fourth largest producer. However, the contribution of breeding to biodiesel production is still modest, due to the lack of specific improvement programs for oil.


Author(s):  
Sikiru Adeniyi Atanda ◽  
Michael Olsen ◽  
Juan Burgueño ◽  
Jose Crossa ◽  
Daniel Dzidzienyo ◽  
...  

Key message: Historical data from breeding programs can be efficiently used to improve genomic selection accuracy, especially when the training set is optimized to subset individuals most informative of the target testing set. Abstract: The current strategy for large-scale implementation of genomic selection (GS) at the International Maize and Wheat Improvement Center (CIMMYT) global maize breeding program has been to train models using information from full-sibs in a “test-half-predict-half approach.” Although effective, this approach has limitations, as it requires large full-sib populations and limits the ability to shorten variety testing and breeding cycle times. The primary objective of this study was to identify optimal experimental and training set designs to maximize prediction accuracy of GS in CIMMYT’s maize breeding programs. Training set (TS) design strategies were evaluated to determine the most efficient use of phenotypic data collected on relatives for genomic prediction (GP) using datasets containing 849 (DS1) and 1389 (DS2) DH-lines evaluated as testcrosses in 2017 and 2018, respectively. Our results show there is merit in the use of multiple bi-parental populations as TS when selected using algorithms to maximize relatedness between the training and prediction sets. In a breeding program where relevant past breeding information is not readily available, the phenotyping expenditure can be spread across connected bi-parental populations by phenotyping only a small number of lines from each population. This significantly improves prediction accuracy compared to within-population prediction, especially when the TS for within full-sib prediction is small. Finally, we demonstrate that prediction accuracy in either sparse testing or “test-half-predict-half” can further be improved by optimizing which lines are planted for phenotyping and which lines are to be only genotyped for advancement based on GP.


Agronomy ◽  
2019 ◽  
Vol 9 (4) ◽  
pp. 196
Author(s):  
Sadal Hwang ◽  
Tong Geon Lee

Genetic mapping studies provide improved estimates for novel genomic loci, allelic effects and gene action controlling important traits. Such mapping studies are regularly performed by using a combination of genotypic data (e.g., genotyping markers tagging genetic variation within populations) and phenotypic data of appropriately structured mapping populations. Randomly obtained DNA information and more recent high-throughput genome sequencing efforts have dramatically increased the ability to obtain genetic markers for any plant species. Despite the constant and rapid growth of genotypic data, the steps needed to determine whether specific markers can be associated with genetic variation are often initially neglected, meaning that the ever-growing number of genotypic markers does not necessarily maximize the power of mapping studies and often generates false results. To address this issue, we present a framework for analyzing genotypic data while developing a genetic linkage map. Our goal is to raise awareness of a stepwise procedure in the development of genetic maps as well as to outline the current and potential contribution of this procedure to minimize bias caused by errors in genotypic datasets. Empirical results obtained from the R/qtl package for the statistical language/software R are presented with details of how we handled genotypic data to develop the genetic map of a major plant species. This study provides a stepwise procedure to correct pervasive errors in genotypic data while developing genetic maps. For use in custom follow-up studies, we provide input files and the R code.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5722 ◽  
Author(s):  
Wartini Ng ◽  
Budiman Minasny ◽  
Brendan Malone ◽  
Patrick Filippi

Background: The use of visible-near infrared (vis-NIR) spectroscopy for rapid soil characterisation has gained a lot of interest in recent times. Soil spectra absorbance from the visible-infrared range can be calibrated using regression models to predict a set of soil properties. The accuracy of these regression models relies heavily on the calibration set. The optimum sample size and the overall sample representativeness of the dataset could further improve the model performance. However, there is no guideline on which sampling method should be used for different dataset sizes. Methods: Here, we show that different sampling algorithms perform differently under different data sizes and different regression models (Cubist regression tree and Partial Least Square Regression (PLSR)). We analysed the effect of three sampling algorithms, Kennard-Stone (KS), conditioned Latin Hypercube Sampling (cLHS) and k-means clustering (KM), against random sampling on the prediction of up to five different soil properties (sand, clay, carbon content, cation exchange capacity and pH) on three datasets. These datasets have different coverages: a European continental dataset (LUCAS, n = 5,639), a regional dataset from Australia (Geeves, n = 379), and a local dataset from New South Wales, Australia (Hillston, n = 384). Calibration sample sizes ranging from 50 to 3,000 were derived and tested for the continental dataset, and from 50 to 200 samples for the regional and local datasets. Results: Overall, PLSR gives a better prediction than the Cubist model for the prediction of various soil properties. It is also less sensitive to the choice of sampling algorithm. The KM algorithm is more representative in the larger dataset up to a certain calibration sample size. The KS algorithm appears to be more efficient (as compared to random sampling) in small datasets; however, the prediction performance varied a lot between soil properties. The cLHS sampling algorithm is the most robust sampling method for multiple soil properties regardless of the sample size. Discussion: Our results suggested that the optimum calibration sample size relied on how much generalization the model had to create. The use of a sampling algorithm is more beneficial for larger datasets than for smaller datasets, where only small improvements can be made. KM is suitable for large datasets, KS is efficient in small datasets but results can be variable, while cLHS is less affected by sample size.
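As an illustration of one of the sampling algorithms compared above, the following is a minimal sketch of the standard Kennard-Stone selection rule: start from the two most distant points, then repeatedly add the point that is farthest from everything already selected. The function name `kennard_stone` and the toy coordinates are hypothetical, plain Euclidean distance is an assumption, and the study itself worked with soil spectra rather than points like these.

```python
import math


def kennard_stone(points, k):
    """Select k calibration points with the Kennard-Stone rule (sketch)."""
    n = len(points)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of points")

    def dist(a, b):
        return math.dist(points[a], points[b])

    # Step 1: start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(*pair),
    )
    selected = [first, second]
    # Distance from every point to its nearest selected point.
    nearest = [min(dist(i, first), dist(i, second)) for i in range(n)]

    # Step 2: repeatedly add the point farthest from the selected set.
    while len(selected) < k:
        candidate = max(
            (i for i in range(n) if i not in selected),
            key=lambda i: nearest[i],
        )
        selected.append(candidate)
        for i in range(n):
            nearest[i] = min(nearest[i], dist(i, candidate))
    return selected


# Hypothetical one-feature "spectra" laid out on a line.
toy_points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (4.9, 0.0)]
print(kennard_stone(toy_points, 3))  # [0, 2, 3]
```

Because the rule favours extreme and well-spread points, it covers the predictor space evenly, which is consistent with the abstract's observation that it helps most when the dataset is small.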


2017 ◽  
Vol 17 (04) ◽  
pp. 1750024 ◽  
Author(s):  
Qianwen Li ◽  
Zhihua Wei ◽  
Cairong Zhao

The region of interest (ROI) is the most important part of an image and expresses its effective content. Extracting regions of interest from images accurately and efficiently can reduce computational complexity and is essential for image analysis and understanding. In order to achieve the automatic extraction of regions of interest and obtain more accurate regions of interest, this paper proposes the Optimized Automatic Seeded Region Growing (OASRG) algorithm. The algorithm uses the affinity propagation (AP) clustering algorithm to extract the seeds automatically, and optimizes the traditional region growing algorithm with a regrowing strategy to obtain the regions of interest where target objects are contained. Experimental results show that our algorithm can automatically locate seeds and produce results as good as traditional region growing with seeds selected manually. Furthermore, the precision is improved and the extraction effect is better after the optimization with the regrowing strategy.
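For readers unfamiliar with the underlying technique, here is a minimal sketch of traditional seeded region growing on a small grayscale grid. It is not the OASRG algorithm: the seed is given by hand instead of being found with affinity propagation, and there is no regrowing step. The function name `region_grow`, the 4-neighbour connectivity, the fixed intensity tolerance, and the toy image are all assumptions made for illustration.

```python
from collections import deque


def region_grow(image, seed, tol):
    """Grow a region from `seed` over 4-connected pixels whose intensity
    differs from the seed intensity by at most `tol` (illustrative rule)."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (
                0 <= nr < rows
                and 0 <= nc < cols
                and (nr, nc) not in region
                and abs(image[nr][nc] - seed_value) <= tol
            ):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Hypothetical 3x3 image: a dark object in the top-left, bright background.
toy_image = [
    [10, 11, 50],
    [12, 10, 52],
    [49, 51, 50],
]
print(sorted(region_grow(toy_image, (0, 0), 3)))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The result depends heavily on where the seed is placed, which is why automatic seed selection is the part the paper sets out to improve.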


2004 ◽  
Vol 54 (4) ◽  
pp. 1049-1054 ◽  
Author(s):  
Véronique Roux ◽  
Didier Raoult

Gram-positive, spore-forming rods were isolated from blood cultures of three different patients. Based on phylogenetic analyses, these strains were placed within the Paenibacillus cluster and specific phenotypic characteristics for each strain were described. Levels of 16S rRNA gene sequence similarity between existing Paenibacillus species and the three novel strains 2301065T, 2301032T and 2301083T were 87·6–94·4, 88·5–95·4 and 87·5–96·0 %, respectively, and anteiso-branched C15 : 0 was the major fatty acid. On the basis of phenotypic data and phylogenetic inference, it is proposed that these strains should be designated Paenibacillus massiliensis sp. nov., Paenibacillus sanguinis sp. nov. and Paenibacillus timonensis sp. nov. The type strains are respectively strain 2301065T (=CIP 107939T=CCUG 48215T), strain 2301083T (=CIP 107938T=CCUG 48214T) and strain 2301032T (=CIP 108005T=CCUG 48216T).


2017 ◽  
Vol 6 (2) ◽  
pp. 107-114
Author(s):  
Paulina Ampomah ◽  
Kobina Yankson ◽  
Hugh Komla Akotoye ◽  
Elvis OforiAmeyaw ◽  
...  

Herbal medicines form a major component of traditional medicine in Africa because they are perceived to be efficacious and safe. The study aimed at investigating anti-malarial plants used by the indigenous people in the Central Region of Ghana. The study was conducted in three districts: Cape Coast (CC), Assin and Asikuma-Odoben-Brakwa (AOB). Ethnomedicinal data on antimalarial plants were collected using a convenience sampling technique consisting of field observation, collection of vouchers and semi-structured interviews of herbalists/herbal practitioners and the general public. Respondents comprised herbal practitioners/vendors (8%), traditional birth attendants (TBAs, 6%), chiefs/opinion leaders (2%), and the general public (84%). Females made up 54% of respondents and males 46%. Eighty-nine plant species belonging to 41 families were recorded as useful in treating malaria. Leaves were the commonest plant part used for herbal preparation (49.5%), followed by stem bark (21.2%), roots (14.1%), fruits (7.2%), seeds and whole plants (3.0% each) and flowers (2.0%). Thirty-two plant species belonging to 23 families were found to be common in all study areas. Herbal medicine patronage for malaria treatment in the Central Region of Ghana is high, with common species occurring in all study areas.

