Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during IVF

M VerMilyea; J M M Hall; S M Diakiw; A Johnston; T Nguyen; D Perugini; A Miller; A Picou; A P Murphy; M Perugini

doi:10.1093/humrep/deaa013

Human Reproduction ◽

10.1093/humrep/deaa013 ◽

2020 ◽

Vol 35 (4) ◽

pp. 770-784 ◽

Cited By ~ 15

Author(s):

M VerMilyea ◽

J M M Hall ◽

S M Diakiw ◽

A Johnston ◽

T Nguyen ◽

...

Keyword(s):

Light Microscope ◽

Predictive Accuracy ◽

Predictive Ability ◽

Time Lapse ◽

Clinical Pregnancy ◽

Embryo Selection ◽

Embryo Viability ◽

Blind Test ◽

Student’s t Test ◽

Optical Light Microscopy

Abstract STUDY QUESTION Can an artificial intelligence (AI)-based model predict human embryo viability using images captured by optical light microscopy? SUMMARY ANSWER We have combined computer vision image processing methods and deep learning techniques to create the non-invasive Life Whisperer AI model for robust prediction of embryo viability, as measured by clinical pregnancy outcome, using single static images of Day 5 blastocysts obtained from standard optical light microscope systems. WHAT IS KNOWN ALREADY Embryo selection following IVF is a critical factor in determining the success of ensuing pregnancy. Traditional morphokinetic grading by trained embryologists can be subjective and variable, and other complementary techniques, such as time-lapse imaging, require costly equipment and have not reliably demonstrated predictive ability for the endpoint of clinical pregnancy. AI methods are being investigated as a promising means for improving embryo selection and predicting implantation and pregnancy outcomes. STUDY DESIGN, SIZE, DURATION These studies involved analysis of retrospectively collected data including standard optical light microscope images and clinical outcomes of 8886 embryos from 11 different IVF clinics, across three different countries, between 2011 and 2018. PARTICIPANTS/MATERIALS, SETTING, METHODS The AI-based model was trained using static two-dimensional optical light microscope images with known clinical pregnancy outcome as measured by fetal heartbeat to provide a confidence score for prediction of pregnancy. Predictive accuracy was determined by evaluating sensitivity, specificity and overall weighted accuracy, and was visualized using histograms of the distributions of predictions. Comparison to embryologists’ predictive accuracy was performed using a binary classification approach and a 5-band ranking comparison. 
MAIN RESULTS AND THE ROLE OF CHANCE The Life Whisperer AI model showed a sensitivity of 70.1% for viable embryos while maintaining a specificity of 60.5% for non-viable embryos across three independent blind test sets from different clinics. The weighted overall accuracy in each blind test set was >63%, with a combined accuracy of 64.3% across both viable and non-viable embryos, demonstrating model robustness and generalizability beyond the result expected from chance. Distributions of predictions showed clear separation of correctly and incorrectly classified embryos. Binary comparison of viable/non-viable embryo classification demonstrated an improvement of 24.7% over embryologists’ accuracy (P = 0.047, n = 2, Student’s t test), and 5-band ranking comparison demonstrated an improvement of 42.0% over embryologists (P = 0.028, n = 2, Student’s t test). LIMITATIONS, REASONS FOR CAUTION The AI model developed here is limited to analysis of Day 5 embryos; therefore, further evaluation or modification of the model is needed to incorporate information from different time points. The endpoint described is clinical pregnancy as measured by fetal heartbeat, and this does not indicate the probability of live birth. The current investigation was performed with retrospectively collected data, and hence it will be of importance to collect data prospectively to assess real-world use of the AI model. WIDER IMPLICATIONS OF THE FINDINGS These studies demonstrated an improved predictive ability for evaluation of embryo viability when compared with embryologists’ traditional morphokinetic grading methods. The superior accuracy of the Life Whisperer AI model could lead to improved pregnancy success rates in IVF when used in a clinical setting. It could also potentially assist in standardization of embryo selection methods across multiple clinical environments, while eliminating the need for complex time-lapse imaging equipment. 
Finally, the cloud-based software application used to apply the Life Whisperer AI model in clinical practice makes it broadly applicable and globally scalable to IVF clinics worldwide. STUDY FUNDING/COMPETING INTEREST(S) Life Whisperer Diagnostics, Pty Ltd is a wholly owned subsidiary of the parent company, Presagen Pty Ltd. Funding for the study was provided by Presagen with grant funding received from the South Australian Government: Research, Commercialisation and Startup Fund (RCSF). ‘In kind’ support and embryology expertise to guide algorithm development were provided by Ovation Fertility. J.M.M.H., D.P. and M.P. are co-owners of Life Whisperer and Presagen. Presagen has filed a provisional patent for the technology described in this manuscript (52985P pending). A.P.M. owns stock in Life Whisperer, and S.M.D., A.J., T.N. and A.P.M. are employees of Life Whisperer.
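
The accuracy measures named in the abstract above (sensitivity, specificity and overall accuracy) can be illustrated with a minimal Python sketch. The confusion-matrix counts are hypothetical and chosen only for illustration, and the function name `binary_metrics` is an assumption. The abstract does not specify how its "weighted overall accuracy" is weighted, so plain accuracy is shown instead.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)   # viable embryos correctly called viable
    specificity = tn / (tn + fp)   # non-viable embryos correctly called non-viable
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, NOT taken from the study.
sens, spec, acc = binary_metrics(tp=70, fn=30, tn=60, fp=40)

# Plain accuracy equals the prevalence-weighted mean of sensitivity and specificity.
prevalence = (70 + 30) / 200
weighted = prevalence * sens + (1 - prevalence) * spec

print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

Because accuracy depends on the share of viable embryos in the test set, sensitivity and specificity are reported separately alongside it.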

P–263 Life Whisperer™, an AI-based algorithm to select non invasively best quality blastocysts for transfer: A multicenter analysis

Human Reproduction ◽

10.1093/humrep/deab130.262 ◽

2021 ◽

Vol 36 (Supplement_1) ◽

Author(s):

P Muño. Espert ◽

Y Galiana ◽

L Medrano ◽

J Ballester ◽

L Ortega ◽

...

Keyword(s):

Roc Curve ◽

Inner Cell Mass ◽

Cell Mass ◽

High Sensitivity ◽

Roc Curves ◽

Time Lapse ◽

Clinical Pregnancy ◽

Embryo Selection ◽

Computer Assisted ◽

Stage Duration

Abstract Study question Is the AI-based Life Whisperer™ (LW) tool suitable for evaluating blastocyst quality and predicting clinical pregnancy (CP) in couples undergoing ICSI cycles? Summary answer The LW blastocyst score is comparable to the scores of other classification methods. This AI model showed high sensitivity and a comparable specificity for CP. What is known already Morphology grading is the most widely used method for the selection and classification of embryos in clinical practice. However, this evaluation is subject to inter- and intra-observer variability among embryologists. Recently, research has focused on new embryo selection systems based on computer-assisted evaluation, such as time-lapse imaging with complex algorithms that allow the recognition of objective parameters of embryo morphology. Implementing these technologies requires substantial investments that are not available to all clinics. LW is a new AI-based embryo selection method that needs no specific hardware, as it is based on single blastocyst images taken with a routine microscope. Study design, size, duration Between 2017 and 2020, a total of 513 Day 5 blastocysts, obtained after ICSI in egg donation treatments, were included in this retrospective multicentre study. Day 5 embryos were evaluated with three classification methods: Gardner’s blastocyst grade (GB), the computer-derived Eeva output (EV) and the LW AI-supported system. The good-quality blastocysts were first evaluated using the GB and EV scores and subsequently compared with the LW scores. The sensitivity and specificity of LW were assessed to validate this system as a clinical pregnancy predictor.
Participants/materials, setting, methods A total of 513 Day 5 blastocysts, from 134 oocyte donation cycles, were first evaluated by GB score: expansion (1–6), and inner cell mass and trophectoderm (A–C). EV analyses the cell division timings P2 (duration of the 2-cell stage) and P3 (duration of the 3-cell stage), differentiating three categories: High, Medium and Low (VerMilyea et al., 2014). LW scores ranged from 1 to 10 and were obtained from a single HR image of each Day 5 blastocyst taken on an inverted microscope, with a threshold of >5 defining a viable blastocyst. The t-test and ROC curves were used for statistical analysis. Main results and the role of chance The mean LW score of blastocysts with a higher GB expansion score (≥4) was 7.48±0.09, while that of blastocysts with a lower GB expansion score (<4) was 4.69±0.3 (P < 0.001). The mean LW score of blastocysts with good GB morphology of the inner cell mass and trophectoderm (AA, AB, BA) was 7.98±0.1, while that of blastocysts with a lower-quality GB score (BB, BC, CB, CA, AC) was 6.36±0.156 (P < 0.001). The mean LW score of EV High blastocysts was 7.42±0.17, while that of EV Low blastocysts was 6.43±0.3 (P = 0.009). A correlation between EV and LW scores could be assessed, except for blastocysts scored Medium by EV. Therefore, a strong correlation between the GB and LW systems, as well as between GB+EV and LW, was found, and an equivalent usability of the LW tool could be confirmed. ROC curve analysis of the LW score for transferred embryos (N = 156) showed a high sensitivity (0.928) but a low specificity (0.154) at a threshold of 5. In our data, the ROC curve shows that a threshold of 8.46 could enhance the prediction of the clinical pregnancy rate (CPR), because at this point the specificity is higher than 0.5.
Limitations, reasons for caution The validation of the LW score against the GB and EV methods was carried out on a small number of embryos. Additionally, not all embryos had been transferred at the time of the analysis. Thus, a larger sample size is needed to improve the accuracy of these data and the specificity of the clinical prediction. Wider implications of the findings Blastocyst selection appears equivalent across all systems, but the LW tool is more objective and faster, saving significant time and cost without requiring substantial hardware investments. Additionally, the LW system shows almost the highest sensitivity, and its specificity may also improve through self-learning as more data are fed to the AI system, thus tailoring predictions to each laboratory’s unique environment. Trial registration number NA
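
The trade-off described above, where raising the LW threshold from 5 to 8.46 lowers sensitivity but raises specificity, can be sketched in a few lines of Python. The scores and outcomes below are hypothetical and the function name `sens_spec` is an assumption. The decision rule (score strictly greater than the threshold counts as viable) follows the "threshold >5" wording of the abstract.

```python
def sens_spec(scores, outcomes, threshold):
    """Classify score > threshold as viable; return (sensitivity, specificity)."""
    pairs = list(zip(scores, outcomes))
    tp = sum(1 for s, y in pairs if y == 1 and s > threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s <= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s <= threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores on the 1-10 scale and outcomes (1 = clinical pregnancy).
scores = [9.1, 8.7, 8.5, 7.9, 7.2, 6.8, 6.1, 5.5, 4.2, 3.0]
outcomes = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

low = sens_spec(scores, outcomes, 5.0)     # permissive threshold
high = sens_spec(scores, outcomes, 8.46)   # stricter threshold
print("threshold 5.0 :", low)
print("threshold 8.46:", high)
```

An ROC curve is this computation repeated over every candidate threshold, plotting sensitivity against 1 − specificity.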

Embryo selection using time-lapse analysis (Early Embryo Viability Assessment) in conjunction with standard morphology: a prospective two-center pilot study

Human Reproduction ◽

10.1093/humrep/dew207 ◽

2016 ◽

Vol 31 (11) ◽

pp. 2450-2457 ◽

Cited By ~ 32

Author(s):

Dorit C. Kieslinger ◽

Stefanie De Gheselle ◽

Cornelis B. Lambalk ◽

Petra De Sutter ◽

E. Hanna Kostelijk ◽

...

Keyword(s):

Pilot Study ◽

Time Lapse ◽

Early Embryo ◽

Embryo Selection ◽

Embryo Viability ◽

Viability Assessment

P–141 Artificial intelligence system for the automation of the blastocyst morphology evaluation in GERI Time-lapse Incubator

Human Reproduction ◽

10.1093/humrep/deab130.140 ◽

2021 ◽

Vol 36 (Supplement_1) ◽

Author(s):

E Pay. Bosch ◽

L Bori ◽

A Beltran ◽

V Naranjo ◽

M Meseguer

Keyword(s):

Artificial Intelligence ◽

Deep Learning ◽

Evaluation Process ◽

Time Lapse ◽

Embryo Selection ◽

Learning Approach ◽

Initial Attempt ◽

High Quality ◽

Blind Test ◽

Medium Quality

Abstract Study question Can an Artificial Intelligence (AI) system (hand-crafted vs. deep learning techniques) based on single embryo image analysis from a GERI time-lapse incubator (TL) evaluate blastocyst morphology? Summary answer Our hand-crafted method, trained with blastocyst images from Geri-TL, evaluated and classified parameters related to embryo quality with a global precision of 63.7% in a blind test. What is known already Recent studies have shown that AI can improve automatic grading and embryo selection. The approaches taken so far differ widely, but they all conclude that there is great potential (Rad2019, Manoj2020, Thirumalaraju2020). Conventional embryo evaluation is performed manually based on the morphology of the blastocyst; therefore, it should be possible to replicate this process. In this study, we implemented different methods to analyse the behaviour and performance of an AI doing embryology tasks. Study design, size, duration Our study consisted of a retrospective analysis for the automation of embryo evaluation with different approaches. We developed our models based on 715 images extracted from GERI TL videos (Genea, Australia) from a single IVF center. The database was divided into three classes depending on the quality of the embryo according to ASEBIR morphology criteria (high, medium and low quality). The images were split into 70% for training, 15% for validation and 15% for testing. Participants/materials, setting, methods We developed an automated AI algorithm to extract and classify features from images at 111.5 hpi of embryos cultured in GERI TL. Hand-crafted features from texture information were extracted to feed the classification algorithm. A statistical analysis was carried out to select the most discriminative variables. In parallel, a deep neural network was built to compare the performance of automatic and hand-crafted features. Additionally, we trained a model to detect the embryo in the well.
Main results and the role of chance Sensitivity for high-, medium- and low-quality embryos was 73%, 56% and 72% for the hand-crafted method and 76%, 53% and 22% for the deep learning approach, respectively. Precision for high-, medium- and low-quality embryos was 66%, 56% and 76% for the hand-crafted method and 40%, 60% and 55% for the deep learning approach, respectively. The global accuracy of the two methods was 64% and 50%, respectively. We also noticed that results improved when we applied our embryo masks, which exclude irrelevant information. In this initial attempt, our results showed that it is possible to replicate the embryo evaluation process. Limitations, reasons for caution The poor results of our deep learning model, due to the lack of an extensive dataset, did not yield a model applicable to the clinic. However, this preliminary study let us conclude that the approach has high potential. Wider implications of the findings Our results showed the potential for automating the embryo evaluation process in Geri TL, where the available software for embryo selection does not provide such an option. Our findings led to an increase in objectivity, a reduction in the embryologist’s workload and the investigation of new, previously unknown morphological variables. Trial registration number Not applicable
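
The per-class sensitivity, per-class precision and global accuracy reported above all derive from one three-class confusion matrix. The following is a minimal sketch of that calculation. The matrix values are hypothetical (not the study's data) and the function name `per_class_metrics` is an assumption.

```python
def per_class_metrics(matrix):
    """matrix[i][j] = number of true class-i embryos predicted as class j.

    Returns (sensitivity per class, precision per class, global accuracy).
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Sensitivity (recall): correct predictions divided by the row total.
    sensitivity = [matrix[i][i] / sum(matrix[i]) for i in range(k)]
    # Precision: correct predictions divided by the column total.
    precision = [matrix[j][j] / sum(matrix[i][j] for i in range(k)) for j in range(k)]
    accuracy = sum(matrix[i][i] for i in range(k)) / total
    return sensitivity, precision, accuracy


# Hypothetical confusion matrix; rows and columns ordered high, medium, low quality.
confusion = [
    [8, 2, 0],
    [3, 5, 2],
    [0, 2, 8],
]
sensitivity, precision, accuracy = per_class_metrics(confusion)
for name, s, p in zip(["high", "medium", "low"], sensitivity, precision):
    print(f"{name:>6}: sensitivity={s:.2f} precision={p:.2f}")
print(f"global accuracy={accuracy:.2f}")
```

A class can have high sensitivity and low precision at the same time, so both are needed to describe a classifier like the ones compared here.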

O-088 Performance of a commercial artificial intelligence software for embryo selection (Embryoscope/KIDScore™) on predicting biopsied and non-biopsied blastocyst clinical pregnancy according to score subgroups

Human Reproduction ◽

10.1093/humrep/deab125.018 ◽

2021 ◽

Vol 36 (Supplement_1) ◽

Author(s):

R Erberelli ◽

C K Jacobs ◽

M Nicolielo ◽

E L Motta ◽

J R Alegretti ◽

...

Keyword(s):

Artificial Intelligence ◽

Maternal Age ◽

Time Lapse ◽

Clinical Pregnancy ◽

Embryo Selection ◽

Chi Square ◽

Biochemical Pregnancy ◽

Registration Number ◽

Gestational Sac

Abstract Study question How informative is the KIDScore version 3 score for day 5 blastocysts with respect to clinical pregnancy in biopsied and non-biopsied embryos? Summary answer Potential clinical pregnancy is predictable according to score grades (above 7.0), regardless of the use of PGT-A, in blastocysts on day 5. What is known already Time-lapse technology, along with the use of artificial intelligence (A.I.), has promoted a new spectrum of tools to improve embryo selection. Several software packages and algorithms have been launched in the ART field in recent years, with the prospect of providing a substantial boost in IVF outcomes. KIDScore is one of these new tools, developed based on morphology and morphokinetics of embryo development with known clinical outcome and validated with transfer of blastocysts on day 3 or 5. Yet in-house validation of any A.I. tool is highly recommended before it is applied in clinical decisions. Study design, size, duration Retrospective cohort study in a single private IVF center. Records of positive or negative clinical pregnancy (presence/absence of fetal heartbeat and gestational sac) were included for patients’ autologous and donated cycles using fresh and frozen oocytes, with or without PGT-A, cultured in the EmbryoScope® Plus incubator, that underwent single embryo transfers (total sET, n = 415; euploid = 228, non-biopsied = 187) of blastocysts developed on day 5. Biochemical pregnancies and miscarriages were excluded from this analysis. Participants/materials, setting, methods KIDScore™ Day 5 scores of negative and positive clinical pregnancies were stratified into three subgroups according to V3 score intervals: subgroup 1: 1.0-3.9 (n = 29), subgroup 2: 4.0-6.9 (n = 154) and subgroup 3: 7.0-9.9 (n = 232). sET of euploid embryos (n = 228) were also analyzed in the described subgroups (subgroup 1: n = 17; subgroup 2: n = 93 and subgroup 3: n = 118, respectively).
Mann-Whitney, Chi-square and Fisher tests were used for statistical analysis; values of p < 0.05 were considered significant. Main results and the role of chance Maternal age was similar between overall positive and negative pregnancies (38.48±3.86 versus 38.75±3.83, p = 0.3573). When comparing score subgroups, overall positive clinical pregnancy rates were significantly different [subgroup 1: 20.7% (6/29); subgroup 2: 43.5% (67/154); subgroup 3: 63.8% (148/232), p < 0.0001]. Subgroup 1 versus subgroup 2 also differed in positive clinical pregnancy (p = 0.023), and subgroup 3 showed a higher clinical pregnancy rate than subgroups 1 and 2 together (scores from 1.0 to 6.9, p < 0.0001). Analyzing only euploid embryos, positive clinical pregnancy rates were also significantly different between subgroups [subgroup 1: 35.3% (6/17); subgroup 2: 45.2% (42/93); subgroup 3: 61.0% (72/118), p = 0.024, and subgroup 1 + 2 versus subgroup 3, p = 0.0115]. Maternal age was similar between positive and negative clinical pregnancies in PGT-A cycles (37.81±1.61 versus 38.38±3.25, p = 0.069). Analyzing only non-biopsied embryos, positive clinical pregnancy rates were also significantly different between subgroups [subgroup 1: 0.0% (0/12); subgroup 2: 41.0% (25/61); subgroup 3: 66.7% (76/114), p = 0.0343, and subgroup 1 + 2 versus subgroup 3, p < 0.0001]. Maternal age was also similar between positive and negative clinical pregnancies in non-biopsied cycles (39.40±4.75 versus 39.22±4.43, p = 0.7816). Positive clinical pregnancy rates in subgroup 3 were similar in biopsied and non-biopsied embryos (61% versus 66.7%, p = 0.4133). Limitations, reasons for caution The retrospective nature of the study and the small amount of data in subgroup 1 (1.0-3.9 score), since these embryos are naturally the last to be chosen for transfer.
Wider implications of the findings Differences in positive clinical pregnancy between subgroups (mainly for scores greater than 7.0) reinforce the use of A.I. as a complementary tool for embryo selection. Interestingly, positive clinical pregnancy rates in the 7.0-9.9 subgroup were similar in euploid and non-biopsied embryos, strengthening another potential application of A.I. in transposing the embryo aneuploidy barrier. Trial registration number Not Applicable
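
The overall subgroup comparison reported above (6/29, 67/154 and 148/232 positive clinical pregnancies) can be reproduced approximately with a Pearson chi-square test on the 3×2 table. This is a sketch under that assumption: the abstract lists Mann-Whitney, Chi-square and Fisher tests without saying which one produced each p-value. The function name `chi_square_2xk` is hypothetical.

```python
import math


def chi_square_2xk(positives, totals):
    """Pearson chi-square statistic for k groups with a binary outcome."""
    overall = sum(positives) / sum(totals)   # pooled positive rate
    chi2 = 0.0
    for pos, n in zip(positives, totals):
        exp_pos = n * overall                # expected positives under H0
        exp_neg = n * (1 - overall)          # expected negatives under H0
        chi2 += (pos - exp_pos) ** 2 / exp_pos
        chi2 += ((n - pos) - exp_neg) ** 2 / exp_neg
    return chi2, len(totals) - 1


# Counts taken from the abstract: subgroups 1, 2 and 3.
chi2, df = chi_square_2xk([6, 67, 148], [29, 154, 232])

# For exactly 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi2={chi2:.2f}, df={df}, p={p_value:.2e}")
```

Under these assumptions the statistic comes out well below the reported p < 0.0001 threshold, which is consistent with the abstract.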

Prenatal Imaging: Egg Freezing, Embryo Selection and the Visual Politics of Reproductive Time

Catalyst Feminism Theory Technoscience ◽

10.28968/cftt.v4i2.29908 ◽

2018 ◽

Vol 4 (2) ◽

pp. 1-35 ◽

Cited By ~ 2

Author(s):

Lucy Van de Wiel

Keyword(s):

Visual Information ◽

Time Lapse ◽

Reproductive Technologies ◽

Embryo Selection ◽

Embryo Viability ◽

Reproductive Process ◽

Vitro Fertilization ◽

Egg Freezing

In the last decade, two influential new reproductive technologies have been introduced that are changing the face of in vitro fertilization (IVF): egg freezing for “fertility preservation” and time-lapse embryo imaging for embryo selection. With these technologies emerge alternative visual representations of the assisted reproductive process and its relation to time. First, frozen egg photographs provide a lens onto contemporary reconfigurations of reproductive aging and stage a life-death dyad between the frozen cell and the embodied self, which drives treatment rationales for egg freezing. Second, time-lapse embryo imaging creates visual recordings of developing embryos in the incubator; the resultant quantified visual information can then be repurposed as a tool for predicting embryo viability. As these two sets of prenatal images reference dying eggs and non-viable embryos, they demonstrate a necropolitics of reproductive time, in which not only the generativity of new life but also the encounter with the death, finitude and fallibility of reproductive substances drives a widespread and intensified engagement with reproductive technologies.

Change in the Strategy of Embryo Selection with Time-Lapse System Implementation—Impact on Clinical Pregnancy Rates

Journal of Clinical Medicine ◽

10.3390/jcm10184111 ◽

2021 ◽

Vol 10 (18) ◽

pp. 4111

Author(s):

Lisa Boucret ◽

Léa Tramon ◽

Patrick Saulnier ◽

Véronique Ferré-L’Hôtellier ◽

Pierre-Emmanuel Bouet ◽

...

Keyword(s):

Conventional Method ◽

Time Lapse ◽

Pregnancy Rates ◽

Clinical Pregnancy ◽

Embryo Selection ◽

Control Group ◽

Morphological Evaluation ◽

Early Embryos ◽

Clinical Pregnancy Rates ◽

Day 3 Embryos

Time-lapse systems (TLS) and associated algorithms are interesting tools to improve embryo selection. This study aimed to evaluate how TLS and KIDScore™ algorithm changed our practices of embryo selection, as compared to a conventional morphological evaluation, and improved clinical pregnancy rates (CPR). In the study group (year 2020, n = 303 transfers), embryos were cultured in an EmbryoScope+ time-lapse incubator. A first team observed embryos conventionally once a day, while a second team selected the embryos for transfer based on time-lapse recordings. In the control group (year 2019, n = 279 transfers), embryos were selected using the conventional method, and CPR were recorded. In 2020, disagreement between TLS and the conventional method occurred in 32.1% of transfers, more often for early embryos (34.7%) than for blastocysts (20.5%). Irregular morphokinetic events (direct or reverse cleavage, multinucleation, abnormal pronuclei) were detected in 54.9% of the discordant embryos. When it was available, KIDScore™ was decreased for 73.2% of the deselected embryos. Discordant blastocysts mainly corresponded with a decrease in KIDScore™ (90.9%), whereas discordant Day 3 embryos resulted from a decreased KIDScore™ and/or an irregular morphokinetic event. CPR was significantly improved in the TLS group (2020), as compared to the conventional group (2019) (32.3% vs. 21.9%, p = 0.005), even after multivariate analysis. In conclusion, TLS is useful to highlight some embryo development abnormalities and identify embryos with the highest potential for pregnancy.
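
The headline comparison above (CPR of 32.3% in 303 transfers versus 21.9% in 279 transfers) can be checked roughly with a pooled two-proportion z-test. This is only a sketch. The abstract gives no pregnancy counts, so 98 and 61 are back-calculated from the percentages and are assumptions. The test the authors used is not stated, and the function name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# ASSUMED counts: 98/303 = 32.3% and 61/279 = 21.9% after rounding.
z, p = two_proportion_z(98, 303, 61, 279)
print(f"z={z:.2f}, p={p:.4f}")
```

With these assumed counts the unadjusted p-value is of the same order as the reported p = 0.005. It does not reproduce the multivariate analysis mentioned in the abstract.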

Experience of using time-lapse microscopy in IVF and ICSI programs

Medical alphabet ◽

10.33667/2078-5631-2020-16-47-50 ◽

2020 ◽

pp. 47-50

Author(s):

N. V. Saraeva ◽

N. V. Spiridonova ◽

M. T. Tugushev ◽

O. V. Shurygina ◽

A. I. Sinitsyna

Keyword(s):

Pregnancy Rate ◽

Embryo Transfer ◽

Selection Procedure ◽

Time Lapse ◽

Single Embryo Transfer ◽

Main Group ◽

Clinical Pregnancy ◽

Control Group ◽

Time Lapse Microscopy

To increase the pregnancy rate in assisted reproductive technology, selecting the single embryo with the highest implantation potential is very important. Time-lapse microscopy (TLM) is a tool for selecting quality embryos for transfer. This study aimed to assess the benefits of single-embryo transfer, using autologous oocytes, performed on day 5 of embryo incubation in a TLM-equipped system in IVF and ICSI programs. Single-embryo transfer following incubation in a TLM-equipped incubator was performed in 282 patients, who formed the main group; the control group consisted of 461 patients undergoing single-embryo transfer following a traditional culture and embryo selection procedure. We assessed the quality of transferred embryos and the rates of clinical pregnancy and delivery. The groups did not differ in the ratio of IVF and ICSI cycles, average age, or infertility factor. The proportion of excellent-quality embryos for transfer was 77.0% in the main group and 65.1% in the control group (p = 0.001). In the subgroup with eight or fewer oocytes retrieved, we noted a tendency towards more good-quality embryos in the main group (p = 0.052). In the subgroup with nine or more oocytes, the quality of the transferred embryos did not differ between the two groups. The clinical pregnancy rate was 60.2% in the main group and 52.9% in the control group (p = 0.057). The delivery rate was 45.0% in the main group and 39.9% in the control group (p > 0.050).

Embryo selection of euploid embryos by an automated time-lapse prediction is superior to conventional morphological analysis: a retrospective study.

10.26226/morressier.5912d9eed462b80292386547 ◽

2017 ◽

Author(s):

Eugenia Rocafort

Keyword(s):

Retrospective Study ◽

Morphological Analysis ◽

Time Lapse ◽

Embryo Selection

Risk prediction in multicentre studies when there is confounding by cluster or informative cluster size

BMC Medical Research Methodology ◽

10.1186/s12874-021-01321-x ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Menelaos Pavlou ◽

Gareth Ambler ◽

Rumana Z. Omar

Keyword(s):

Risk Prediction ◽

Cluster Size ◽

Linear Models ◽

Prediction Models ◽

Predictive Accuracy ◽

Clustered Data ◽

Predictor Variable ◽

Simulated Data ◽

Predictive Ability ◽

Informative Cluster Size

Abstract Background Clustered data arise in research when patients are clustered within larger units. Generalised Estimating Equations (GEE) and Generalised Linear Mixed Models (GLMM) can be used to provide marginal and cluster-specific inference and predictions, respectively. Methods Confounding by Cluster (CBC) and Informative cluster size (ICS) are two complications that may arise when modelling clustered data. CBC can arise when the distribution of a predictor variable (termed ‘exposure’) varies between clusters, causing confounding of the exposure-outcome relationship. ICS means that the cluster size conditional on covariates is not independent of the outcome. In both situations, standard GEE and GLMM may provide biased or misleading inference, and modifications have been proposed. However, both CBC and ICS are routinely overlooked in the context of risk prediction, and their impact on the predictive ability of the models has been little explored. We study the effect of CBC and ICS on the predictive ability of risk models for binary outcomes when GEE and GLMM are used. We examine whether two simple approaches to handle CBC and ICS, which involve adjusting for the cluster mean of the exposure and the cluster size, respectively, can improve the accuracy of predictions. Results Both CBC and ICS can be viewed as violations of the assumptions in the standard GLMM; the random effects are correlated with exposure for CBC and cluster size for ICS. Based on these principles, we simulated data subject to CBC/ICS. The simulation studies suggested that the predictive ability of models derived from using standard GLMM and GEE ignoring CBC/ICS was affected. Marginal predictions were found to be mis-calibrated. Adjusting for the cluster-mean of the exposure or the cluster size improved calibration, discrimination and the overall predictive accuracy of marginal predictions, by explaining part of the between cluster variability.
The presence of CBC/ICS did not affect the accuracy of conditional predictions. We illustrate these concepts using real data from a multicentre study with potential CBC. Conclusion Ignoring CBC and ICS when developing prediction models for clustered data can affect the accuracy of marginal predictions. Adjusting for the cluster mean of the exposure or the cluster size can improve the predictive accuracy of marginal predictions.
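
The two adjustments discussed above amount to adding two derived covariates to each record before model fitting: the cluster mean of the exposure (for CBC) and the cluster size (for ICS). The following is a minimal data-preparation sketch. The record layout, the field names and the function name `add_cluster_covariates` are assumptions, and fitting the GEE or GLMM itself is out of scope here.

```python
from collections import defaultdict


def add_cluster_covariates(records):
    """records: list of (cluster_id, exposure) tuples.

    Returns one dict per record with the cluster mean of the exposure and
    the cluster size appended as extra covariates.
    """
    by_cluster = defaultdict(list)
    for cluster_id, exposure in records:
        by_cluster[cluster_id].append(exposure)

    rows = []
    for cluster_id, exposure in records:
        values = by_cluster[cluster_id]
        rows.append({
            "cluster": cluster_id,
            "exposure": exposure,
            "cluster_mean_exposure": sum(values) / len(values),  # CBC adjustment
            "cluster_size": len(values),                         # ICS adjustment
        })
    return rows


# Hypothetical toy data: two clusters with different exposure distributions.
rows = add_cluster_covariates(
    [("A", 1.0), ("A", 3.0), ("B", 10.0), ("B", 12.0), ("B", 14.0)]
)
for row in rows:
    print(row)
```

The augmented rows would then be passed to whichever GEE or GLMM implementation is in use, with the two new columns entered as fixed-effect covariates.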

O-123 Calibration of artificial intelligence (AI) models is necessary to reflect actual implantation probabilities with image-based embryo selection

Human Reproduction ◽

10.1093/humrep/deab126.048 ◽

2021 ◽

Vol 36 (Supplement_1) ◽

Author(s):

M F Kragh ◽

J T Lassen ◽

J Rimestad ◽

J Berntsen

Keyword(s):

Clinical Practice ◽

Model Calibration ◽

Goodness Of Fit ◽

Age Groups ◽

Time Lapse ◽

Embryo Selection ◽

Age Group ◽

Representative Data ◽

Patient Demographics ◽

Implantation Rates

Abstract Study question Do AI models for embryo selection provide actual implantation probabilities that generalise across clinics and patient demographics? Summary answer AI models need to be calibrated on representative data before providing reasonable agreements between predicted scores and actual implantation probabilities. What is known already AI models have been shown to perform well at discriminating embryos according to implantation likelihood, measured by area under curve (AUC). However, discrimination performance does not relate to how models perform with regards to predicting actual implantation likelihood, especially across clinics and patient demographics. In general, prediction models must be calibrated on representative data to provide meaningful probabilities. Calibration can be evaluated and summarised by “expected calibration error” (ECE) on score deciles and tested for significant lack of calibration using Hosmer-Lemeshow goodness-of-fit. ECE describes the average deviation between predicted probabilities and observed implantation rates and is 0 for perfect calibration. Study design, size, duration Time-lapse embryo videos from 18 clinics were used to develop AI models for prediction of fetal heartbeat (FHB). Model generalisation was evaluated on clinic hold-out models for the three largest clinics. Calibration curves were used to evaluate the agreement between AI-predicted scores and observed FHB outcome and summarised by ECE. Models were evaluated 1) without calibration, 2) calibration (Platt scaling) on other clinics’ data, and 3) calibration on the clinic’s own data (30%/70% for calibration/evaluation). 
Participants/materials, setting, methods A previously described AI algorithm, iDAScore, based on 115,842 time-lapse sequences of embryos, including 14,644 transferred embryos with known implantation data (KID), was used as foundation for training hold-out AI models for the three largest clinics (n = 2,829;2,673;1,327 KID embryos), such that their data were not included during model training. ECEs across the three clinics (mean±SD) were compared for models with/without calibration using KID embryos only, both overall and within subgroups of patient age (<36,36-40,>40 years). Main results and the role of chance The AUC across the three clinics was 0.675±0.041 (mean±SD) and unaffected by calibration. Without calibration, overall ECE was 0.223±0.057, indicating weak agreements between scores and actual implantation rates. With calibration on other clinics’ data, overall ECE was 0.040±0.013, indicating considerable improvements with moderate clinical variation. As implantation probabilities are both affected by clinical practice and patient demographics, subgroup analysis was conducted on patient age (<36,36-40,>40 years). With calibration on other clinics’ data, age-group ECEs were (0.129±0.055 vs. 0.078±0.033 vs. 0.072±0.015). These calibration errors were thus larger than the overall average ECE of 0.040, indicating poor generalisation across age. Including age as input to the calibration, age-group ECEs were (0.088±0.042 vs. 0.075±0.046 vs. 0.051±0.025), indicating improved agreements between scores and implantation rates across both clinics and age groups. With calibration including age on the clinic’s own data, however, the best calibrations were obtained with ECEs (0.060±0.017 vs. 0.040±0.010 vs. 0.039±0.009). The results indicate that both clinical practice and patient demographics influence calibration and thus ideally should be adjusted for. 
Testing lack of calibration using Hosmer-Lemeshow goodness-of-fit, only one age-group from one clinic appeared miscalibrated (P = 0.02), whereas all other age-groups from the three clinics were appropriately calibrated (P > 0.10). Limitations, reasons for caution In this study, AI model calibration was conducted based on clinic and age. Other patient metadata such as BMI and patient diagnosis may be relevant to calibrate as well. However, for both calibration and evaluation on the clinic’s own data, a substantial amount of data for each subgroup is needed. Wider implications of the findings With calibrated scores, AI models can predict actual implantation likelihood for each embryo. Probability estimates are a strong tool for patient communication and clinical decisions such as deciding when to discard/freeze embryos. Model calibration may thus be the next step in improving clinical outcome and shortening time to live birth. Trial registration number This work is partly funded by the Innovation Fund Denmark (IFD) under File No. 7039-00068B and partly funded by Vitrolife A/S
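
The two ideas at the core of this abstract, expected calibration error (ECE) over score bins and Platt scaling, can be sketched in plain Python. Everything here is an illustrative assumption: the data are a hypothetical toy set with only two distinct scores (so two bins are used rather than deciles), the function names are invented, and the Platt fit is a plain gradient-descent logistic regression without the target smoothing of the original method. It is not the iDAScore pipeline.

```python
import math


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean |mean predicted probability - observed rate| over
    equal-count bins of the sorted scores."""
    pairs = sorted(zip(probs, outcomes), key=lambda t: t[0])
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        ece += len(chunk) / n * abs(mean_pred - obs_rate)
    return ece


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_platt(scores, outcomes, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, outcomes):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical overconfident model: predicts 0.9 where the observed rate is 0.6
# and 0.1 where the observed rate is 0.3.
raw_probs = [0.9] * 10 + [0.1] * 10
outcomes = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7
ece_before = expected_calibration_error(raw_probs, outcomes, n_bins=2)

# Platt scaling is fitted on the raw scores (here, the logits of the raw outputs).
logits = [math.log(p / (1 - p)) for p in raw_probs]
a, b = fit_platt(logits, outcomes)
calibrated = [sigmoid(a * z + b) for z in logits]
ece_after = expected_calibration_error(calibrated, outcomes, n_bins=2)

print(f"ECE before={ece_before:.3f}, after={ece_after:.3f}")
```

Calibration rescales the scores monotonically, so the ranking of embryos (and hence the AUC) is unchanged while the ECE drops, which mirrors the pattern the abstract reports. In this toy case the calibration and evaluation data are the same, unlike the 30%/70% split used in the study.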
