Machine Learning Methods to Identify Missed Cases of Bladder Cancer in Population-Based Registries

2021 ◽  
pp. 641-653
Author(s):  
Anne-Michelle Noone ◽  
Clara J. K. Lam ◽  
Angela B. Smith ◽  
Matthew E. Nielsen ◽  
Eric Boyd ◽  
...  

PURPOSE Population-based cancer incidence rates of bladder cancer may be underestimated. Accurate estimates are needed for understanding the burden of bladder cancer in the United States. We developed and evaluated the feasibility of a machine learning–based classifier to identify bladder cancer cases missed by cancer registries, and estimated the rate of bladder cancer cases potentially missed. METHODS Data were from a population-based cohort of 37,940 bladder cancer cases 65 years of age and older in the SEER cancer registries linked with Medicare claims (2007-2013). Cases with other urologic cancers, abdominal cancers, and unrelated cancers were included as control groups. A cohort of cancer-free controls was also selected using the Medicare 5% random sample. We used five supervised machine learning methods for predicting bladder cancer: classification and regression trees, random forest, logic regression, support vector machines, and logistic regression. RESULTS Registry linkages yielded 37,940 bladder cancer cases and 766,303 cancer-free controls. Using health insurance claims, classification and regression trees distinguished bladder cancer cases from noncancer controls with very high accuracy (95%). Bacille Calmette-Guerin, cystectomy, and mitomycin were the most important predictors for identifying bladder cancer. From 2007 to 2013, we estimated that up to 3,300 bladder cancer cases in the United States may have been missed by the SEER registries. This would result in an average increase of 3.5% in the reported incidence rate. CONCLUSION SEER cancer registries may potentially miss bladder cancer cases during routine reporting. These missed cases can be identified by leveraging Medicare claims and data analytics, leading to more accurate estimates of bladder cancer incidence.
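The abstract reports overall accuracy for the classifier. With a cohort this imbalanced, sensitivity and specificity are usually worth computing alongside accuracy. The sketch below is a minimal illustration using only the cohort sizes quoted above; the function name and the "label everyone a control" baseline are assumptions made for illustration and are not part of the study, and the abstract does not state how its evaluation sample was composed.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Cohort sizes quoted in the abstract.
CASES = 37_940
CONTROLS = 766_303

# Hypothetical baseline (an assumption, not the study's model):
# a rule that labels every person as a cancer-free control.
baseline = confusion_metrics(tp=0, fp=0, tn=CONTROLS, fn=CASES)

for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

Comparing a fitted model against a baseline of this kind shows how much of its accuracy comes from the predictors rather than from the class balance.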

10.2196/13837 ◽  
2019 ◽  
Vol 21 (9) ◽  
pp. e13837 ◽  
Author(s):  
Sepideh Modrek ◽  
Bozhidar Chakalov

Background The #MeToo movement sparked an international debate on sexual harassment, abuse, and assault and has taken many directions since its inception in October of 2017. Much of the early conversation took place on public social media sites such as Twitter, where the hashtag movement began. Objective The aim of this study is to document, characterize, and quantify early public discourse and conversation about the #MeToo movement from Twitter data in the United States. We focus on posts with public first-person revelations of sexual assault/abuse and early life experiences of such events. Methods We purchased full tweets and associated metadata from the Twitter Premium application programming interface between October 14 and 21, 2017 (ie, the first week of the movement). We examined the content of novel English language tweets with the phrase “MeToo” from within the United States (N=11,935). We used machine learning methods, least absolute shrinkage and selection operator regression, and support vector machine models to summarize and classify the content of individual tweets with revelations of sexual assault and abuse and early life experiences of sexual assault and abuse. Results We found that the most predictive words created a vivid archetype of the revelations of sexual assault and abuse. We then estimated that in the first week of the movement, 11% of novel English language tweets with the phrase “MeToo” revealed details about the poster’s experience of sexual assault or abuse and 5.8% revealed early life experiences of such events. We examined the demographic composition of posters of sexual assault and abuse and found that white women aged 25-50 years were overrepresented relative to their representation on Twitter. Furthermore, we found that the mass sharing of personal experiences of sexual assault and abuse had a large reach, where 6 to 34 million Twitter users may have seen such first-person revelations from someone they followed in the first week of the movement. Conclusions These data illustrate that revelations shared went beyond acknowledgement of having experienced sexual harassment and often included vivid and traumatic descriptions of early life experiences of assault and abuse. These findings and methods underscore the value of content analysis, supported by novel machine learning methods, to improve our understanding of how widespread the revelations were, which likely amplified the spread and saliency of the #MeToo movement.
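The abstract names least absolute shrinkage and selection operator (LASSO) regression as one way the most predictive words were found. As a rough sketch of how LASSO selects features, the code below runs coordinate descent with soft-thresholding on a tiny word-indicator matrix. Everything here is an assumption made for illustration: the two-word vocabulary, the four toy rows, the penalty value, and the use of squared-error loss without an intercept. It is not the authors' pipeline.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # mean squared value of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w


# Hypothetical toy data: columns are indicators for two made-up vocabulary
# words; y is 1 when a (made-up) tweet was hand-labelled as a revelation.
vocabulary = ["word_a", "word_b"]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]

weights = lasso_coordinate_descent(X, y, lam=0.1)
for word, weight in zip(vocabulary, weights):
    print(word, round(weight, 3))
```

In this toy setup the first word tracks the label and keeps a nonzero weight, while the uninformative second word is shrunk to exactly zero, which is the behaviour that makes LASSO useful for picking out predictive terms.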


2021 ◽  
Author(s):  
Xue Hu ◽  
Jinhui Jeanne Huang ◽  
Yu Li

Chlorophyll a (CHLA) is a key water quality indicator for the eutrophication of Lake Erie. To better predict CHLA concentration, this study divided Lake Erie into United States and Canadian portions along the national boundary and identified the input variables most relevant to CHLA in each portion. The most relevant variable was total phosphorus (TP) for the United States portion and total nitrogen (TN) for the Canadian portion, and the analysis attributes the excessive TP and TN content to industrial and agricultural pollution around Lake Erie. Machine learning methods were used to model the water quality of the two portions separately. The data used in the modelling were obtained from the Canadian Environment and Climate Change Agency for Lake Erie between 2000 and 2018. Several neural network (NN) models and other machine learning methods were used for data analysis, including standard NN models, simple recurrent neural network (SRN) models, backpropagation neural network (BPNN) models, a jump connections neural network (JCNN) model, random forest (RF), and support vector machine (SVM). The most suitable combinations of input variables for CHLA prediction were also found: TP, TN, DO, and T for the United States portion, and TP, TN, pH, and DO for the Canadian portion. Combining this result with the environmental protection policies of the United States and Canada, recommendations for reducing pollutant levels in Lake Erie were proposed. This will help reduce the risk of eutrophication in Lake Erie.


2021 ◽  
Vol 11 (23) ◽  
pp. 11227
Author(s):  
Arnold Kamis ◽  
Yudan Ding ◽  
Zhenzhen Qu ◽  
Chenchen Zhang

The purpose of this paper is to model the cases of COVID-19 in the United States from 13 March 2020 to 31 May 2020. Our novel contribution is that we have obtained highly accurate models focused on two different regimes, lockdown and reopen, modeling each regime separately. The predictor variables include aggregated individual movement as well as state population density, health rank, climate temperature, and political color. We apply a variety of machine learning methods to each regime: Multiple Regression, Ridge Regression, Elastic Net Regression, Generalized Additive Model, Gradient Boosted Machine, Regression Tree, Neural Network, and Random Forest. We discover that Gradient Boosted Machines are the most accurate in both regimes. The best models achieve a variance explained of 95.2% in the lockdown regime and 99.2% in the reopen regime. We describe the influence of the predictor variables as they change from regime to regime. Notably, we identify individual person movement, as tracked by GPS data, to be an important predictor variable. We conclude that government lockdowns are an extremely important de-densification strategy. Implications and questions for future research are discussed.
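The accuracy figures above are stated as variance explained. As a minimal sketch of how that quantity (the coefficient of determination, R²) is computed from observed and predicted values, the code below uses made-up numbers; the function name and the toy case counts are assumptions for illustration and do not come from the paper.

```python
def r_squared(observed, predicted):
    """Variance explained: 1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed and model-predicted case counts (not data from the paper).
observed_cases = [10, 20, 30, 40]
predicted_cases = [12, 18, 33, 39]

r2 = r_squared(observed_cases, predicted_cases)
print(f"variance explained: {r2:.1%}")
```

Whether the reported percentages were measured on held-out data or on the training data is not stated in the abstract, and the two can differ substantially for flexible models such as gradient boosted machines.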


2019 ◽  
Vol 147 (3) ◽  
pp. 887-896
Author(s):  
Rebecca Landy ◽  
Peter D. Sasieni ◽  
Christopher Mathews ◽  
Charles L. Wiggins ◽  
Michael Robertson ◽  
...  

2019 ◽  
Vol 37 (15_suppl) ◽  
pp. 4538-4538
Author(s):  
Tamer Dafashy ◽  
Daniel Phillips ◽  
Mohamed Danny Ray-Zack ◽  
Preston Kerr ◽  
Yong Shan ◽  
...  

4538 Background: Exposure to aromatic amines is a risk factor for bladder cancer. Incidence rates according to proximity to oil refineries are largely unknown. We sought to determine the association between proximity to oil refineries and bladder cancer incidence in the State of Texas, which is home to the largest number of oil refineries in the United States. Methods: We used the Texas Cancer Registry database to identify patients diagnosed with bladder cancer from January 1, 2001 to December 31, 2014. U.S. census data from 2010 were used to ascertain overall population size and age and sex distributions. Heat maps of the 28 active oil refineries in Texas were developed. Incidence of bladder cancer was compared according to proximity (<10 vs. ≥10 miles) to an oil refinery. Risk ratios were adjusted using a Poisson regression model. Results: A total of 45,517 incident bladder cancer cases were identified, of which 5,501 were within 10 miles of an oil refinery. In adjusted analyses, bladder cancer risk was significantly greater among males vs. females (Relative Risk (RR) 3.41, 95% Confidence Interval (CI), 3.33-3.50), and greater among people living within 10 miles of an oil refinery than among those living outside a 10-mile radius (RR 1.19, 95% CI, 1.08-1.31). Conclusions: People living within 10 miles of oil refineries were at greater risk for bladder cancer. Further research into exposure to oil refineries and bladder cancer incidence is warranted.
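The relative risks above come from an adjusted Poisson regression, which cannot be reproduced from the abstract alone because the population denominators are not given. As a simpler, unadjusted sketch of what a risk ratio and its 95% confidence interval represent, the code below computes a crude RR with the standard log-scale interval. The counts and population sizes are hypothetical, and the function name is an assumption for illustration.

```python
import math


def crude_risk_ratio(cases_exposed, pop_exposed, cases_unexposed, pop_unexposed, z=1.96):
    """Crude (unadjusted) risk ratio with a log-scale Wald confidence interval."""
    risk_exposed = cases_exposed / pop_exposed
    risk_unexposed = cases_unexposed / pop_unexposed
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for cumulative-incidence data.
    se_log_rr = math.sqrt(
        1 / cases_exposed - 1 / pop_exposed
        + 1 / cases_unexposed - 1 / pop_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts (NOT from the Texas Cancer Registry):
# 120 cases among 100,000 people living near a refinery,
# 100 cases among 100,000 people living farther away.
rr, ci_lower, ci_upper = crude_risk_ratio(120, 100_000, 100, 100_000)
print(f"RR = {rr:.2f}, 95% CI {ci_lower:.2f}-{ci_upper:.2f}")
```

An interval that excludes 1, as the reported 1.08-1.31 does, indicates that the difference in risk between the two groups is statistically significant at the 5% level; an adjusted model additionally accounts for covariates such as age and sex.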


Blood ◽  
2008 ◽  
Vol 112 (1) ◽  
pp. 45-52 ◽  
Author(s):  
Dana E. Rollison ◽  
Nadia Howlader ◽  
Martyn T. Smith ◽  
Sara S. Strom ◽  
William D. Merritt ◽  
...  

Abstract Reporting of myelodysplastic syndromes (MDSs) and chronic myeloproliferative disorders (CMDs) to population-based cancer registries in the United States was initiated in 2001. In this first analysis of data from the North American Association of Central Cancer Registries (NAACCR), encompassing 82% of the US population, we evaluated trends in MDS and CMD incidence, estimated case numbers for the entire United States, and assessed trends in diagnostic recognition and reporting. Based on more than 40 000 observations, average annual age-adjusted incidence rates of MDS and CMD for 2001 through 2003 were 3.3 and 2.1 per 100 000, respectively. Incidence rates increased with age for both MDS and CMD (P < .05) and were highest among whites and non-Hispanics. Based on follow-up data through 2004 from the Surveillance, Epidemiology, and End Results (SEER) Program, overall relative 3-year survival rates for MDS and CMD were 45% and 80%, respectively, with males experiencing poorer survival than females. Applying the observed age-specific incidence rates to US Census population estimates, approximately 9700 patients with MDS and 6300 patients with CMD were estimated for the entire United States in 2004. MDS incidence rates significantly increased with calendar year in 2001 through 2004, and only 4% of patients were reported to registries by physicians' offices. Thus, MDS disease burden in the United States may be underestimated.
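The national case estimates above were obtained by applying observed age-specific incidence rates to US Census population estimates. A minimal sketch of that calculation follows. The age bands, rates, and population sizes are hypothetical placeholders chosen for easy arithmetic, not NAACCR or Census figures, and the function name is an assumption.

```python
def expected_cases(rates_per_100k, population):
    """Sum age-specific rate x population over age groups to get expected cases."""
    return sum(
        rates_per_100k[age_group] * population[age_group] / 100_000
        for age_group in rates_per_100k
    )


# Hypothetical age-specific incidence rates per 100,000 person-years.
rates_per_100k = {"<60": 0.5, "60-79": 10.0, "80+": 40.0}

# Hypothetical population estimates for the same age groups.
population = {"<60": 200_000_000, "60-79": 40_000_000, "80+": 10_000_000}

total_cases = expected_cases(rates_per_100k, population)
print(f"expected cases: {total_cases:,.0f}")
```

Because rates rise steeply with age, the older age bands contribute most of the expected cases even though they make up a small share of the population, which is why age-specific rather than overall rates are applied.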

