Machine Learning Methods to Identify Missed Cases of Bladder Cancer in Population-Based Registries

JCO Clinical Cancer Informatics, 2021, pp. 641-653
DOI: 10.1200/cci.20.00170

Author(s): Anne-Michelle Noone, Clara J. K. Lam, Angela B. Smith, Matthew E. Nielsen, Eric Boyd, ...

Keyword(s): United States, Machine Learning, Bladder Cancer, Cancer Incidence, Cancer Registries, The United States, Population Based, Learning Methods, Machine Learning Methods, Classification And Regression

PURPOSE Population-based cancer incidence rates of bladder cancer may be underestimated. Accurate estimates are needed for understanding the burden of bladder cancer in the United States. We developed and evaluated the feasibility of a machine learning–based classifier to identify bladder cancer cases missed by cancer registries, and estimated the rate of bladder cancer cases potentially missed. METHODS Data were from a population-based cohort of 37,940 bladder cancer cases 65 years of age and older in the SEER cancer registries linked with Medicare claims (2007-2013). Cases with other urologic cancers, abdominal cancers, and unrelated cancers were included as control groups. A cohort of cancer-free controls was also selected using the Medicare 5% random sample. We used five supervised machine learning methods for predicting bladder cancer: classification and regression trees, random forest, logic regression, support vector machines, and logistic regression. RESULTS Registry linkages yielded 37,940 bladder cancer cases and 766,303 cancer-free controls. Using health insurance claims, classification and regression trees distinguished bladder cancer cases from noncancer controls with very high accuracy (95%). Bacille Calmette-Guerin, cystectomy, and mitomycin were the most important predictors for identifying bladder cancer. From 2007 to 2013, we estimated that up to 3,300 bladder cancer cases in the United States may have been missed by the SEER registries. This would result in an average 3.5% increase in the reported incidence rate. CONCLUSION SEER cancer registries may miss bladder cancer cases during routine reporting. These missed cases can be identified by leveraging Medicare claims and data analytics, leading to more accurate estimates of bladder cancer incidence.
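
As a rough illustration of how a classification tree might choose its first split from claims-derived indicators such as the predictors named above, the following minimal Python sketch scores binary features by weighted Gini impurity. It is not the authors' pipeline: the function names, the toy records, and the labels are all hypothetical, and a full classification and regression tree would also recurse on each branch, prune, and handle non-binary features.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_binary_split(rows, labels, features):
    """Return (feature, weighted_gini) for the 0/1 feature with the purest split."""
    n = len(labels)
    best = None
    for f in features:
        left = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if best is None or score < best[1]:
            best = (f, score)
    return best


# Hypothetical toy data: 1 = claim of that type present, 0 = absent.
rows = [
    {"bcg": 1, "cystectomy": 0, "mitomycin": 0},
    {"bcg": 1, "cystectomy": 1, "mitomycin": 0},
    {"bcg": 0, "cystectomy": 1, "mitomycin": 1},
    {"bcg": 1, "cystectomy": 0, "mitomycin": 1},
    {"bcg": 0, "cystectomy": 0, "mitomycin": 0},
    {"bcg": 0, "cystectomy": 0, "mitomycin": 0},
    {"bcg": 0, "cystectomy": 0, "mitomycin": 0},
    {"bcg": 0, "cystectomy": 0, "mitomycin": 1},
]
# Hypothetical labels: 1 = bladder cancer case, 0 = cancer-free control.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

best = best_binary_split(rows, labels, ["bcg", "cystectomy", "mitomycin"])
print(best)
```

On this toy data the split with the lowest weighted impurity is the `bcg` indicator; that outcome is a property of the invented records only and says nothing about the study's actual feature ranking.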

Bladder Cancer Stage Development, 2004-2014 in Europe Compared With the United States: Analysis of European Population-based Cancer Registries, the United States SEER Database, and a Large Tertiary Institutional Cohort

Clinical Genitourinary Cancer, 2020, Vol 18 (3), pp. 162-170.e4
DOI: 10.1016/j.clgc.2019.10.008
Cited By ~ 2

Author(s): Gerald B. Schulz, Tobias Grimm, Alexander Buchner, Friedrich Jokisch, Alexander Kretschmer, ...

Keyword(s): United States, Bladder Cancer, Cancer Registries, The United States, Population Based, European Population, Cancer Stage, Seer Database, Stage Development

The #MeToo Movement in the United States: Text Analysis of Early Twitter Conversations

Journal of Medical Internet Research, 2019, Vol 21 (9), pp. e13837
DOI: 10.2196/13837
Cited By ~ 2

Author(s): Sepideh Modrek, Bozhidar Chakalov

Keyword(s): United States, Machine Learning, Sexual Assault, Sexual Harassment, Early Life, English Language, Life Experiences, The United States, Learning Methods, Machine Learning Methods

Background The #MeToo movement sparked an international debate on sexual harassment, abuse, and assault and has taken many directions since its inception in October of 2017. Much of the early conversation took place on public social media sites such as Twitter, where the hashtag movement began. Objective The aim of this study is to document, characterize, and quantify early public discourse and conversation of the #MeToo movement from Twitter data in the United States. We focus on posts with public first-person revelations of sexual assault/abuse and early life experiences of such events. Methods We purchased full tweets and associated metadata from the Twitter Premium application programming interface between October 14 and 21, 2017 (ie, the first week of the movement). We examined the content of novel English language tweets with the phrase “MeToo” from within the United States (N=11,935). We used machine learning methods, including least absolute shrinkage and selection operator regression and support vector machine models, to summarize and classify the content of individual tweets with revelations of sexual assault and abuse and early life experiences of sexual assault and abuse. Results We found that the most predictive words created a vivid archetype of the revelations of sexual assault and abuse. We then estimated that in the first week of the movement, 11% of novel English language tweets with the words “MeToo” revealed details about the poster’s experience of sexual assault or abuse and 5.8% revealed early life experiences of such events. We examined the demographic composition of posters of sexual assault and abuse and found that white women aged 25-50 years were overrepresented relative to their representation on Twitter. Furthermore, we found that the mass sharing of personal experiences of sexual assault and abuse had a large reach: 6 to 34 million Twitter users may have seen such first-person revelations from someone they followed in the first week of the movement. Conclusions These data illustrate that the revelations shared went beyond acknowledgement of having experienced sexual harassment and often included vivid and traumatic descriptions of early life experiences of assault and abuse. These findings and methods underscore the value of content analysis, supported by novel machine learning methods, to improve our understanding of how widespread the revelations were, which likely amplified the spread and saliency of the #MeToo movement.

The #MeToo Movement in the United States: Text Analysis of Early Twitter Conversations (Preprint)

2019
DOI: 10.2196/preprints.13837

Author(s): Sepideh Modrek, Bozhidar Chakalov

Keyword(s): United States, Machine Learning, Sexual Assault, Sexual Harassment, Early Life, English Language, Life Experiences, The United States, Learning Methods, Machine Learning Methods

Exploring the Relationship Between Chlorophyll-a and Other Water Quality Parameters by Using Machine Learning Methods: A Case Study of Lake Erie

2021
DOI: 10.5194/egusphere-egu21-14933

Author(s): Xue Hu, Jinhui Jeanne Huang, Yu Li

Keyword(s): Neural Network, United States, Machine Learning, Water Quality, Chlorophyll A, Lake Erie, The United States, Learning Methods, Machine Learning Methods, Input Variables

Chlorophyll a (CHLA) is a key water quality indicator for the eutrophication of Lake Erie. To better predict the concentration of CHLA, this study divided Lake Erie into United States and Canadian parts along the national boundary and identified the input variables most relevant to CHLA in each part. The most relevant variable was total phosphorus (TP) for the United States part and total nitrogen (TN) for the Canadian part, and the analysis attributes the excessive TP and TN content to industrial and agricultural pollution around Lake Erie. The study used machine learning methods to model the water quality of the two parts separately. The data used in the modelling were obtained from the Canadian Environment and Climate Change Agency for Lake Erie between 2000 and 2018. Several neural network (NN) models and other machine learning methods were used for data analysis, including standard neural network (NN) models, simple recurrent neural network (SRN) models, backpropagation neural network (BPNN) models, a jump connections neural network (JCNN) model, random forest (RF), and support vector machine (SVM). The most suitable combinations of input variables for CHLA prediction were also identified: TP, TN, DO, and T for the United States part, and TP, TN, pH, and DO for the Canadian part. Combining this result with the environmental protection policies of the United States and Canada, recommendations for reducing the pollutant content of Lake Erie were proposed. This will help reduce the risk of eutrophication in Lake Erie.

Machine Learning Models of COVID-19 Cases in the United States: A Study of Initial Lockdown and Reopen Regimes

Applied Sciences, 2021, Vol 11 (23), pp. 11227
DOI: 10.3390/app112311227

Author(s): Arnold Kamis, Yudan Ding, Zhenzhen Qu, Chenchen Zhang

Keyword(s): United States, Machine Learning, Additive Model, Regression Tree, Predictor Variable, The United States, Predictor Variables, Future Research, Machine Learning Methods, Variance Explained

The purpose of this paper is to model the cases of COVID-19 in the United States from 13 March 2020 to 31 May 2020. Our novel contribution is that we have obtained highly accurate models focused on two different regimes, lockdown and reopen, modeling each regime separately. The predictor variables include aggregated individual movement as well as state population density, health rank, climate temperature, and political color. We apply a variety of machine learning methods to each regime: Multiple Regression, Ridge Regression, Elastic Net Regression, Generalized Additive Model, Gradient Boosted Machine, Regression Tree, Neural Network, and Random Forest. We discover that Gradient Boosted Machines are the most accurate in both regimes. The best models achieve a variance explained of 95.2% in the lockdown regime and 99.2% in the reopen regime. We describe the influence of the predictor variables as they change from regime to regime. Notably, we identify individual person movement, as tracked by GPS data, to be an important predictor variable. We conclude that government lockdowns are an extremely important de-densification strategy. Implications and questions for future research are discussed.
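
For readers unfamiliar with the metric, the sketch below computes variance explained on the assumption that the term refers to the coefficient of determination, R² = 1 − SS_res / SS_tot. The abstract does not state the exact definition used, so this is an illustration only; the function name and the toy numbers are hypothetical.

```python
def variance_explained(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted case counts, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

r2 = variance_explained(observed, predicted)
print(f"variance explained: {r2:.1%}")
```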

PD57-03 ASSESSMENT OF THE ECOLOGICAL ASSOCIATION BETWEEN TOBACCO SMOKING EXPOSURE AND BLADDER CANCER INCIDENCE OVER THE PAST HALF-CENTURY IN THE UNITED STATES

The Journal of Urology, 2017, Vol 197 (4S)
DOI: 10.1016/j.juro.2017.02.2605

Author(s): Thomas Seisen, Stuart R. Lipsitz, Joaquim Bellmunt, Mani Menon, Nicolas von Landenberg, ...

Keyword(s): United States, Bladder Cancer, Cancer Incidence, Tobacco Smoking, The United States, The Past, Past Half Century, Smoking Exposure, Ecological Association, Past Half

Impact of screening on cervical cancer incidence: A population‐based case–control study in the United States

International Journal of Cancer, 2019, Vol 147 (3), pp. 887-896
DOI: 10.1002/ijc.32826

Author(s): Rebecca Landy, Peter D. Sasieni, Christopher Mathews, Charles L. Wiggins, Michael Robertson, ...

Keyword(s): United States, Cervical Cancer, Cancer Incidence, Case Control Study, The United States, Population Based, Case Control, Cervical Cancer Incidence, Control Study

Proximity to oil refineries and risk of bladder cancer: A population-based analysis.

Journal of Clinical Oncology, 2019, Vol 37 (15_suppl), pp. 4538-4538
DOI: 10.1200/jco.2019.37.15_suppl.4538

Author(s): Tamer Dafashy, Daniel Phillips, Mohamed Danny Ray-Zack, Preston Kerr, Yong Shan, ...

Keyword(s): Bladder Cancer, Cancer Incidence, Census Data, The United States, Population Based, Incidence Rates, Oil Refinery, Oil Refineries, Background Exposure, Cancer Incidence Rates

Background: Exposure to aromatic amines is a risk factor for bladder cancer. Incidence rates according to proximity to oil refineries are largely unknown. We sought to determine the association between proximity to oil refineries and bladder cancer incidence in the State of Texas, which is home to the largest number of oil refineries in the United States. Methods: We used the Texas Cancer Registry database to identify patients diagnosed with bladder cancer from January 1, 2001 to December 31, 2014. U.S. census data from 2010 were used to ascertain overall population size and age and sex distributions. Heat maps of the 28 active oil refineries in Texas were developed. Incidence of bladder cancer was compared according to proximity (< 10 vs. ≥ 10 miles) to an oil refinery. Risk ratios were adjusted using a Poisson regression model. Results: A total of 45,517 incident bladder cancer cases were identified, of which 5,501 cases were within 10 miles of an oil refinery. In adjusted analyses, bladder cancer risk was significantly greater among males vs. females (Relative Risk (RR) 3.41, 95% Confidence Interval (CI), 3.33-3.50), and greater among people living within 10 miles of an oil refinery than among those living outside a 10-mile radius of an oil refinery (RR 1.19, 95% CI, 1.08-1.31). Conclusions: People living within 10 miles of oil refineries were at greater risk for bladder cancer. Further research into exposure to oil refineries and bladder cancer incidence is warranted.
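
To make the reported quantities concrete, the following sketch computes an unadjusted risk ratio with an approximate 95% confidence interval on the log scale. It is a simplification: the ratios in the abstract were adjusted with a Poisson regression model, which this code does not reproduce. The function name and all counts are hypothetical and were not taken from the study.

```python
import math


def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed, z=1.96):
    """Unadjusted risk ratio with a Wald-type confidence interval on the log scale."""
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    # Standard error of log(RR) for two independent binomial proportions.
    se_log = math.sqrt(
        1 / cases_exposed - 1 / n_exposed + 1 / cases_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts: 30 cases among 1,000 people living near a refinery,
# 25 cases among 1,000 people living farther away.
rr, lower, upper = risk_ratio(30, 1000, 25, 1000)
print(f"RR {rr:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

With counts this small the interval spans 1, so the toy result would not be statistically significant; the study's much larger sample is what makes a ratio of similar size distinguishable from no effect.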

Epidemiology of myelodysplastic syndromes and chronic myeloproliferative disorders in the United States, 2001-2004, using data from the NAACCR and SEER programs

Blood, 2008, Vol 112 (1), pp. 45-52
DOI: 10.1182/blood-2008-01-134858
Cited By ~ 402

Author(s): Dana E. Rollison, Nadia Howlader, Martyn T. Smith, Sara S. Strom, William D. Merritt, ...

Keyword(s): United States, Myelodysplastic Syndromes, Survival Rates, Myeloproliferative Disorders, Cancer Registries, The United States, Population Based, Incidence Rates, Chronic Myeloproliferative Disorders, The North

Reporting of myelodysplastic syndromes (MDSs) and chronic myeloproliferative disorders (CMDs) to population-based cancer registries in the United States was initiated in 2001. In this first analysis of data from the North American Association of Central Cancer Registries (NAACCR), encompassing 82% of the US population, we evaluated trends in MDS and CMD incidence, estimated case numbers for the entire United States, and assessed trends in diagnostic recognition and reporting. Based on more than 40 000 observations, average annual age-adjusted incidence rates of MDS and CMD for 2001 through 2003 were 3.3 and 2.1 per 100 000, respectively. Incidence rates increased with age for both MDS and CMD (P < .05) and were highest among whites and non-Hispanics. Based on follow-up data through 2004 from the Surveillance, Epidemiology, and End Results (SEER) Program, overall relative 3-year survival rates for MDS and CMD were 45% and 80%, respectively, with males experiencing poorer survival than females. Applying the observed age-specific incidence rates to US Census population estimates, approximately 9700 patients with MDS and 6300 patients with CMD were estimated for the entire United States in 2004. MDS incidence rates significantly increased with calendar year in 2001 through 2004, and only 4% of patients were reported to registries by physicians' offices. Thus, MDS disease burden in the United States may be underestimated.
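
The national case estimate described above comes from applying age-specific incidence rates to population counts. A minimal sketch of that arithmetic follows; the age bands, rates, and population figures are hypothetical placeholders chosen for easy checking, not values from the study.

```python
def expected_cases(rates_per_100k, population):
    """Sum over age groups of (rate per 100,000) x (population in that group)."""
    return sum(
        rates_per_100k[group] / 100_000 * population[group]
        for group in rates_per_100k
    )


# Hypothetical age-specific incidence rates (per 100,000 per year).
rates = {"<60": 1.0, "60-79": 10.0, "80+": 30.0}
# Hypothetical population counts for the same age groups.
pop = {"<60": 2_000_000, "60-79": 500_000, "80+": 100_000}

total = expected_cases(rates, pop)
print(round(total))
```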

Nonfilter and filter cigarette consumption and the incidence of lung cancer by histological type in Japan and the United States: Analysis of 30-year data from population-based cancer registries

International Journal of Cancer, 2010, Vol 128 (8), pp. 1918-1928
DOI: 10.1002/ijc.25531
Cited By ~ 30

Author(s): Hidemi Ito, Keitaro Matsuo, Hideo Tanaka, Devin C. Koestler, Hernando Ombao, ...

Keyword(s): United States, Lung Cancer, Cancer Registries, The United States, Population Based, Histological Type, Cigarette Consumption
