KazNewsDataset: Single Country Overall Digital Mass Media Publication Corpus

Data ◽  
2021 ◽  
Vol 6 (3) ◽  
pp. 31
Author(s):  
Kirill Yakunin ◽  
Maksat Kalimoldayev ◽  
Ravil I. Mukhamediev ◽  
Rustam Mussabayev ◽  
Vladimir Barakhnin ◽  
...  

Mass media is one of the most important elements influencing the information environment of society. The mass media is not only a source of information about what is happening but is often the authority that shapes the information agenda and the boundaries and forms of discussion on socially relevant topics. A multifaceted and, where possible, quantitative assessment of mass media performance is crucial for understanding their objectivity, tone, thematic focus, and quality. The paper presents a corpus of Kazakhstan media, which contains over 4 million publications from 36 primary sources (each of which has at least 500 publications). The corpus also includes more than 2 million texts of Russian media for comparative analysis of the publication activity of the countries, as well as about 4000 sections of state policy documents. The paper briefly describes the natural language processing and multiple-criteria decision-making methods, which are the algorithmic basis of the text and mass media evaluation method, and describes the results of several research cases, such as identification of propaganda, assessment of the tone of publications, calculation of the level of socially relevant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating socially significant news, identifying texts with propagandistic content, and evaluating the sentiment of publications using the topic model of the text corpus, since area under the receiver operating characteristic curve (ROC AUC) values of 0.81, 0.73, and 0.93 were achieved on the abovementioned tasks. The described cases do not exhaust the possibilities of thematic, tonal, dynamic, and other analysis of the considered corpus of texts. The corpus will be interesting to researchers considering both multiple publications and mass media analysis, including comparative analysis and identification of common patterns inherent in the media of different countries.
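The ROC AUC figures quoted in this abstract can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 0/1 ground-truth values (1 = positive class).
    scores: classifier scores, higher meaning "more positive".
    A tie between a positive and a negative counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not taken from the corpus.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.6, 0.7, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 5 of 6 pairs are ranked correctly
```

The quadratic pair loop keeps the definition visible; for large test sets a rank-based formulation or a library routine would be the usual choice.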

Symmetry ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 1945
Author(s):  
Ravil I. Mukhamediev ◽  
Kirill Yakunin ◽  
Rustam Mussabayev ◽  
Timur Buldybayev ◽  
Yan Kuchin ◽  
...  

Mass media not only reflect the activities of state bodies but also shape the informational context, sentiment, depth, and significance level attributed to certain state initiatives and social events. Multilateral and quantitative (to the practicable extent) assessment of media activity is important for understanding their objectivity, role, focus, and, ultimately, the quality of the society’s “fourth power”. The paper proposes a method for evaluating the media in several modalities (topics, evaluation criteria/properties, classes), combining topic modeling of the text corpora and multiple-criteria decision making. The evaluation is based on an analysis of the corpora as follows: the conditional probability distribution of media by topics, properties, and classes is calculated after the formation of the topic model of the corpora. Several approaches are used to obtain weights that describe how each topic relates to each evaluation criterion/property and to each class described in the paper, including manual high-level labeling, a multi-corpora approach, and an automatic approach. The proposed multi-corpora approach suggests assessment of corpora topical asymmetry to obtain the weights describing each topic’s relationship to a certain criterion/property. These weights, combined with the topic model, can be applied to evaluate each document in the corpora according to each of the considered criteria and classes. The proposed method was applied to a corpus of 804,829 news publications from 40 Kazakhstani sources published from 1 January 2018 to 31 December 2019, to classify negative information on socially significant topics. A BigARTM model was derived (200 topics) and the proposed model was applied, including to fill a table of the analytic hierarchy process (AHP) and all of the necessary high-level labeling procedures. Experiments confirm the general possibility of evaluating the media using the topic model of the text corpora, because an area under the receiver operating characteristic curve (ROC AUC) score of 0.81 was achieved in the classification task, which is comparable with results obtained for the same task by applying the BERT (Bidirectional Encoder Representations from Transformers) model.
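One simple way to picture how per-topic weights and a topic model combine into a document-level evaluation is a weighted sum of the document's topic probabilities, averaged per media source. This is a sketch under that assumption; the abstract does not give the exact formula, and the function names, the "negativity" weights, and the sample documents below are all hypothetical.

```python
def document_score(topic_probs, topic_weights):
    """Weighted sum of one document's topic probabilities for one criterion."""
    if len(topic_probs) != len(topic_weights):
        raise ValueError("need exactly one weight per topic")
    return sum(p * w for p, w in zip(topic_probs, topic_weights))


def source_scores(doc_topic_probs, doc_sources, topic_weights):
    """Average document score for each media source."""
    totals, counts = {}, {}
    for probs, source in zip(doc_topic_probs, doc_sources):
        totals[source] = totals.get(source, 0.0) + document_score(probs, topic_weights)
        counts[source] = counts.get(source, 0) + 1
    return {source: totals[source] / counts[source] for source in totals}


# Hypothetical values: three topics and their weights for a "negativity" criterion.
negativity_weights = [1.0, 0.0, 0.5]
docs = [
    [0.5, 0.5, 0.0],  # score 0.5
    [0.0, 0.5, 0.5],  # score 0.25
    [0.0, 1.0, 0.0],  # score 0.0
]
sources = ["source_a", "source_a", "source_b"]
per_source = source_scores(docs, sources, negativity_weights)
```

In this toy setup the document scores could be thresholded to label individual publications, while `per_source` gives a coarse comparison between outlets.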


2021 ◽  
Vol 10 (7) ◽  
pp. 474
Author(s):  
Bingqing Wang ◽  
Bin Meng ◽  
Juan Wang ◽  
Siyu Chen ◽  
Jian Liu

Social media data contains information expressed in real time, including text and geographical location. As a new data source for crowd behavior research in the era of big data, it can reflect some aspects of the behavior of residents. In this study, a text classification model based on the BERT and Transformers framework was constructed and used to classify and extract more than 210,000 residents’ festival activities from 1.13 million items of Sina Weibo (Chinese “Twitter”) data collected from Beijing in 2019. On this basis, word frequency statistics, part-of-speech analysis, topic modeling, sentiment analysis, and other methods were used to perceive different types of festival activities and quantitatively analyze the spatial differences between different types of festivals. The results show that traditional culture significantly influences residents’ festivals, reflecting residents’ motivation to participate in festivals and how residents participate in festivals and express their emotions. There are apparent spatial differences among residents in participating in festival activities. The main festival activities are distributed in the central area within the Fifth Ring Road in Beijing. In contrast, expressions of feelings during festivals are mainly distributed outside the Fifth Ring Road in Beijing. The research integrates natural language processing technology, topic model analysis, spatial statistical analysis, and other technologies. It can also broaden the application field of social media data, especially text data, which provides a new research paradigm for studying residents’ festival activities and adds residents’ perception of the festival. The research results provide a basis for the design and management of the Chinese festival system.


2018 ◽  
Vol 7 (3.33) ◽  
pp. 168
Author(s):  
Yonglak SHON ◽  
Jaeyoung PARK ◽  
Jangmook KANG ◽  
Sangwon LEE

LOD data sets consist of RDF triples based on an ontology, a specification of existing facts, and are linked to previously published knowledge according to linked data principles. These structured LOD clouds form a large global data network, which provides a more accurate foundation for delivering the information users want. However, when the same object is identified differently across several LOD data sets, it is difficult to establish that the identifiers refer to the same object. This is because objects with different URIs in LOD data sets are assumed to be different and must be closely examined for similarities before they can be judged identical. The aim of this study is to propose a model, RILE, that evaluates similarity by comparing the object values of existing specified predicates. In experiments with the model, we observed an improvement in the confidence level of the connection obtained by extracting the link value.
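To make the idea of comparing object values of specified predicates concrete, here is a small sketch. It is not the RILE model, whose details the abstract does not give; it swaps in a plain Jaccard similarity averaged over shared predicates, and the entity dictionaries, the predicate names, and the 0.8 threshold are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predicate_similarity(entity_a, entity_b, predicates):
    """Mean Jaccard similarity of object values over the given predicates.

    Each entity maps a predicate to the set of its object values.
    Predicates missing from either entity are skipped; if no predicate
    is shared, the similarity is 0.0.
    """
    shared = [p for p in predicates if p in entity_a and p in entity_b]
    if not shared:
        return 0.0
    return sum(jaccard(entity_a[p], entity_b[p]) for p in shared) / len(shared)


# Hypothetical descriptions of one city under two different URIs.
seoul_x = {
    "rdfs:label": {"seoul"},
    "ex:country": {"south korea"},
    "ex:alias": {"hanyang", "gyeongseong"},
}
seoul_y = {
    "rdfs:label": {"seoul"},
    "ex:country": {"south korea"},
    "ex:alias": {"hanyang"},
}
score = predicate_similarity(seoul_x, seoul_y, ["rdfs:label", "ex:country", "ex:alias"])
is_link_candidate = score >= 0.8  # 0.8 is an arbitrary illustrative threshold
```

Pairs that clear the threshold would be candidates for an identity link between the two data sets; a real system would also need value normalization and predicate alignment, which this sketch leaves out.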


Online discussion forums and blogs are vibrant platforms for cancer patients to express their views in the form of stories. These stories sometimes become a source of inspiration for patients who are anxiously searching for similar cases. This paper proposes a method using natural language processing and machine learning to analyze unstructured texts accumulated from patients’ reviews and stories. The proposed methodology aims to identify behavior, emotions, side effects, decisions, and demographics associated with cancer patients. The pre-processing phase of our work involves extraction of web text, followed by text cleaning, in which some special characters and symbols are removed, and finally tagging of the texts using the POS (part-of-speech) tagger of NLTK (Natural Language Toolkit). The post-processing phase trains seven machine learning classifiers (see Table 6). The Decision Tree classifier shows the highest precision (0.83) among the classifiers, while the area under the receiver operating characteristic curve (AUC) is highest for the Support Vector Machine (SVM) classifier (0.98).


2020 ◽  
Author(s):  
Daisy Massey ◽  
Chenxi Huang ◽  
Yuan Lu ◽  
Alina Cohen ◽  
Yahel Oren ◽  
...  

BACKGROUND The coronavirus disease 2019 (COVID-19) has continued to spread in the US and globally. Closely monitoring public engagement and perception of COVID-19 and preventive measures using social media data could provide important information for understanding the progress of current interventions and planning future programs. OBJECTIVE To measure the public’s behaviors and perceptions regarding COVID-19 and its effects on daily life during a recent 5-month period of the pandemic. METHODS Natural language processing (NLP) algorithms were used to identify COVID-19-related and unrelated topics in over 300 million online data sources from June 15 to November 15, 2020. Posts in the sample were geotagged, and sensitivity and specificity were both calculated to validate the classification of posts. The prevalence of discussion regarding these topics was measured over this time period and compared to daily case rates in the US. RESULTS The final sample included 9,065,733 posts, 70% of which were sourced from the US. In October and November, discussion including mentions of COVID-19 and related health behaviors did not increase as it had from June to September, despite an increase in COVID-19 daily cases in the US beginning in October. Additionally, counter to reports from March and April, discussion was more focused on daily life topics (69%), compared with COVID-19 in general (37%) and COVID-19 public health measures (20%). CONCLUSIONS There was a decline in COVID-19-related social media discussion sourced mainly from the US, even as COVID-19 cases in the US have increased to the highest rate since the beginning of the pandemic. Targeted public health messaging may be needed to ensure engagement in public health prevention measures until a vaccine is widely available to the public.
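Sensitivity and specificity, the two validation measures named in METHODS, can be computed from a manually labeled sample of posts as sketched below. The helper name and the ten-post sample are hypothetical, and the abstract does not describe how the validation sample was actually drawn.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for binary labels.

    true_labels: manual judgments, truthy = post is on the topic.
    predicted_labels: classifier output in the same order.
    """
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth and predicted:
            tp += 1
        elif truth and not predicted:
            fn += 1
        elif not truth and predicted:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative true label")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical validation sample of ten posts.
manual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(manual, model)  # 3 of 4 positives, 5 of 6 negatives
```

Sensitivity shows how many on-topic posts the classifier catches, while specificity shows how many off-topic posts it correctly leaves out; reporting both guards against a classifier that simply labels everything as on-topic.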


Author(s):  
Rashida Ali ◽  
Ibrahim Rampurawala ◽  
Mayuri Wandhe ◽  
Ruchika Shrikhande ◽  
Arpita Bhatkar

The Internet provides a medium for connecting with individuals of similar or different interests, creating a hub. Because a very large number of people participate on these platforms, a user can receive a high volume of messages from different individuals, resulting in chaos and unwanted messages. These messages sometimes contain true information and sometimes false information, which confuses users and is a first step towards spam messaging. A spam message is an irrelevant and unsolicited message sent by a known or unknown user, which may lead to a sense of insecurity among users. In this paper, different machine learning algorithms were trained and tested with natural language processing (NLP) to classify messages as spam or ham.
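The abstract does not name the algorithms that were compared. As one common baseline for spam/ham classification, the sketch below uses a multinomial Naive Bayes classifier with add-one smoothing; the choice of algorithm, the function names, and the four training messages are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train_naive_bayes(messages, labels):
    """Count messages per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = {}
    for text, label in zip(messages, labels):
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary


def classify(model, text):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    best_label, best_log_prob = None, -math.inf
    for label, n_messages in class_counts.items():
        counts = word_counts[label]
        denominator = sum(counts.values()) + len(vocabulary)  # add-one smoothing
        log_prob = math.log(n_messages / total_messages)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are ignored
                log_prob += math.log((counts[token] + 1) / denominator)
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


# Hypothetical training messages.
train_messages = [
    "win a free prize now",
    "free cash offer click now",
    "are we meeting for lunch today",
    "please send the report today",
]
train_labels = ["spam", "spam", "ham", "ham"]
spam_model = train_naive_bayes(train_messages, train_labels)
```

With this toy model, `classify(spam_model, "claim your free prize now")` returns `"spam"` and `classify(spam_model, "lunch meeting today")` returns `"ham"`; a realistic comparison would need a much larger labeled corpus and a held-out test set.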


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Rohit Ramakrishna Nadkarni ◽  
Bimal Puthuvayi

Purpose: The identification (listing) and classification (grading) of urban heritage buildings for conservation is a challenging task for urban planners and conservation architects. Most of the world's cities depend on the expert-based evaluation method (EBEM) for listing and grading heritage buildings. The city of Panaji in India provided a unique opportunity to assess the performance of the EBEM, as two independent agencies carried out the heritage listing and grading process. Considering the case of Panaji, this research aims to measure the performance of the EBEM used for listing and grading heritage buildings and identify the issues associated with the existing methodology. Design/methodology/approach: This research presents a comparative analysis of the buildings listed and graded by the two agencies. The buildings that both agencies graded were identified and analysed using a confusion matrix. The grading classification was tested for accuracy, precision, sensitivity and F-score. Findings: The result shows a low accuracy and F-score, which reflects the level of misclassification of buildings. The misclassification is the product of the lack of standardisation of methodology and the level of subjectivity involved in the EBEM. Originality/value: Heritage listing and grading is a time-consuming process, and no city has the time and resources to conduct studies to check the accuracy. Cities in India and across the world that follow a similar EBEM process should consider this study's findings, revisit their methodology, and develop a more reliable methodology for listing and grading heritage buildings.
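The comparison of two agencies' grades through a confusion matrix can be sketched as follows. Neither agency is a ground truth, so treating one as the reference is an assumption made here for illustration, as are the function names, the grade labels, and the six sample buildings.

```python
from collections import Counter


def confusion_counts(reference, other):
    """Count buildings for each (reference grade, other grade) pair."""
    return Counter(zip(reference, other))


def grade_metrics(reference, other, grade):
    """Overall accuracy plus precision, sensitivity and F-score for one grade."""
    counts = confusion_counts(reference, other)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no buildings to compare")
    accuracy = sum(n for (r, o), n in counts.items() if r == o) / total
    tp = counts[(grade, grade)]
    fp = sum(n for (r, o), n in counts.items() if o == grade and r != grade)
    fn = sum(n for (r, o), n in counts.items() if r == grade and o != grade)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    if precision + sensitivity:
        f_score = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_score": f_score,
    }


# Hypothetical grades given to the same six buildings by two agencies.
agency_a = ["I", "I", "II", "II", "III", "III"]
agency_b = ["I", "II", "II", "II", "III", "I"]
grade_one = grade_metrics(agency_a, agency_b, "I")
```

In this toy data the agencies agree on four of six buildings, and for grade "I" the precision, sensitivity and F-score are all 0.5, which is the kind of low agreement the Findings describe.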


2021 ◽  
pp. 1-10
Author(s):  
Wang Gao ◽  
Hongtao Deng ◽  
Xun Zhu ◽  
Yuan Fang

Harmful information identification is a critical research topic in natural language processing. Existing approaches have been focused either on rule-based methods or harmful text identification of normal documents. In this paper, we propose a BERT-based model to identify harmful information from social media, called Topic-BERT. Firstly, Topic-BERT utilizes BERT to take additional information as input to alleviate the sparseness of short texts. The GPU-DMM topic model is used to capture hidden topics of short texts for attention weight calculation. Secondly, the proposed model divides harmful short text identification into two stages, and different granularity labels are identified by two similar sub-models. Finally, we conduct extensive experiments on a real-world social media dataset to evaluate our model. Experimental results demonstrate that our model can significantly improve the classification performance compared with baseline methods.

