Design and Implementation of English Intelligent Communication Platform Based on Similarity Algorithm

Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Yujie Chai

Intelligent communication processing in English aims to extract effective information from unstructured text data using various text processing techniques. Text vector representation and text similarity calculation are fundamental tasks across the whole field of natural language processing. In response to the shortcomings of existing sentence vector representation models and the one-sidedness of existing text similarity algorithms, this paper proposes improved models and algorithms based on a thorough study of the related technologies. The paper presents an in-depth, comprehensive study of text vectorization and text similarity calculation algorithms in natural language processing. Existing text vectorization models and text similarity algorithms are described, and their shortcomings are summarized, motivating this work and suggesting directions for improvement. Experiments verify that the sentence vector model proposed in this paper achieves higher accuracy than the SIF sentence vector model on text classification tasks. On the text similarity task, it achieves better results on three evaluation metrics: precision, recall, and F1 score. The algorithm also improves computational efficiency to a certain extent by removing feature words with low feature contribution. The algorithm first addresses the deficiencies of the traditional word mover's distance algorithm by defining multifeature fusion weights, realizing a text similarity algorithm based on multifeature weighted fusion with better similarity results. A linear weighting model is then constructed to further combine the similarity results of the hierarchical pooled IIG-SIF sentence vectors, realizing a multimodel fusion text similarity calculation algorithm.
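As context for the SIF baseline mentioned above, a minimal sketch of SIF-style sentence vectors and their cosine comparison. The word vectors and unigram probabilities below are toy values, not the paper's IIG-SIF model:

```python
import math

# Toy word vectors and unigram probabilities (illustrative, not from the paper).
WORD_VECS = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "sat": [0.0, 1.0], "ran": [0.1, 0.9],
}
WORD_PROB = {"cat": 0.01, "dog": 0.01, "sat": 0.05, "ran": 0.05}

def sif_sentence_vector(words, a=1e-3):
    """SIF-style weighted average: each word vector is scaled by a / (a + p(w)),
    down-weighting frequent words."""
    vec = [0.0, 0.0]
    for w in words:
        weight = a / (a + WORD_PROB[w])
        for i in range(2):
            vec[i] += weight * WORD_VECS[w][i]
    return [v / len(words) for v in vec]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Two sentences built from near-synonymous words score high:
sim = cosine(sif_sentence_vector(["cat", "sat"]), sif_sentence_vector(["dog", "ran"]))
```

The full SIF method also subtracts each sentence vector's projection onto the corpus's first principal component; that step is omitted in this sketch.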

2011 ◽  
Vol 225-226 ◽  
pp. 1105-1108
Author(s):  
Lian Li ◽  
Ai Hong Zhu ◽  
Tao Su

Text similarity calculation is a key technology in text clustering, intelligent Web retrieval, natural language processing, and related fields. Because the traditional text similarity calculation algorithm does not consider the effect of feature words shared between texts, it can sometimes produce inaccurate results. To solve this problem, this paper presents an improved text similarity calculation algorithm. Since the number of shared feature words reflects two texts' similarity to some extent, the improved algorithm adds a coverage parameter, which effectively reduces the interference of texts with lower similarity. Simulation and experimental results verify the improved algorithm's correctness and effectiveness.
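A minimal sketch of the idea, under one plausible reading of the coverage parameter (the paper's exact formula is not given here): cosine similarity over term-frequency vectors, scaled by the fraction of feature words the two texts share.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def coverage_similarity(doc1, doc2):
    """Cosine similarity over term-frequency vectors, scaled by a coverage
    factor: the fraction of the joint vocabulary shared by both texts."""
    t1, t2 = doc1.split(), doc2.split()
    vocab = sorted(set(t1) | set(t2))
    coverage = len(set(t1) & set(t2)) / len(vocab)
    tf1 = [t1.count(w) for w in vocab]
    tf2 = [t2.count(w) for w in vocab]
    return cosine(tf1, tf2) * coverage
```

Texts with few shared feature words are pushed toward zero even when their weighted vectors happen to align, which is the interference-reduction effect the abstract describes.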


2021 ◽  
pp. 147387162110388
Author(s):  
Mohammad Alharbi ◽  
Matthew Roach ◽  
Tom Cheesman ◽  
Robert S Laramee

In general, Natural Language Processing (NLP) algorithms exhibit black-box behavior. Users input text and output are provided with no explanation of how the results are obtained. In order to increase understanding and trust, users value transparent processing which may explain derived results and enable understanding of the underlying routines. Many approaches take an opaque approach by default when designing NLP tools and do not incorporate a means to steer and manipulate the intermediate NLP steps. We present an interactive, customizable, visual framework that enables users to observe and participate in the NLP pipeline processes, explicitly manipulate the parameters of each step, and explore the result visually based on user preferences. The visible NLP (VNLP) pipeline design is then applied to a text similarity application to demonstrate the utility and advantages of a visible and transparent NLP pipeline in supporting users to understand and justify both the process and results. We also report feedback on our framework from a modern languages expert.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Ruiteng Yan ◽  
Dong Qiu ◽  
Haihuan Jiang

Sentence similarity calculation is an important foundation of natural language processing. Existing sentence similarity measures are based either on shallow semantics, which inadequately capture latent semantic information, or on deep learning algorithms, which require supervision. In this paper, we improve the traditional tolerance rough set model, yielding lower time complexity and support for incremental updates. We then propose a sentence similarity computation model, based on the probabilistic tolerance rough set model, that treats text data from the perspective of uncertainty. It can mine latent semantic information and is unsupervised. Experiments on the SICK2014 task and the STSbenchmark dataset show that our model computes sentence similarity effectively and efficiently.
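A rough, non-probabilistic sketch of tolerance-rough-set similarity on a toy corpus (the paper's probabilistic model and its incremental variant are not reproduced here): each word's tolerance class is the set of words co-occurring with it often enough, and two sentences are compared through the upper approximations these classes induce.

```python
# Toy corpus of documents as word sets (illustrative only).
DOCS = [
    {"stock", "market", "price"},
    {"stock", "price", "trade"},
    {"dog", "cat", "pet"},
]

def cooccurrence(w1, w2):
    """Number of documents containing both words."""
    return sum(1 for d in DOCS if w1 in d and w2 in d)

def tolerance_class(word, theta=1):
    """Words co-occurring with `word` in at least `theta` documents."""
    vocab = set().union(*DOCS)
    return {w for w in vocab if cooccurrence(word, w) >= theta}

def upper_approx(sentence_words, theta=1):
    """Union of the tolerance classes of the sentence's words."""
    out = set()
    for w in sentence_words:
        out |= tolerance_class(w, theta)
    return out

def trs_similarity(s1, s2, theta=1):
    """Jaccard overlap of upper approximations: latent terms shared through
    tolerance classes contribute even if absent from both sentences."""
    u1, u2 = upper_approx(s1, theta), upper_approx(s2, theta)
    return len(u1 & u2) / len(u1 | u2)
```

Here {"stock"} and {"trade"} never co-occur in a sentence pair directly, yet score as similar because their tolerance classes overlap through "price" and the shared financial documents.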


2021 ◽  
Vol 26 (5) ◽  
pp. 453-460
Author(s):  
Krishna Chythanya Nagaraju ◽  
Cherku Ramesh Kumar Reddy

A reusable code component is one that can be used with little or no adaptation to fit the application being developed. The major concern in this process is maintaining these reusable components in one place, called a "Repository", so that they can be effectively identified and reused. Word embeddings allow us to represent textual information numerically. They have become so pervasive that almost all Natural Language Processing projects make use of them. In this work, we use Word2Vec to find vector representations of the features of a reusable component. The features of a reusable component, as a sequence of words, are input to the Word2Vec network. Our method, using Word2Vec with Continuous Bag of Words, outperforms existing methods. The proposed methodology achieves an accuracy of 94.8% in identifying an existing reusable component.
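The matching step can be sketched as follows, with toy stand-in vectors in place of trained CBOW output (component names, feature words, and vector values are all illustrative, not from the paper):

```python
import math

# Stand-ins for Word2Vec CBOW feature-word vectors (illustrative values).
VECS = {
    "sort": [1.0, 0.0, 0.0], "array": [0.8, 0.2, 0.0],
    "parse": [0.0, 1.0, 0.0], "json": [0.0, 0.9, 0.1],
    "http": [0.0, 0.0, 1.0], "client": [0.1, 0.0, 0.9],
}

# Hypothetical repository: component name -> feature words.
REPOSITORY = {
    "QuickSortComponent": ["sort", "array"],
    "JsonParserComponent": ["parse", "json"],
    "HttpClientComponent": ["http", "client"],
}

def component_vector(features):
    """Average the feature-word vectors, as CBOW averages context vectors."""
    v = [0.0, 0.0, 0.0]
    for f in features:
        for i in range(3):
            v[i] += VECS[f][i]
    return [x / len(features) for x in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_match(query_features):
    """Return the repository component whose vector is nearest the query."""
    qv = component_vector(query_features)
    return max(REPOSITORY, key=lambda n: cosine(qv, component_vector(REPOSITORY[n])))

print(best_match(["sort", "array"]))  # → QuickSortComponent
```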


2020 ◽  
Vol 9 (05) ◽  
pp. 25039-25046
Author(s):  
Rahul C Kore ◽  
Prachi Ray ◽  
Priyanka Lade ◽  
Amit Nerurkar

Reading legal documents is tedious and sometimes requires domain knowledge related to the document. It is hard to read a full legal document without missing its key sentences. With the increasing number of legal documents, it is convenient to get the essential information without having to go through the whole document. The purpose of this study is to make a large legal document understandable within a short time. Summarization gives the reader flexibility and convenience. Using vector representations of words, text ranking algorithms, and similarity techniques, this study produces the highest-ranked sentences. The summary covers the most vital information of the document in a concise manner. The paper shows how different natural language processing concepts can be combined to produce the desired result and spare readers from going through the whole complex document. This study presents the steps required to achieve this aim and elaborates on the algorithms used at every step of the process.
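The sentence-ranking step can be sketched with a TextRank-style algorithm, one common choice for the "text ranking algorithms" the abstract mentions (an assumption, since the abstract does not name the exact method):

```python
import math

def sentence_similarity(s1, s2):
    """TextRank-style overlap: shared words, normalized by sentence lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

def rank_sentences(sentences, damping=0.85, iters=30):
    """PageRank-like iteration over the sentence-similarity graph;
    returns sentence indices ordered from most to least central."""
    n = len(sentences)
    sim = [[sentence_similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)] for i, a in enumerate(sentences)]
    scores = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                total = sum(sim[j])
                if sim[j][i] and total:
                    rank += sim[j][i] / total * scores[j]
            new.append((1 - damping) + damping * rank)
        scores = new
    return sorted(range(n), key=lambda i: -scores[i])
```

A summary then keeps the top-ranked sentences; unrelated sentences receive no incoming weight and sink to the bottom of the ranking.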


Author(s):  
Piotr Bojanowski ◽  
Edouard Grave ◽  
Armand Joulin ◽  
Tomas Mikolov

Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
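The subword scheme can be sketched as follows. Boundary markers and the 3–6 n-gram range follow the fastText convention; the gram vectors in the usage are toy values:

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Character n-grams of the word wrapped in boundary markers '<' and '>'.
    The full word (with markers) is also kept as its own feature."""
    w = f"<{word}>"
    grams = []
    for n in range(nmin, nmax + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    if w not in grams:
        grams.append(w)
    return grams

def word_vector(word, gram_vecs, dim=2):
    """Sum the vectors of the word's n-grams. Unseen grams contribute nothing,
    so out-of-vocabulary words still receive a representation."""
    v = [0.0] * dim
    for g in char_ngrams(word):
        if g in gram_vecs:
            for i in range(dim):
                v[i] += gram_vecs[g][i]
    return v
```

For example, `char_ngrams("cat", 3, 3)` yields `["<ca", "cat", "at>", "<cat>"]`; a word never seen in training still decomposes into grams that may have been trained on other words.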


2020 ◽  
Vol 10 (12) ◽  
pp. 4386 ◽  
Author(s):  
Sandra Rizkallah ◽  
Amir F. Atiya ◽  
Samir Shaheen

Embedding words from a dictionary as vectors in a space has become an active research field, due to its many uses in several natural language processing applications. Distances between the vectors should reflect the relatedness between the corresponding words. The problem with existing word embedding methods is that they often fail to distinguish between synonymous, antonymous, and unrelated word pairs. Meanwhile, polarity detection is crucial for applications such as sentiment analysis. In this work we propose an embedding approach that is designed to capture polarity. The approach is based on embedding the word vectors on a sphere, whereby the dot product between any two vectors represents their similarity. Vectors corresponding to synonymous words lie close to each other on the sphere, while a word and its antonym lie at opposite poles. The approach used to design the vectors is a simple relaxation algorithm. The proposed word embedding is successful in distinguishing between synonyms, antonyms, and unrelated word pairs. It achieves results that are better than those of some of the state-of-the-art techniques and competes well with the others.
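The intended geometry can be illustrated with unit-normalized toy vectors (values made up): on the unit sphere, the dot product of two vectors equals their cosine similarity, so synonyms score near +1 and antipodal antonyms near -1.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere, so dot product = cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy embeddings illustrating the geometry the paper targets:
# synonyms cluster together; an antonym sits near the opposite pole.
good = normalize([1.0, 0.1])
great = normalize([0.9, 0.2])   # near-synonym of "good"
bad = normalize([-1.0, -0.1])   # antipodal to "good"
```

The relaxation algorithm that actually places the vectors is not reproduced here; this only shows how polarity reads off the sphere once vectors are placed.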


2018 ◽  
Vol 13 (5) ◽  
pp. 881-894
Author(s):  
Yanfeng Sun ◽  
Minglei Zhang ◽  
Si Chen ◽  
Xiaohu Shi

Inspired by embedding representations in Natural Language Processing (NLP), we develop a financial embedded vector representation model to abstract the temporal characteristics of financial time series. Original financial features are first discretized, and each set of discretized features is treated as a "word" of NLP, while the whole financial time series corresponds to the "sentence" or "paragraph". The embedded vector models of NLP can therefore be applied to financial time series. To test the proposed model, we use RBF neural networks as the regression model to predict financial series, comparing the financial embedding vectors as input against the original features. Numerical results show that prediction accuracy on the test data improves by about 4-6 orders of magnitude, indicating that the financial embedded vector has strong generalization ability.
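The discretization step can be sketched as simple equal-width binning (an illustrative scheme; the paper's exact discretization is not specified here). Each value becomes a "word", and the series becomes the "sentence" an embedding model consumes:

```python
def discretize(series, n_bins=4):
    """Map each value of a numeric series to an equal-width bin label,
    turning the series into a 'sentence' of 'words'."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant series
    labels = []
    for x in series:
        b = min(int((x - lo) / width), n_bins - 1)  # clamp max value to top bin
        labels.append(f"bin{b}")
    return labels
```

In practice each "word" would combine several discretized features (price, volume, etc.); a single-feature version is shown for brevity.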


Author(s):  
Md. Rajib Hossain ◽  
Mohammed Moshiul Hoque

Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval, and question answering. Investigating embedding models helps to reduce the feature space and improves semantic as well as syntactic textual relations. This paper presents three embedding techniques (Word2Vec, GloVe, and FastText) with different hyperparameters, implemented on a Bengali corpus of 180 million words. The performance of the embedding techniques is evaluated extrinsically and intrinsically. Extrinsic performance is evaluated by text classification, which achieves a maximum accuracy of 96.48%. Intrinsic performance is evaluated by word similarity (semantic, syntactic, and relatedness) and analogy tasks. The maximum Pearson correlation (r̂) achieved is 60.66% (Ssr̂) for semantic similarity and 71.64% (Syr̂) for syntactic similarity, while relatedness obtains 79.80% (Rsr̂). Semantic word analogy tasks achieve 44.00% accuracy, while syntactic word analogy tasks obtain 36.00%.
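The word-analogy evaluation can be sketched with toy 2-D vectors (illustrative values, not from the Bengali corpus): solve a:b :: c:? by nearest neighbor to vec(b) - vec(a) + vec(c).

```python
import math

# Toy 2-D word vectors (illustrative, not trained embeddings).
VECS = {
    "king": [1.0, 1.0], "queen": [0.0, 1.0],
    "man": [1.0, 0.0], "woman": [0.1, 0.1],
    "apple": [0.5, -0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Answer a:b :: c:? as the word (excluding the query words) whose vector
    is closest in cosine to vec(b) - vec(a) + vec(c)."""
    target = [VECS[b][i] - VECS[a][i] + VECS[c][i] for i in range(2)]
    candidates = {w: v for w, v in VECS.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))
```

Analogy accuracy is then the fraction of such queries answered correctly over a benchmark set.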

