Focused Crawler Strategy Based on Improved Energy Landscape Paving Algorithm
Traditional crawlers have difficulty performing semantic analysis. Focused crawler technologies with topic preference have therefore received much attention in recent years. To increase the precision of focused crawlers and prevent “topic drifting”, this paper adopts a comprehensive relevancy evaluation (CRE) of hyperlinks that combines web content and link structure. In addition, an improved version of the energy landscape paving (ELP) algorithm, a Metropolis-sampling-based global optimization method, is proposed to keep the focused crawler from falling into local optima. By incorporating the CRE strategy into the improved ELP, a novel focused crawler strategy, denoted IELP, is proposed. Experimental results on the rainstorm disaster domain show that the precision of the proposed focused crawler is clearly improved compared with other focused crawlers in the literature, illustrating the ability of IELP to retrieve topic-related web pages.
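To make the role of ELP concrete, the following is a minimal sketch of the generic ELP idea, not the improved variant proposed in this paper. It assumes the commonly described form in which a Metropolis acceptance test is applied to an energy penalized by a histogram of previously visited energy levels, so that frequently visited regions become less attractive and the search can leave local optima. The names `elp_minimize`, `landscape`, and `step`, the unit penalty per visit, and the parameters `kT` and `bin_width` are all illustrative assumptions; in a crawler setting the energy might, for instance, be derived from the negative CRE score of a hyperlink, which is likewise an assumption here.

```python
import math
import random


def elp_minimize(energy, neighbor, x0, steps=2000, kT=1.0, bin_width=0.5, seed=0):
    """Generic ELP sketch: Metropolis sampling on a histogram-penalized energy."""
    rng = random.Random(seed)
    hist = {}  # visit counts per energy bin (hypothetical penalty: one unit per visit)

    def bin_of(e):
        return math.floor(e / bin_width)

    x, e = x0, energy(x0)
    best_x, best_e = x, e
    hist[bin_of(e)] = 1

    for _ in range(steps):
        y = neighbor(x, rng)
        e_y = energy(y)
        # Penalized energies: raw energy plus the visit count of its bin.
        pen_x = e + hist.get(bin_of(e), 0)
        pen_y = e_y + hist.get(bin_of(e_y), 0)
        delta = pen_y - pen_x
        # Metropolis acceptance test on the penalized energy difference.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            x, e = y, e_y
        # Pave the landscape: record the visit to the current energy bin.
        hist[bin_of(e)] = hist.get(bin_of(e), 0) + 1
        # Track the best state by raw (unpenalized) energy.
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e


# Toy one-dimensional landscape with a local minimum at index 1.
landscape = [5, 3, 4, 6, 7, 2, 1, 8]


def step(i, rng):
    # Move one position left or right, clamped to the valid index range.
    return min(len(landscape) - 1, max(0, i + rng.choice((-1, 1))))


best_x, best_e = elp_minimize(lambda i: landscape[i], step, x0=1)
print(best_x, best_e)
```

The search starts in the local minimum at index 1. Because each visit raises the penalized energy of the current bin, staying there grows progressively more costly, which is the mechanism ELP relies on to escape local optima.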