A WEB information extraction method based on DOM tree structure and information entropy

Author(s):  
Yu-Ling Wang
2013 ◽  
Vol 774-776 ◽  
pp. 1802-1806
Author(s):  
Zhi Ming Zhang ◽  
Shuai Shuai Huang ◽  
Ping Li

With the rapid development of the Internet and the surge in the amount of information available on it, how to accurately and quickly obtain the information users actually need, such as titles, links, and pictures, has become a research hotspot. This paper proposes a fast web information extraction method based on an HTML parser and validates it by extracting commodity information from an e-commerce website. The results show that the proposed method extracts information more accurately than a method based on regular expressions, and that extraction time is greatly shortened.
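
As a rough illustration of parser-based extraction of titles, links, and pictures, the following minimal sketch uses Python's standard `html.parser` module. It is not the authors' implementation; the class name `ProductExtractor`, the helper `extract`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ProductExtractor(HTMLParser):
    """Collects the page title, link targets, and image sources in one pass."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data


def extract(html):
    """Hypothetical helper: parse one page and return the collected fields."""
    parser = ProductExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "links": parser.links,
        "images": parser.images,
    }


# Made-up sample page standing in for an e-commerce listing.
SAMPLE = """<html><head><title> Demo Shop </title></head><body>
<div class="item"><a href="/p/1"><img src="/img/1.jpg" alt="one"></a></div>
<div class="item"><a href="/p/2"><img src="/img/2.jpg"></a></div>
</body></html>"""

result = extract(SAMPLE)
print(result)
```

Because the parser tracks tag boundaries rather than matching character patterns, differences in attribute order or whitespace do not affect what is collected, which is one plausible reason such a method could be more accurate than regular expressions.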


2014 ◽  
Vol 614 ◽  
pp. 503-506
Author(s):  
Qi Shen ◽  
Qing Ming Song ◽  
Bo Chen

With the development of web technology, dynamic web pages and personalized page content have become increasingly popular. Page information is now highly variable and the structures of different pages differ widely, so traditional approaches to web information extraction have difficulty adapting. This paper proposes a web information extraction method based on an extended XPath policy, derived from an analysis of the structural features of tourism-themed web pages. The algorithm avoids the defects of traditional web information extraction technology; it is simple and practical, offers high cleaning efficiency and accuracy, and reduces system overhead.
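
The abstract does not say what the extended XPath policy consists of. One possible reading, sketched below purely as an assumption, is that each field carries an ordered list of XPath expressions that are tried in turn, so a page whose layout differs from the primary template can still be matched. The rule table `RULES`, the function `extract_fields`, and the sample page are hypothetical. The sketch uses the limited XPath subset in Python's `xml.etree.ElementTree`, which requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: for each field, XPath expressions in priority order.
RULES = {
    "name": [".//div[@class='spot']/h2", ".//h1"],
    "price": [".//span[@class='price']"],
}


def extract_fields(xhtml, rules):
    """Try each XPath for a field in order and keep the first non-empty text."""
    root = ET.fromstring(xhtml)
    out = {}
    for field, paths in rules.items():
        out[field] = None
        for path in paths:
            node = root.find(path)
            if node is not None and node.text and node.text.strip():
                out[field] = node.text.strip()
                break
    return out


# Made-up, well-formed page fragment on a tourism theme.
PAGE = (
    "<html><body><h1>West Lake</h1>"
    "<div class='info'><span class='price'>40 CNY</span></div>"
    "</body></html>"
)

fields = extract_fields(PAGE, RULES)
print(fields)
```

Here the first rule for `name` does not match the sample page, so the fallback `.//h1` supplies the value.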


2013 ◽  
Vol 397-400 ◽  
pp. 1972-1978
Author(s):  
Song Pu Wu ◽  
Qing Wang

An adaptive web information extraction approach is presented in this paper. Most traditional web information extraction approaches depend on the templates of web sites: if the templates change, the extraction rules must be redesigned. To reduce maintenance costs and improve the adaptability of information extractors, an adaptive approach based on the STU-DOM tree is proposed. The webpage is first parsed into DOM trees using HTML Parser. The DOM trees are then filtered into STU-DOM trees to identify the blocks that contain keywords of a certain topic. The approach is applied to webpages, and the results show that it not only extracts information efficiently but is also independent of site structure.
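
The abstract does not define how the STU-DOM tree is built, so the sketch below only illustrates the general idea of filtering a DOM tree down to the blocks that mention topic keywords. It is an assumption-laden stand-in, not the paper's construction; `Node`, `TreeBuilder`, `keyword_blocks`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

    def full_text(self):
        """Text of this node plus all of its descendants."""
        return " ".join([self.text] + [c.full_text() for c in self.children])


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.text += data


def keyword_blocks(node, keywords):
    """Return the deepest nodes whose subtree text mentions any keyword."""
    text = node.full_text().lower()
    if not any(k.lower() in text for k in keywords):
        return []  # prune: nothing on-topic below this node
    hits = []
    for child in node.children:
        hits.extend(keyword_blocks(child, keywords))
    return hits or [node]


# Made-up page: a navigation block and a content block.
HTML = (
    "<div id='nav'><a>Home</a></div>"
    "<div id='main'><p>Ticket price: 40</p><p>Open daily</p></div>"
)
builder = TreeBuilder()
builder.feed(HTML)
builder.close()
blocks = keyword_blocks(builder.root, ["price"])
print([(b.tag, b.text) for b in blocks])
```

Since the filter relies on keyword presence rather than on fixed paths, the same code applies to pages with different layouts, which is the sense in which such an approach does not depend on a site template.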

