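(School of Computer Science, Beihang University, Beijing 100191)
Abstract: Building on web data analysis by Dom Tree matching, this paper proposes a Dom Tree simplification method based on a white list strategy. Following the white list matching principle, the method prunes and compresses the nested structure of a web page, so that the resulting text tree contains only the content blocks relevant to retrieval. The paper also proposes a web data extraction method that operates on the simplified Dom Tree. This method speeds up the analysis and acquisition of web data and shortens analysis time while ensuring that the main data on the page is not lost. The analysis method is evaluated on E-commerce web page text; experiments show that the extracted data is complete and highly relevant to the page topic, and good results are obtained.
Key words: content mining; web data extraction; simplified Dom Tree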
Web data extraction based on a simplified Dom Tree
Shi Chenfang, Wang Peng
(School of Computer Science and Engineering, Beihang University, Beijing 100191, China)
Abstract: Building on web data analysis by Dom Tree matching, this paper proposes a Dom Tree simplification method based on a white list strategy. Following the white list matching principle, the method prunes and compresses the nested structure of a web page, so that the generated tree contains only the content blocks relevant to retrieval. A web data extraction method based on the simplified Dom Tree is also proposed. It raises extraction speed and shortens the time of web data analysis while ensuring that the main data on the page is not lost. Web pages from E-commerce websites are used to evaluate the method. Experiments show that the extracted data is complete and highly relevant to the page topic, and the results meet expectations.
Key words: content mining; web data extraction; simplified Dom Tree
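The pruning and compression steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say what the white list matches on, so the sketch assumes a white list of tag names (the hypothetical `WHITELIST` set), and it assumes that "compression" means collapsing a wrapper element whose only surviving child is another element. The names `Node`, `TreeBuilder`, `simplify`, `extract_text` and the sample `PAGE` are likewise invented for illustration.

```python
from html.parser import HTMLParser

# Assumption: the white list is a set of tag names allowed to remain in the tree.
WHITELIST = {"html", "body", "div", "ul", "li", "table", "tr", "td",
             "p", "span", "h1", "h2"}
# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, text=""):
        self.tag = tag          # element name, "#text", or "#root"
        self.text = text        # only used by "#text" nodes
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a plain tree of Node objects from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def simplify(node):
    """Return a pruned and compressed copy of node, or None if nothing is left."""
    if node.tag == "#text":
        return Node("#text", node.text)
    is_root = node.tag == "#root"
    if not is_root and node.tag not in WHITELIST:
        return None                      # prune: drop the whole subtree
    kept = [c for c in map(simplify, node.children) if c is not None]
    if not kept and not is_root:
        return None                      # prune: white-listed but empty
    if not is_root and len(kept) == 1 and kept[0].tag != "#text":
        return kept[0]                   # compress: skip a pure wrapper element
    out = Node(node.tag)
    out.children = kept
    return out


def extract_text(node):
    """Collect text nodes in document order."""
    if node.tag == "#text":
        return [node.text]
    return [t for c in node.children for t in extract_text(c)]


def depth(node):
    return 1 + max((depth(c) for c in node.children), default=0)


PAGE = """<html><head><title>Shop</title><script>var x = 1;</script></head>
<body><nav><a href="/">Home</a></nav>
<div><div><div><h1>Camera X</h1><p>Price: 299</p></div></div></div>
<footer>Copyright</footer></body></html>"""

full = parse(PAGE)
simple = simplify(full)
print(extract_text(simple))
print(depth(full), depth(simple))
```

In this sample the navigation, script and footer subtrees are pruned because their tags are not on the assumed white list, and the three nested `div` wrappers collapse into one, so extraction walks a much shallower tree. A real white list would likely need to consider attributes or block positions as well as tag names.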
References
Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas, Fidel Cacheda. Extracting lists of data records from semi-structured web pages [J]. Data & Knowledge Engineering, 2008, 64(2): 491-509.
Mahmoud Shaker, Hamidah Ibrahim, Aida Mustapha, Lili Nurliyana Abdullah. A Framework for Extracting Information from Semi-Structured Web Data Sources [A]. ICCIT '08 Proceedings of the 2008 Third International Conference on Convergence and Hybrid Information Technology - Volume 01[C]. IEEE Computer Society Washington, DC, USA. 2008: 27-31.
Yi-Ting Peng, Jau-Hwang Wang. Link analysis based on webpage co-occurrence mining - a case study on a notorious gang leader in Taiwan [C]. // Radar (CIE ICR '06), 2006 CIE International Conference on; Taipei, Taiwan. 2008: 31-34.
Bing Liu. Web Data Mining (in Chinese). Beijing: Tsinghua University Press, 2009.
J. Caverlee, L. Liu, D. Buttler, Probe, cluster, and discover: focused extraction of QA-Pagelets from the Deep Web, in: Proceedings of the 20th International Conference ICDE, 2004, pp. 103–115.
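About the author:
Shi Chenfang (b. 1988), female, Han ethnicity, is a master's student at the School of Computer Science, Beihang University. Her research focuses on Internet applications and Web data mining.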