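(School of Computer Science, National University of Defense Technology, Changsha 410073)1
(Information Center, National University of Defense Technology, Changsha 410073)2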
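Abstract: An ordinary web page can be divided into two parts, main content and noise. Noise interferes with processing tasks such as main-content extraction and clustering, so removing it quickly and accurately is one of the key techniques in web information processing. Based on the similarity of text format attributes within a web page, this paper proposes a denoising algorithm based on web page tag attributes and applies the pages processed by this algorithm to the K-MEANS clustering algorithm. Experimental results show that the proposed denoising algorithm is effective and that the accuracy of the clustering results is noticeably improved.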
Keywords: web page noise reduction; tag attributes; DOM tree; maximal compatible classes
An Approach for Noise Reduction in Web Pages Based on Tag Attributes
JIANG Kun1 YANG Yue-xiang2 FANG Hong2
(1. School of Computer Science, National University of Defense Technology, Changsha, 410073;
2. Information Center, National University of Defense Technology, Changsha, 410073)
Abstract: A typical web page can be separated into two parts: valuable content and noise. Noise interferes with content extraction, clustering, and other processing, so eliminating it accurately and efficiently is a key technique in web information processing. Based on the similarity of text format attributes within a web page, we present a noise reduction algorithm based on the tag attributes of the page and apply the pages it processes to K-MEANS clustering. The experimental results show that the proposed algorithm is effective and that clustering accuracy is improved.
Key words: Noise Reduction; Tag Attributes; DOM Tree; Maximal Compatible Classes
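About the authors: JIANG Kun (born 1984), male, master's student, whose main research interests are computer networks and security; YANG Yue-xiang, professor and doctoral supervisor; FANG Hong, senior engineer.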