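(School of Computer Science and Engineering, Beijing University of Aeronautics and Astronautics, Beijing 100191)
Abstract: For vectors composed of mixed-type elements, that is, vectors containing both discrete-valued and continuous-valued elements, this paper proposes a data-driven method for computing attribute weights. The search space is determined from the values of the query vector, and the distribution of attribute values within that space is collected. From this distribution, the discrimination degree of each attribute in the search is computed dynamically, which gives the weight of each attribute in the similarity computation; these weights are then introduced into retrieval based on vector distance. The retrieval method is evaluated on a clothing database. The experimental results show that the data-driven weight calculation captures attribute discrimination well and makes the retrieval results better match user expectations.
Keywords: data-driven; weight calculation; search space; vector distance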
Similarity Search Method Based on Weighted Vector Distance
Wang Peng, Shi Chenfang
(School of Computer Science and Engineering, Beijing University of Aeronautics and Astronautics, Beijing 100191, China)
Abstract: This paper proposes a data-driven attribute weight calculation method applicable to vectors consisting of both categorical and continuous attributes. A search space is specified according to the query vector. Within this space, the data distribution is analyzed dynamically to obtain the discrimination degree, and from it the weight, of each attribute. To evaluate the method, we experiment with a real database containing several attributes of clothes. The experiments show that the method improves accuracy and that the results better fit users' expectations.
Key words: similarity search; data-driven; weighted vector distance; search space
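The pipeline outlined in the abstract (search space, attribute distribution, discrimination degree, weights, weighted distance) can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything concrete here is an assumption made for illustration: the search-space rule (a record is kept if it is close to the query on at least one attribute), the discrimination measure (the share of the search space that is not close to the query on an attribute, an IDF-like choice), the tolerance `tol`, and all function names and sample records.

```python
import math

def attribute_distance(a, b, value_range=None):
    """Per-attribute distance in [0, 1]: 0/1 mismatch for categorical (str)
    values, range-normalized absolute difference for continuous ones."""
    if isinstance(a, str):
        return 0.0 if a == b else 1.0
    if not value_range:  # constant attribute: no spread to normalize by
        return 0.0
    return min(abs(a - b) / value_range, 1.0)

def attribute_ranges(records, attrs):
    """Value range of each continuous attribute; None for categorical ones."""
    ranges = {}
    for attr in attrs:
        values = [r[attr] for r in records]
        ranges[attr] = None if isinstance(values[0], str) else max(values) - min(values)
    return ranges

def search_space(query, records, ranges, tol):
    """Assumed rule: keep records close to the query on at least one attribute."""
    return [r for r in records
            if any(attribute_distance(query[a], r[a], ranges[a]) <= tol for a in query)]

def attribute_weights(query, space, ranges, tol):
    """Assumed discrimination measure: share of the search space that is NOT
    close to the query on the attribute. Weights are these shares normalized."""
    disc = {}
    for a in query:
        far = sum(attribute_distance(query[a], r[a], ranges[a]) > tol for r in space)
        disc[a] = far / len(space) if space else 0.0
    total = sum(disc.values())
    if total == 0:  # nothing discriminates: fall back to uniform weights
        return {a: 1.0 / len(query) for a in query}
    return {a: d / total for a, d in disc.items()}

def weighted_distance(query, record, weights, ranges):
    """Weighted Euclidean distance over the per-attribute distances."""
    return math.sqrt(sum(
        weights[a] * attribute_distance(query[a], record[a], ranges[a]) ** 2
        for a in query))

def search(query, records, k=3, tol=0.25):
    """Return the k nearest records in the search space and the weights used."""
    ranges = attribute_ranges(records, list(query))
    space = search_space(query, records, ranges, tol)
    weights = attribute_weights(query, space, ranges, tol)
    ranked = sorted(space, key=lambda r: weighted_distance(query, r, weights, ranges))
    return ranked[:k], weights

# Hypothetical clothing records, invented for this sketch.
records = [
    {"category": "shirt", "color": "blue", "price": 30.0},
    {"category": "shirt", "color": "red", "price": 35.0},
    {"category": "shirt", "color": "blue", "price": 90.0},
    {"category": "coat", "color": "blue", "price": 120.0},
    {"category": "coat", "color": "black", "price": 130.0},
]
query = {"category": "shirt", "color": "blue", "price": 32.0}
top, weights = search(query, records)
print(weights)
print(top)
```

Because the weights are recomputed from the search space of each query, the same attribute can weigh heavily for one query and lightly for another, which is the data-driven behavior the abstract describes. In this toy data, price separates the candidate records more than category or color does, so it receives the largest weight.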
References
Shyam Boriah, Varun Chandola, Vipin Kumar. Similarity Measures for Categorical Data: A Comparative Evaluation(J). In Proceedings of the Eighth SIAM International Conference on Data Mining. 2008, 243-254.
Rui Yang, Panos Kalnis, Anthony K. H. Tung. Similarity Evaluation on Tree-structured Data(J). 2005 ACM SIGMOD. 2005, 754-765.
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval(M). Beijing: Posts & Telecom Press, 2011, 12.
Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition(M). Beijing: China Machine Press, 2011, 10.
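Author biography:
Wang Peng (born 1988), male, Han ethnicity, master's student at the School of Computer Science and Engineering, Beijing University of Aeronautics and Astronautics. His research focuses on Internet applications and information retrieval technology.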