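(1. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, 410073
2. College of Computer, National University of Defense Technology, Changsha, 410073)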
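Abstract: As system scale keeps growing, reliability problems in large-scale computer systems are becoming increasingly prominent, and system management and maintenance are becoming ever more complex. This paper proposes a new design for a fault management system in which the computer itself, through self-management, detects, diagnoses, isolates, and repairs faults, reducing the overhead of system faults, providing users with a stable computing environment, and improving the availability of large-scale computer systems.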
Keywords: large-scale computer systems; fault management
The Design of a New Fault Management System
Long Cheng1,2 Kai Lu1,2 XiaoPing Wang1,2
1Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, 410073
2College of Computer, National University of Defense Technology, Changsha, 410073
Abstract: As system scale continues to grow, the reliability problems of large-scale computer systems have become increasingly prominent, making system management and maintenance more and more complex. To address this problem, this paper proposes a new design for a fault management system in which the computer system manages itself, performing fault detection, diagnosis, isolation, and repair. This reduces the overhead caused by faults, provides a stable computing environment for users, and improves the availability of large-scale computer systems.
Keywords: large-scale computer systems; fault management
References:
[1] Jack Dongarra, Pete Beckman, Terry Moore. The International Exascale Software Project Roadmap[J]. International Journal of High Performance Computing Applications, Feb. 2011, vol. 25(1): 3-60.
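[2] Zhang Kun, Xu Manwu, Liu Yufeng. Agent service matching for autonomic computing: a survey[J]. Computer Science, 2008, 35(12): 1-4. (in Chinese)
[3] Ma Huibin, Zhao Xiaonan, Li Zhanhuai. A network fault management framework with autonomic characteristics[J]. Microelectronics & Computer, 2006, 23(8): 49-52. (in Chinese)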
[4] Ada Diaconescu, Adrian Mos, John Murphy. Automatic performance management in component based software systems. Proceedings of the International Conference on Autonomic Computing, 2004: 6-18.
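[5] Lin Cheng. Design and implementation of a fault management board for high-availability servers[D]. Harbin Institute of Technology, 2012. (in Chinese)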
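About the author:
Long Cheng (1985- ), male, Han nationality, Master of Engineering in Computer Science and Technology at the College of Computer, National University of Defense Technology. His main research interest is computer software theory.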