Title:
FAULT FILE STORAGE METHOD AND RELATED APPARATUS
Document Type and Number:
WIPO Patent Application WO/2023/165512
Kind Code:
A1
Abstract:
The present application provides a fault file storage method for distributed training scenarios in the field of artificial intelligence (AI). A distributed system comprises a management node and a plurality of training nodes, wherein the plurality of training nodes cooperatively execute a training task. The method comprises: the management node acquiring a real-time signal from at least one training node among the plurality of training nodes, wherein the real-time signal represents the state of the at least one training node; and the management node performing fault detection according to the real-time signal and storing a fault file after a fault is detected, wherein the fault file is used for recovering the training task. Because fault detection is performed in real time and the fault file is stored once a fault is detected, the training result of the iteration round in which the fault occurs is retained, and the training of that iteration round does not have to be restarted on the basis of a large amount of sample data, thereby preserving training efficiency.
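The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `Signal` fields, the JSON layout of the fault file, and the function names `detect_fault`, `store_fault_file`, and `manage` are all assumptions introduced here for illustration, and the fault check is reduced to a simple health flag.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class Signal:
    """Hypothetical real-time signal reported by one training node."""
    node_id: str
    healthy: bool
    iteration: int  # iteration round the node was executing


def detect_fault(signals: Iterable[Signal]) -> Optional[Signal]:
    """Return the first signal that reports an unhealthy node, or None."""
    for sig in signals:
        if not sig.healthy:
            return sig
    return None


def store_fault_file(path: str, iteration: int, state: Dict[str, Any]) -> None:
    """Write the fault file via a temporary file and an atomic rename.

    The JSON layout (iteration round plus training state) is an assumption.
    """
    payload = {"iteration": iteration, "state": state}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)


def manage(signals: Iterable[Signal], path: str, state: Dict[str, Any]) -> bool:
    """Management-node step: detect a fault and, if found, store the fault file.

    Returns True when a fault file was written.
    """
    fault = detect_fault(signals)
    if fault is None:
        return False
    store_fault_file(path, fault.iteration, state)
    return True
```

In this sketch the file is written only when a fault is detected, so the saved state corresponds to the iteration round in progress at that moment; a recovery step could reload it instead of retraining that round from the sample data.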

Inventors:
HAO RIPEI (CN)
WANG YIBIN (CN)
Application Number:
PCT/CN2023/078980
Publication Date:
September 07, 2023
Filing Date:
March 01, 2023
Assignee:
HUAWEI TECH CO LTD (CN)
International Classes:
G06F16/182; G06F16/172; G06N3/04; G06N3/08
Foreign References:
CN114968947A    2022-08-30
CN113569987A    2021-10-29
US20190114537A1    2019-04-18
CN105095008A    2015-11-25
CN105515847A    2016-04-20