

Title:
A METHOD FOR INVENTORY MANAGEMENT
Document Type and Number:
WIPO Patent Application WO/2023/180421
Kind Code:
A1
Abstract:
The invention relates to a method for inventory management of different types of products in a storage space, each type of product being associated with a state at each time, the state of each type of product being modified at each time by an action consisting in ordering a quantity of said type of product, the method comprising the following steps: - obtaining a training database, - training a control model for determining at each time an ordered quantity for each type of product depending on the state of each type of product, the control model being trained in a training environment on the basis of the training database so as to maximize a reward function for each cluster, and - operating the control model.

Inventors:
GOURVENEC SÉBASTIEN (FR)
LELUC RÉMI (FR)
Application Number:
PCT/EP2023/057416
Publication Date:
September 28, 2023
Filing Date:
March 23, 2023
Assignee:
TOTALENERGIES ONETECH (FR)
International Classes:
G06Q10/087
Foreign References:
EP3757915A1 (2020-12-30)
Other References:
HACHAICHI YASSINE ET AL: "A Policy Gradient Based Reinforcement Learning Method for Supply Chain Management", 2020 4TH INTERNATIONAL CONFERENCE ON ADVANCED SYSTEMS AND EMERGENT TECHNOLOGIES (IC_ASET), IEEE, 15 December 2020 (2020-12-15), pages 135 - 140, XP033879239, DOI: 10.1109/IC_ASET49463.2020.9318258
CHRISTIAN D HUBBS ET AL: "OR-Gym: A Reinforcement Learning Library for Operations Research Problems", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 October 2020 (2020-10-17), XP081789485
Attorney, Agent or Firm:
DOMENEGO, Bertrand et al. (FR)
Claims:
CLAIMS

1.- A method for inventory management of different types of products in a storage space, the types of products being classified into M different clusters corresponding to different storage subspaces of the storage space, each cluster (k) comprising at least one type of product (i), each storage subspace having a maximum storage capacity (Λ(k)), each type of product (i) being associated with a state (st(i)) at each time (t), the state (st(i)) of each type of product (i) being modified at each time (t) by an action consisting in ordering a quantity (at(i)) of said type of product (i), each ordered quantity (at(i)) being received after a lead-time (Tt(i)), each state (st(i)) of a type of product (i) comprising at least an inventory level (xt(i)) and a demand quantity (δt(i)), the inventory level (xt(i)) being the on-hand inventory of the considered type of product (i) at time (t), the demand quantity (δt(i)) being the quantity of the considered type of product (i) in demand, the method comprising the following steps which are computer-implemented:

- obtaining a training database comprising, for each type of product (i), an initial inventory level (x0(i)), a demand quantity (δt(i)) for each time (t) and a lead-time (Tt(i)) associated to an ordered quantity (at(i)) at said time (t),

- training a control model (M) for determining at each time (t) an ordered quantity (at(i)) for each type of product (i) depending on the state (st(i)) of each type of product (i), the control model (M) being trained in a training environment (E) on the basis of the training database so as to maximize a reward function (Rt(k)) for each cluster (k), the training environment being a reinforcement learning environment with an agent for each type of product (i), each reward function being relative to inventory costs, and

- operating the control model (M) in order to determine at each time (t) an ordered quantity (at(i)) for each type of product (i) of the storage space, following the reception by the control model (M), of the current state (st(i)) of each type of product (i).

2.- A method according to claim 1, wherein each state (st(i)) of a type of product (i) also comprises a backlog level (βt(i)) and a replenishment quantity (ρt(i)), the backlog level (βt(i)) being an unavailable quantity of the considered type of product (i) when demand exceeds the available inventory level, the replenishment quantity (ρt(i)) being a quantity of the considered type of product (i) that will be added to the inventory level when the inventory check is done at the next time (t+1), the replenishment quantity (ρt(i)) depending on the previous ordered quantities (a1(i), ..., at(i)) for said type of product (i) and their corresponding lead-times (T1(i), ..., Tt(i)).

3.- A method according to claim 2, wherein during the training step and the operating step, the inventory level (xt+1(i)) for each type of product (i) at the next time (t+1) is computed as a function of the corresponding previous inventory level (xt(i)), of the corresponding previous demand quantity (δt(i)), and of the corresponding previous replenishment quantity (ρt(i)), the backlog level (βt+1(i)) for each type of product (i) at the next time (t+1) being computed as a function of the corresponding previous backlog level (βt(i)), of the corresponding previous inventory level (xt(i)), of the corresponding previous demand quantity (δt(i)) and of the corresponding previous replenishment quantity (ρt(i)).

4.- A method according to claim 3, wherein during the training step and the operating step, a temporary inventory level (λt(i)) is calculated at each time (t), for each type of product (i), as a function of the replenishment quantity (ρt(i)) at said time (t) and of the inventory level (xt(i)) at said time (t), and wherein, for each cluster (k), when the sum of the temporary inventory levels (λt(i)) of the corresponding types of products is greater than the maximum storage capacity (Λ(k)) of said cluster (k), which causes a storage overflow, an overflow weight (wt(i)) is applied to the replenishment quantity (ρt(i)) of each type of product (i) during computation of the inventory level (xt+1(i)) at the next time (t+1) and during computation of the backlog level (βt+1(i)) at the next time (t+1).

5.- A method according to claim 4, wherein the inventory level (xt+1(i)) and the backlog level (βt+1(i)) for each type of product (i) at the next time (t+1) are given by the following formulae:

xt+1(i) = max[ xt(i) + [wt(i) · ρt(i)] - δt(i) ; 0 ]

βt+1(i) = βt(i) - min[ xt(i) + [wt(i) · ρt(i)] - δt(i) ; 0 ]

where:

• xt+1(i) is the inventory level of the type of product i at time t+1,
• xt(i) is the inventory level of the type of product i at time t,
• wt(i) is the overflow weight applied to the replenishment quantity ρt(i),
• ρt(i) is the replenishment quantity of the type of product i at time t,
• δt(i) is the demand quantity of the type of product i at time t,
• βt+1(i) is the backlog level of the type of product i at time t+1,
• βt(i) is the backlog level of the type of product i at time t,
• [X] is the integer part of X,
• max[X ; 0] is the maximum between X and 0, and
• min[X ; 0] is the minimum between X and 0.

6.- A method according to any one of claims 1 to 5, wherein at each time (t), each type of product (i) incurs some inventory costs (Ct(i)) given by the combination of ordering costs (CO(i)), holding costs (CH(i)) and shortage costs (CS(i)) for the considered type of product (i), the ordering costs (CO(i)) being the costs associated with the ordering of the considered type of product (i), the holding costs (CH(i)) being the costs associated with storing inventory that remains unused, the shortage costs (CS(i)) being the costs incurred when demand exceeds the available inventory for the considered type of product (i), the reward function (Rt(k)) of each cluster (k) depending on the inventory costs (Ct(i)) of each type of product (i) of said cluster (k).

7.- A method according to claim 6 when dependent on claim 4 or claim 5, wherein when an overflow occurs for a cluster (k), the overflow weights (wt(i)) applied to the replenishment quantity (ρt(i)) of each type of product (i) of the cluster (k) are chosen so as to favor the replenishment of the types of products associated with the largest shortage costs (CS(i)) among the types of products with low inventory level (xt(i)).

8.- A method according to claim 6 or 7, wherein the reward function (Rt(k)) of each cluster (k) is given by the following formula:

Rt(k) = (1 / Nk) Σi∈cluster k rt(i)

where:

• Rt(k) is the reward function for cluster k at time t,

• rt(i) = -Ct(i), rt(i) being the reward function for the type of product i at time t and Ct(i) being the inventory costs for the type of product i at time t, and

• Nk being the number of different types of products of cluster k.

9.- A method according to any one of claims 6 to 8, wherein the inventory costs (Ct(i)) of each type of product (i) at each time (t) are given by the following formula:

Ct(i) = aO · CO(i) · at(i) + aH · CH(i) · xt(i) + aS · CS(i) · βt+1(i)

where:

• Ct(i) is the inventory costs for the type of product i at time t,

• aO, aH, aS ∈ [0,1] with aO + aH + aS = 1, aO, aH, aS being predefined weighting coefficients,

• at(i) is the ordered quantity of the type of product i at time t,

• xt(i) is the inventory level of the type of product i at time t,

• βt+1(i) is the backlog level of the type of product i at time t+1,

• CO(i) are the ordering costs for the type of product i,

• CH(i) are the holding costs for the type of product i, and

• CS(i) are the shortage costs for the type of product i.

10.- A method according to any one of claims 1 to 9, wherein at least one cluster comprises different types of products, the training environment (E) being a multi-agent reinforcement learning environment, each agent corresponding to a different type of product of each cluster.

11.- A method according to any one of claims 1 to 10, wherein, in the training database, for each type of product (i), the demand quantity (δt(i)) at each time (t) and the lead-time (Tt(i)) at each time (t) are stochastic with stationary distributions, the stationary distributions being determined on the basis of real data.

12.- A method according to any one of claims 1 to 11, wherein during the training step, each agent is trained according to a proximal policy optimization algorithm.

13.- A method according to any one of claims 1 to 12, wherein the method comprises a step of ordering at each time (t) the determined ordered quantity (at(i)) for each type of product (i) of the storage space.

14.- A method according to any one of claims 1 to 13, wherein during the operating step, operating data are memorized so as to carry out continual learning on the control model (M), the operating data comprising, for each type of product (i) at each considered time (t), the state (st(i)), the ordered quantity (at(i)) and the updated state (st+1(i)).

15.- A computer program product comprising a readable information carrier having stored thereon a computer program comprising program instructions, the computer program being loadable onto a data processing unit and causing at least the obtaining step, the training step and the operating step of a method according to any one of claims 1 to 14 to be carried out when the computer program is carried out on the data processing unit.

Description:
A method for inventory management

TECHNICAL FIELD OF THE INVENTION

The present invention concerns a method for inventory management of different types of products in a storage space. The present invention also concerns an associated computer program product. The present invention also relates to an associated readable information carrier.

BACKGROUND OF THE INVENTION

Inventory control is one of the major problems in the supply chain industry. The main goal is to ensure the right balance between the supply and demand of products by optimizing replenishment decisions. More precisely, a controller observes the past demands and local information of the inventory and has to decide about the next ordering values. Accurate inventory management is key to running a successful product business with benefits ranging from better inventory accuracy and insights to cost savings and avoidance of shortage and stock overflows.

The main issue in the supply chain is environment uncertainty. In an ideal world with deterministic demand and lead-times, an inventory controller would be able to place a perfect order equal to the demand size at the right time. However, in practice, both the demands and lead-times are stochastic with potentially high volatility, making the inventory management problem hard to solve. In most cases, the inventory controller may order too much or too little. The former case leads to paying unnecessary ordering and holding costs, while the latter results in shortage costs which may jeopardize the company's performance.

Classical methods for solving the inventory management problem rely on basic heuristics due to the complexity of the mathematical modeling of the inventory system. While these approaches are easy to implement, they are not able to capture the randomness of the demand and lead-times.

Another way of solving the inventory management problem is dynamic programming. Despite being efficient, this technique requires exact knowledge of the mathematical model of the inventory system, which becomes intractable for big companies with very large inventories. Inventory management models can quickly become too complex and time-consuming, leading to unworkable models. To escape this curse of dimensionality, one may apply approximate dynamic programming, which performs well in specific settings at the cost of strong assumptions and simplifications.

SUMMARY OF THE INVENTION

There exists a need for a method making it possible to perform the inventory management of products in a more precise way, so as to reduce inventory costs.

To this end, the invention relates to a method for inventory management of different types of products in a storage space, the types of products being classified into M different clusters corresponding to different storage subspaces of the storage space, each cluster comprising at least one type of product, each storage subspace having a maximum storage capacity, each type of product being associated with a state at each time, the state of each type of product being modified at each time by an action consisting in ordering a quantity of said type of product, each ordered quantity being received after a lead-time, each state of a type of product comprising at least an inventory level and a demand quantity, the inventory level being the on-hand inventory of the considered type of product at a given time, the demand quantity being the quantity of the considered type of product in demand, the method comprising the following steps which are computer-implemented:

- obtaining a training database comprising, for each type of product, an initial inventory level, a demand quantity for each time and a lead-time associated to an ordered quantity at said time,

- training a control model for determining at each time an ordered quantity for each type of product depending on the state of each type of product, the control model being trained in a training environment on the basis of the training database so as to maximize a reward function for each cluster, the training environment being a reinforcement learning environment with an agent for each type of product, each reward function being relative to inventory costs, and

- operating the control model in order to determine at each time an ordered quantity for each type of product of the storage space, following the reception by the control model, of the current state of each type of product.

The method according to the invention may comprise one or more of the following features considered alone or in any combination that is technically possible:

- each state of a type of product also comprises a backlog level and a replenishment quantity, the backlog level being an unavailable quantity of the considered type of product when demand exceeds the available inventory level, the replenishment quantity being a quantity of the considered type of product that will be added to the inventory level when the inventory check is done at the next time, the replenishment quantity depending on the previous ordered quantities for said type of product and their corresponding lead-times;

- during the training step and the operating step, the inventory level for each type of product at the next time is computed as a function of the corresponding previous inventory level, of the corresponding previous demand quantity, and of the corresponding previous replenishment quantity, the backlog level for each type of product at the next time being computed as a function of the corresponding previous backlog level, of the corresponding previous inventory level, of the corresponding previous demand quantity and of the corresponding previous replenishment quantity;

- during the training step and the operating step, a temporary inventory level is calculated at each time, for each type of product, as a function of the replenishment quantity at said time and of the inventory level at said time, and, for each cluster, when the sum of the temporary inventory levels of the corresponding types of products is greater than the maximum storage capacity of said cluster, which causes a storage overflow, an overflow weight is applied to the replenishment quantity of each type of product during computation of the inventory level at the next time and during computation of the backlog level at the next time;

- the inventory level and the backlog level for each type of product at the next time are given by the following formulae:

xt+1(i) = max[ xt(i) + [wt(i) · ρt(i)] - δt(i) ; 0 ]

βt+1(i) = βt(i) - min[ xt(i) + [wt(i) · ρt(i)] - δt(i) ; 0 ]

where:

• xt+1(i) is the inventory level of the type of product i at time t+1,
• xt(i) is the inventory level of the type of product i at time t,
• wt(i) is the overflow weight applied to the replenishment quantity ρt(i),
• ρt(i) is the replenishment quantity of the type of product i at time t,
• δt(i) is the demand quantity of the type of product i at time t,
• βt+1(i) is the backlog level of the type of product i at time t+1,
• βt(i) is the backlog level of the type of product i at time t,
• [X] is the integer part of X,
• max[X ; 0] is the maximum between X and 0, and
• min[X ; 0] is the minimum between X and 0;

- at each time, each type of product incurs some inventory costs given by the combination of ordering costs, holding costs and shortage costs for the considered type of product, the ordering costs being the costs associated with the ordering of the considered type of product, the holding costs being the costs associated with storing inventory that remains unused, the shortage costs being the costs incurred when demand exceeds the available inventory for the considered type of product, the reward function of each cluster depending on the inventory costs of each type of product of said cluster;

- when an overflow occurs for a cluster, the overflow weights applied to the replenishment quantity of each type of product of the cluster are chosen so as to favor the replenishment of the types of products associated with the largest shortage costs among the types of products with low inventory level;

- the reward function of each cluster is given by the following formula:

Rt(k) = (1 / Nk) Σi∈cluster k rt(i)

where:

• Rt(k) is the reward function for cluster k at time t,

• rt(i) = -Ct(i), rt(i) being the reward function for the type of product i at time t and Ct(i) being the inventory costs for the type of product i at time t, and

• Nk being the number of different types of products of cluster k;

- the inventory costs of each type of product at each time are given by the following formula:

Ct(i) = aO · CO(i) · at(i) + aH · CH(i) · xt(i) + aS · CS(i) · βt+1(i)

where:

• Ct(i) is the inventory costs for the type of product i at time t,

• aO, aH, aS ∈ [0,1] with aO + aH + aS = 1, aO, aH, aS being predefined weighting coefficients,

• at(i) is the ordered quantity of the type of product i at time t,

• xt(i) is the inventory level of the type of product i at time t,

• βt+1(i) is the backlog level of the type of product i at time t+1,

• CO(i) are the ordering costs for the type of product i,

• CH(i) are the holding costs for the type of product i, and

• CS(i) are the shortage costs for the type of product i;

- at least one cluster comprises different types of products, the training environment being a multi-agent reinforcement learning environment, each agent corresponding to a different type of product of each cluster;

- in the training database, for each type of product, the demand quantity at each time and the lead-time at each time are stochastic with stationary distributions, the stationary distributions being determined on the basis of real data;

- during the training step, each agent is trained according to a proximal policy optimization algorithm;

- the method comprises a step of ordering at each time the determined ordered quantity for each type of product of the storage space;

- during the operating step, operating data are memorized so as to carry out continual learning on the control model, the operating data comprising, for each type of product at each considered time, the state, the ordered quantity and the updated state.

The invention also relates to a computer program product comprising a readable information carrier having stored thereon a computer program comprising program instructions, the computer program being loadable onto a data processing unit and causing at least the obtaining step, the training step and the operating step of a method as previously described to be carried out when the computer program is carried out on the data processing unit.

The invention also relates to a readable information carrier on which is stored a computer program product as previously described.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be easier to understand in view of the following description, provided solely as an example and with reference to the appended drawings in which:

Figure 1 is a schematic view of an example of a computer for implementing a method for inventory management,

Figure 2 is a flowchart of an example of implementation of a method for inventory management, and

Figure 3 is a schematic representation illustrating an example of the interactions between a control model and a training environment during a training step of a method for inventory management.

DETAILED DESCRIPTION OF SOME EMBODIMENTS

An example of a calculator 20 and of a computer program product 22 is illustrated on figure 1.

The calculator 20 is preferably a computer.

More generally, the calculator 20 is a computer or computing system, or similar electronic computing device adapted to manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.

The calculator 20 interacts with the computer program product 22.

As illustrated on figure 1, the calculator 20 comprises a processor 24 comprising a data processing unit 26, memories 28 and a reader 30 for information media. In the example illustrated on figure 1, the calculator 20 also comprises a human machine interface 32, such as a keyboard, and a display 34.

The computer program product 22 comprises an information medium 36.

The information medium 36 is a medium readable by the calculator 20, usually by the data processing unit 26. The readable information medium 36 is a medium suitable for storing electronic instructions and capable of being coupled to a computer system bus.

By way of example, the information medium 36 is a USB key, a floppy disk, an optical disk, a CD-ROM, a magneto-optical disk, a ROM memory, a RAM memory, an EPROM memory, an EEPROM memory, a magnetic card or an optical card.

On the information medium 36 is stored the computer program 22 comprising program instructions.

The computer program 22 is loadable on the data processing unit 26 and is adapted to entail the implementation of a method for inventory management, when the computer program 22 is loaded on the processing unit 26 of the calculator 20.

In a variant, the calculator 20 is in communication with a distant server on which the computer program is stored.

A method for inventory management of different types of products in a storage space will now be described with reference to figures 2 and 3.

A set N = {1, ..., n} of different types of products is considered in this method, where i ∈ N refers to the i-th type of product in the inventory and n is the total number of different types of products.

The different types of products are, for example, industrial components, such as fixation means (bolts, screws, etc.) or replacement parts (hoses). The storage space is, for example, the space defined by shelves in a room.

The different types of products are classified into M different clusters corresponding to different storage subspaces of the storage space (for example, different sections of shelves). A set M = {1, ..., m} of different clusters is considered in this method, where k ∈ M refers to the k-th cluster and m is the total number of clusters.

Each cluster k comprises at least one type of product. The types of products associated with one cluster k share a same storage subspace. When each cluster comprises only one type of product, the number of clusters m is equal to the number n of different types of products.

Preferably, the classification of the different types of products into clusters depends on at least one of the following parameters: the size of the types of products, the weight of the types of products and the specific climatic conditions required by the types of products, such as temperature, pressure, humidity levels, ventilation and light.

Each storage subspace has a maximum storage capacity Λ(k). The maximum storage capacity Λ(k) is the maximum quantity of products corresponding to the cluster k that can be stored in the storage subspace of such a cluster k.

Each type of product i is associated with a state st(i) at each time t. The state st(i) of each type of product i is modified at each time t by an action consisting in ordering a quantity at(i) of said type of product i. Each ordered quantity at(i) is received after a lead-time Tt(i).

Each state st(i) of a type of product i comprises at least an inventory level xt(i) and a demand quantity δt(i).

The inventory level xt(i) is the on-hand inventory of the considered type of product i at time t, i.e., the quantity of products sitting on the shelves in the inventory.

The demand quantity δt(i) is the quantity of the considered type of product i in demand. The demand quantity δt(i) corresponds to a stochastic demand.

Preferably, each state st(i) of a type of product i also comprises a backlog level βt(i) and a replenishment quantity ρt(i).

The backlog level βt(i) is an unavailable quantity of the considered type of product i when demand exceeds the available inventory level.

The replenishment quantity ρt(i) is a quantity of the considered type of product i that will be added to the inventory level when the inventory check is done at the next time t+1. In other words, the replenishment quantity ρt(i) is the associated quantity of products which is on its way to the inventory. The replenishment quantity ρt(i) depends on the previous ordered quantities a1(i), ..., at(i) for said type of product i and their corresponding lead-times T1(i), ..., Tt(i). Indeed, the received quantity of product i at time t corresponds to all the previous orders aj(i) made at a time step j such that the receiving time (j + Tj(i)) obtained by adding the associated lead-time matches the current time step t, i.e.:

ρt(i) = Σj aj(i) · I(j + Tj(i) = t)

where the function I is equal to 1 when j + Tj(i) = t, and is equal to 0 otherwise.

The inventory management method comprises a step 100 of obtaining a training database. The obtaining step 100 is, for example, implemented by the calculator 20 interacting with the computer program product 22, that is to say is computer-implemented.

The training database comprises, for each type of product i:

- an initial inventory level x0(i),

- a demand quantity δt(i) for each time t, and

- a lead-time Tt(i) associated to an ordered quantity at(i) at said time t.

It is noted that the initial backlog level β0(i) is considered to be equal to zero.
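To make the above notations concrete, the following minimal Python sketch shows one possible way of representing the per-product quantities and of computing the replenishment quantity ρt(i) from past orders and their lead-times. All names are illustrative and are not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class ProductState:
    """One possible container for the state s_t^(i) of a type of product i."""
    inventory: int = 0       # x_t^(i), on-hand inventory
    backlog: int = 0         # beta_t^(i), initial backlog is zero
    demand: int = 0          # delta_t^(i)
    replenishment: int = 0   # rho_t^(i)

@dataclass
class ProductHistory:
    """One training-database record for a type of product i."""
    initial_inventory: int                                # x_0^(i)
    demands: list[int] = field(default_factory=list)      # delta_t^(i) for each time t
    lead_times: list[int] = field(default_factory=list)   # T_t^(i) for each time t

def replenishment_at(t: int, orders: list[int], lead_times: list[int]) -> int:
    """rho_t^(i): sum of every past order a_j^(i) whose lead-time makes it arrive at time t."""
    return sum(a_j for j, a_j in enumerate(orders) if j + lead_times[j] == t)
```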

Preferably, the demand quantity δt(i) at each time t and the lead-time Tt(i) at each time t are stochastic with stationary distributions. Preferably, the stationary distributions have been determined on the basis of real data.

In a variant, the demand quantity δt(i) at each time t is stochastic with a variable distribution. In other words, the distribution of the demand quantity δt(i) varies over time. This enables training the control model M with different possible distributions of the demand quantity δt(i), so that the model remains adaptable to changes in the distribution of the demand quantity δt(i) when operating the trained model M.

In a variant or in addition, the lead-time Tt(i) at each time t is stochastic with a variable distribution. In other words, the distribution of the lead-time Tt(i) varies over time. This enables training the control model M with different possible distributions of the lead-time Tt(i), so that the model remains adaptable to changes in the distribution of the lead-time Tt(i) when operating the trained model M.

The inventory management method comprises a step 110 of training a control model M. The training step 110 is, for example, implemented by the calculator 20 interacting with the computer program product 22, that is to say is computer-implemented.

The obtained trained model M is a control model M suitable to be used for inventory management of different types of products in a storage space. The obtained trained model M is preferably specific to the different types of products and to the storage space for which the model M has been trained. Optionally, the control model M extends to categories of types of products having the same features as the types of products for which the model M has been trained and/or to categories of storage space having the same features as the storage space for which the model M has been trained.

The control model M is trained for determining at each time t an ordered quantity at(i) (output) for each type of product i depending on the state st(i) of each type of product i (input).

The control model M to be trained interacts with an environment E according to the principle of deep reinforcement learning. The model M to be trained is, for example, a neural network. More precisely, the control model M is trained in a training environment E on the basis of the training database so as to maximize a reward function Rt(k) for each cluster k. The training environment is a reinforcement learning environment with an agent for each type of product i.

Preferably, when at least one cluster comprises different types of products, the training environment E is a multi-agent reinforcement learning (MARL) environment, each agent corresponding to a different type of product of each cluster.

Preferably, each agent is trained according to a proximal policy optimization algorithm (PPO).
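For reference, the clipped surrogate objective at the core of proximal policy optimization can be written as in the short NumPy sketch below; this is the generic textbook formulation, not an implementation disclosed by the patent.

```python
import numpy as np

def ppo_clipped_loss(prob_ratio: np.ndarray, advantage: np.ndarray, eps: float = 0.2) -> float:
    """Clipped surrogate objective of PPO (returned as a loss to minimize).

    prob_ratio: pi_new(a_t | s_t) / pi_old(a_t | s_t) for a batch of transitions.
    advantage:  estimated advantages A_t for the same transitions.
    """
    unclipped = prob_ratio * advantage
    clipped = np.clip(prob_ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -float(np.mean(np.minimum(unclipped, clipped)))
```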

As illustrated on figure 3, during training, the model M to be trained determines actions in response to the states generated by the training environment E for each type of product i of the storage space. The actions at(i) generated by the model M are processed by the environment E. The training environment E calculates a reward function Rt(k) for each cluster k and generates the resulting next states st+1(i). The model M is then trained on the basis of data comprising the states st(i), the determined actions at(i) and the corresponding rewards Rt(k). The training of the model M is, for example, carried out according to a Q-learning algorithm (value-based) or a policy-based algorithm. The training is, for example, done on an ongoing basis (for each time step) or later.
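The interaction loop between the agents and the training environment E can be sketched as follows; `env` and `agents` are hypothetical placeholders with a Gym-like interface, assumed here for illustration only.

```python
def rollout(env, agents, n_steps: int):
    """Illustrative interaction loop: each agent i observes its state s_t^(i),
    proposes an order a_t^(i), and the environment returns the next states
    s_{t+1}^(i) and the per-cluster rewards R_t^(k) used to train the agents."""
    states = env.reset()                                              # {i: s_0^(i)}
    for _ in range(n_steps):
        actions = {i: agents[i].act(s) for i, s in states.items()}    # a_t^(i)
        next_states, cluster_rewards = env.step(actions)              # s_{t+1}^(i), R_t^(k)
        for i, agent in agents.items():
            reward = cluster_rewards[env.cluster_of(i)]               # reward shared within the cluster (illustrative accessor)
            agent.store(states[i], actions[i], reward, next_states[i])
        states = next_states
    # the policy update itself (e.g. a PPO step) may be run every few steps or once per rollout
```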

The reward function Rt(k) for each cluster k is relative to inventory costs.

More precisely, at each time t, each type of product i incurs some inventory costs Ct(i). The reward function Rt(k) of each cluster k depends on the inventory costs Ct(i) of each type of product i of said cluster k.

Preferably, the inventory costs Ct(i) of each type of product i are given by the combination of:

- ordering costs CO(i) for the considered type of product i. The ordering costs CO(i) are the costs associated with the ordering of the considered type of product i. For example, in the case of an ordering setting, these costs comprise the functioning costs, reception and testing costs, the salaries of the personnel, information systems costs and customs costs. In the case of a production setting, these costs comprise the raw material costs, labor costs, and fixed and variable overheads, e.g. the rent of a factory or the energy consumption allocated for production.

- holding costs CH(i) for the considered type of product i. The holding costs CH(i) are the costs associated with storing inventory that remains unused. The holding costs CH(i) may be divided into two categories: financial and functional costs. The former represents the financial interest of the money invested in procuring the stocked products. The latter comprises the rent and maintenance of the required space, insurance costs, equipment costs, inter-warehouse transportation and obsolescence costs.

- shortage costs CS(i) for the considered type of product i. The shortage costs CS(i) are the costs incurred when demand exceeds the available inventory for the considered type of product i. For example, the related costs fall into one of the following two categories: lost sales costs and backlogging costs. When considering lost sales, the unsatisfied demands are completely lost, whereas with backlogging costs there is a penalty shortage cost. Note that in the case where the stock is internal, an inventory shortage will halt production and therefore induce all the consequent costs.

Preferably, the inventory costs Ct(i) of each type of product i at each time t are given by the following formula:

Ct(i) = aO · CO(i) · at(i) + aH · CH(i) · xt(i) + aS · CS(i) · βt+1(i)

where:

• Ct(i) is the inventory costs for the type of product i at time t,

• aO, aH, aS ∈ [0,1] with aO + aH + aS = 1, aO, aH, aS being predefined weighting coefficients (translating some expert's knowledge about the desired training strategy),

• at(i) is the ordered quantity of the type of product i at time t,

• xt(i) is the inventory level of the type of product i at time t,

• βt+1(i) is the backlog level of the type of product i at time t+1,

• CO(i) are the ordering costs for the type of product i,

• CH(i) are the holding costs for the type of product i, and

• CS(i) are the shortage costs for the type of product i.

The corresponding reward function rt(i) for each type of product i is the opposite of the corresponding inventory costs Ct(i), given by the following formula:

rt(i) = -Ct(i)

Preferably, the reward function Rt(k) of each cluster k is given by the following formula:

Rt(k) = (1 / Nk) Σi∈cluster k rt(i)

where:

• Rt(k) is the reward function for cluster k at time t,

• rt(i) being the reward function for the type of product i at time t and Ct(i) being the inventory costs for the type of product i at time t, and

• Nk being the number of different types of products of cluster k.

Hence, for the cluster(s) comprising only one type of product i, the reward function Rt(k) of the cluster corresponds to the reward function rt(i) for the only type of product i of the cluster.
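As an illustration of these computations, the sketch below combines the three cost terms into the per-item inventory cost and averages the per-item rewards over a cluster; the function names and the equal default weights are illustrative choices.

```python
def inventory_cost(a_t: float, x_t: float, beta_next: float,
                   c_order: float, c_hold: float, c_short: float,
                   w_order: float = 1/3, w_hold: float = 1/3, w_short: float = 1/3) -> float:
    """C_t^(i): weighted combination of ordering, holding and shortage costs
    for one type of product at time t (the weights sum to 1)."""
    return w_order * c_order * a_t + w_hold * c_hold * x_t + w_short * c_short * beta_next

def cluster_reward(item_costs: list[float]) -> float:
    """R_t^(k): average of the per-item rewards r_t^(i) = -C_t^(i) over the N_k items of cluster k."""
    return sum(-c for c in item_costs) / len(item_costs)
```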

Considering the above reward functions, it appears clearly that the goal is to train each agent to order low quantities of products that are costly to stock and supply, while maintaining sufficient inventory levels to avoid stock-outs for critical products associated to large shortage costs.

Preferably, the inventory level xt+1(i) for each type of product i at the next time t+1 is computed as a function of the corresponding previous inventory level xt(i), of the corresponding previous demand quantity δt(i), and of the corresponding previous replenishment quantity ρt(i).

Preferably, the backlog level βt+1(i) for each type of product i at the next time t+1 is computed as a function of the corresponding previous backlog level βt(i), of the corresponding previous inventory level xt(i), of the corresponding previous demand quantity δt(i) and of the corresponding previous replenishment quantity ρt(i).

More precisely, in a preferred embodiment, a temporary inventory level λt(i) is calculated at each time t, for each type of product i, as a function of the replenishment quantity ρt(i) at said time t and of the inventory level xt(i) at said time t. Preferably, the temporary inventory level λt(i) at time t is the sum of the replenishment quantity ρt(i) at said time t and of the inventory level xt(i) at said time t.

In this preferred embodiment, for each cluster k, when the sum of the temporary inventory levels λt(i) of all the corresponding types of products of the cluster k is greater than the maximum storage capacity Λ(k) of said cluster k, this incurs a storage overflow. An overflow weight wt(i) is then applied to the replenishment quantity ρt(i) of each type of product i during computation of the inventory level xt+1(i) at the next time t+1 and during computation of the backlog level βt+1(i) at the next time t+1. Preferably, the replenishment quantities ρt(i) are weighted so as to completely fill the available storage space, i.e., the overflow weights are chosen such that:

Σi=1,...,Nk wt(i) · ρt(i) = Λ(k) - Σi=1,...,Nk xt(i)

where:

• wt(i) is an overflow weight for the type of product i at time t,

• ρt(i) is the replenishment quantity of the type of product i at time t,

• Nk is the number of different types of products of cluster k,

• Λ(k) is the maximum storage capacity for cluster k, and

• xt(i) is the inventory level of the type of product i at time t.

It should be noted that all the types of products of a cluster are supposed to have the same volume. However, it is also possible to take into account the specific volume of each type of product i. In this case, each quantity of the type of product i in the above formula is multiplied by v(i), the volume of the type of product i.

In this preferred embodiment, the inventory level xt+1(i) and the backlog level βt+1(i) for each type of product i at the next time t+1 are given by the following formulae:

xt+1(i) = max[ xt(i) + [wt(i) · ρt(i)] - δt(i) ; 0 ]

βt+1(i) = βt(i) - min[ xt(i) + [wt(i) · ρt(i)] - δt(i) ; 0 ]

where:

• xt+1(i) is the inventory level of the type of product i at time t+1,

• xt(i) is the inventory level of the type of product i at time t,

• wt(i) is the overflow weight applied to the replenishment quantity ρt(i),

• ρt(i) is the replenishment quantity of the type of product i at time t,

• δt(i) is the demand quantity of the type of product i at time t,

• βt+1(i) is the backlog level of the type of product i at time t+1,

• βt(i) is the backlog level of the type of product i at time t,

• [X] is the integer part of X,

• max[X ; 0] is the maximum between X and 0, and

• min[X ; 0] is the minimum between X and 0.

Preferably, when an overflow occurs for a cluster k, the overflow weights wt(i) applied to the replenishment quantity of each type of product i of the cluster k are chosen so as to favor the replenishment of the types of products associated with the largest shortage costs among the types of products with low inventory level.

The inventory management method comprises a step 120 of operating the control model M in order to determine at each time t an ordered quantity at(i) for each type of product i of the storage space, following the reception by the control model M of the current state st(i) of each type of product i. The operating step 120 is, for example, implemented by the calculator 20 interacting with the computer program product 22, that is to say is computer-implemented. During the operating step, the inventory level xt+1(i) and the backlog level βt+1(i) for each type of product i at the next time t+1 are computed in the same way as during the training step.
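A minimal sketch of this transition for one cluster is given below, following the update equations above; the uniform overflow weighting used here is a deliberate simplification, whereas the description prefers weights favoring the items with the largest shortage costs among those with low inventory levels.

```python
import math

def step_cluster(x, beta, rho, delta, capacity):
    """One transition for all items of a cluster (inputs are lists indexed by item).
    Applies overflow weights when the temporary levels exceed the cluster capacity,
    then updates the inventory and backlog levels."""
    n = len(x)
    temp = [x[i] + rho[i] for i in range(n)]       # temporary levels lambda_t^(i)
    if sum(temp) > capacity:                       # storage overflow
        free = max(capacity - sum(x), 0)
        total_rho = sum(rho)
        # illustrative choice: scale every replenishment uniformly so that the weighted
        # replenishments exactly fill the remaining space (the description instead favors
        # items with large shortage costs and low inventory)
        w = [free / total_rho if total_rho > 0 else 0.0] * n
    else:
        w = [1.0] * n
    x_next, beta_next = [], []
    for i in range(n):
        net = x[i] + math.floor(w[i] * rho[i]) - delta[i]
        x_next.append(max(net, 0))
        beta_next.append(beta[i] - min(net, 0))
    return x_next, beta_next
```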

Preferably, during the operating step, operating data are memorized so as to carry out continual learning on the control model M. The operating data comprise, for each type of product i at each considered time t, the state st(i), the ordered quantity at(i) and the updated state st+1(i).
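The memorized operating data can be kept in a structure as simple as the sketch below (illustrative names); each stored transition can later feed a continual-learning update of the control model.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """Operating data memorized for one type of product at one time step."""
    state: dict        # s_t^(i)
    order: int         # a_t^(i)
    next_state: dict   # s_{t+1}^(i)

operating_log: dict[int, list[Transition]] = {}   # one list of transitions per type of product i

def memorize(item: int, state: dict, order: int, next_state: dict) -> None:
    operating_log.setdefault(item, []).append(Transition(state, order, next_state))
```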

The inventory management method comprises a step 130 of ordering, at each time t, the determined ordered quantity at(i) for each type of product i of the storage space. The ordering step 130 is, for example, implemented by the calculator 20 interacting with the computer program product 22, in an automatic way. In a variant, the ordering step is carried out with the intervention of an operator.

Hence, the inventory management method uses reinforcement learning to address the inventory management problem with stochastic demands and lead-times for real-world applications, such as a multiproduct supply chain on a production line. This makes it possible to perform the inventory management of products in a more precise way and to reduce the inventory costs.

In an embodiment, the method also addresses through clusters the joint replenishment problem, i.e. when one considers the inter-dependency among different groups of products in a same order provided by a single supplier. The objective is to optimize a global replenishment cost composed of inventory ordering and holding costs but also to avoid stock-outs of items.

The person skilled in the art will understand that the embodiments and variants described above can be combined to form new embodiments provided that they are technically compatible.

EXPERIMENTAL EXAMPLES OF IMPLEMENTATION OF THE METHOD

This section is dedicated to numerical experiments on real data. We shall consider various scenarios regarding the capacity constraints of the warehouses: (i) when the items in a product cluster have their own capacity constraints then it is enough to train a single reinforcement learning (RL) agent and apply the resulting behavior to all the items in that cluster; (ii) when the items compete for storage space then we may apply some MARL algorithm to deal with the interdependency of items.

Stochastic Model

The demands and lead-times are assumed to be stochastic with stationary distributions. The lead-times follow geometric distributions and the demands follow a mixed law of a Poisson process with zero plateaus. More precisely, for each item i, the lead-time is given by T(i) ~ G(pi) with parameter pi ∈ (0,1) and the demand distribution is given by:

P(δt(i) = 0) = 1 - bi + bi e^(-μi) and P(δt(i) = j) = bi e^(-μi) μi^j / j! for j ≥ 1.

In other words, the demand process of item i is equal to δt(i) = Xi · Yi with Xi ~ B(bi) and Yi ~ P(μi), with B the function corresponding to the Bernoulli law and P the function corresponding to the Poisson law.
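For illustration, these two distributions can be sampled with standard NumPy generators as follows; the parameter names mirror the notations above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lead_time(p: float) -> int:
    """T^(i) ~ G(p): geometric lead-time with parameter p in (0, 1)."""
    return int(rng.geometric(p))

def sample_demand(b: float, mu: float) -> int:
    """delta_t^(i) = X * Y with X ~ Bernoulli(b) and Y ~ Poisson(mu),
    i.e. a Poisson demand interrupted by zero plateaus."""
    return int(rng.binomial(1, b) * rng.poisson(mu))
```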

Real data

The data comes from a company with warehouses in different countries. The data is used to find the model parameters of the stochastic distributions using maximum likelihood estimators. Consider a fixed item i ∈ N with a historical data set of demands δ1(i), ..., δN(i). The model parameters bi and μi of the demand are estimated using maximum likelihood estimators defined as empirical averages, where the indicator function I(δt(i) > 0) is equal to 1 when δt(i) > 0 and is equal to 0 otherwise.

Similarly, using a data history of lead-times T1(i), ..., TN(i), the model parameter pi of the lead-time is estimated using the corresponding maximum likelihood estimator.
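A simple way of fitting these parameters from historical data is sketched below with frequency and moment style estimators; these are illustrative approximations and may differ from the exact maximum likelihood estimators used in the study.

```python
import numpy as np

def fit_demand(demands: np.ndarray) -> tuple[float, float]:
    """Estimate (b, mu) for the Bernoulli-Poisson demand model: b from the
    frequency of non-zero demands, mu from the mean demand given a demand occurs."""
    nonzero = demands > 0
    b_hat = float(np.mean(nonzero))
    mu_hat = float(np.mean(demands[nonzero])) if nonzero.any() else 0.0
    return b_hat, mu_hat

def fit_lead_time(lead_times: np.ndarray) -> float:
    """Estimate the geometric parameter p from the empirical mean lead-time."""
    return float(1.0 / np.mean(lead_times))
```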

For the case study, the total number of products is equal to n = 50 and the time step is one month. The history ranges from 2002 to 2021 over different warehouses, leading to a maximum number of historical data points equal to Nmax = 420. The horizon is set to T = 240 months and the methods are tested over 100 replications (independent random seeds).

Agent battle

In order to evaluate the performance of the developed approach, different baselines are implemented along with the reinforcement learning methods.

• MinMax agents: a standard min-max (s,S) strategy from operations research. For these agents, each item i has a safety stock ss(i) = Φ^(-1)(α) · σ(i), where Φ is the cumulative distribution function of the standard normal law N(0,1), α ∈ {0.90, 0.95, 0.99} represents the target service level and σ(i)² is the variance of the demand of item i over its lead-time, computed from the means and variances of the demands and lead-times. The value of the service level is set to the classical value α = 0.90 and the means (E) and variances (Var) are computed using the MLE estimators and the closed-form formulas: E[T(i)] = 1/pi, Var[T(i)] = (1 - pi)/pi², E[δ(i)] = bi μi, Var[δ(i)] = bi μi + bi (1 - bi) μi².

As soon as the inventory level goes below the safety stock, the controller orders the maximum capacity of the corresponding item. Note that such an approach is easy to implement but may fail to anticipate spikes in the demand signal.

• Oracle agents: such agents implement the following heuristic: given the knowledge of the mean μδ and variance σδ² of the demand signal, it is natural to order, at each time step, according to a random law with mean μδ and variance σδ². More precisely, the oracle agents have access to the estimated mean and standard deviation of their associated products and order at each time according to a normal law which is clamped to fit the bounds of the action space. (A sketch of these two heuristic baselines is given after the list below.)

• MARL agents: The reinforcement learning agents are trained using Proximal Policy Optimization (PPO) algorithms which are stable and effective policy gradient methods. When working with capacity constraints per item, both discrete and continuous policies are considered, denoted by PPO-D and PPO-C respectively. When the items compete for storage space, we implement IPPO which is an independent version of PPO for multi-agent frameworks.
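Referring back to the two heuristic baselines above, both can be sketched as follows; the safety-stock expression is the classical textbook formula and all names are illustrative.

```python
import math
from statistics import NormalDist
import numpy as np

def minmax_safety_stock(mean_d: float, var_d: float, mean_lt: float, var_lt: float,
                        service_level: float = 0.90) -> float:
    """Classical safety stock for the MinMax (s, S) baseline: normal quantile of the
    target service level times the standard deviation of the demand over the lead-time."""
    z = NormalDist().inv_cdf(service_level)
    return z * math.sqrt(mean_lt * var_d + mean_d ** 2 * var_lt)

def oracle_order(mean_d: float, std_d: float, a_min: int, a_max: int,
                 rng=np.random.default_rng()) -> int:
    """Oracle baseline: order according to a normal law with the estimated demand mean
    and standard deviation, clamped to the bounds of the action space."""
    return int(np.clip(round(rng.normal(mean_d, std_d)), a_min, a_max))
```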

Results

Based on the idea that similar items within a cluster share some intrinsic features, we first raise the question of generalization for a single agent. For that matter, we consider a first scenario with a cluster of 5 items with their own capacity constraints and a single RL agent trained on an average item, i.e., we train a "virtual" item whose features are given by taking the mean of the features of all the items in that cluster. Then we test the learned behavior on the 5 items of the cluster and report the different cumulative costs. Table 1 hereinafter presents the average cumulative costs in $ obtained over 100 replications of the different methods for the 5 items over a horizon of T = 240 months; for example, for item ID = 4:

| Item ID | MinMax | Oracle | PPO-D | PPO-C |
| 4 | 79,235,272 | 16,976,630 | 8,801,628 | 4,808,191 |

Regarding the average cumulative costs, the clear winners are the RL-based methods. Indeed, the PPO methods are statistically better than the two other baselines, with a cost reduction factor ranging from 8 (for item ID=1) to 16 (for item ID=4) compared to the standard (s,S) strategy. Interestingly, among the PPO methods, the one with continuous actions presents the best performance.

Another important result concerns the general behavior of the different methods regarding the number of item shortages. Table 2 hereinafter presents the average item shortages obtained over 100 replications of the different methods for the 5 items over a horizon of T = 240 months. Observe that, also for this criterion, the RL-based methods outperform the two standard baselines. Furthermore, the PPO agents learned an optimal strategy in terms of stock-outs, since their number of item shortages is always equal to zero.

Another scenario involving a storage capacity constraint is considered. In this case, we consider three different clusters N1, N2 and N3 composed of 5, 10 and 20 items respectively. Similarly to the single-agent case, we compare the different methods based on two criteria, namely the average cumulative cost over a cluster and the average item shortages over a cluster, over a horizon of T = 240 months.

The table hereinafter shows the average cumulative cost over the different items, obtained over 100 replications and a horizon of T = 240 months.

The table hereinafter shows the average stock-outs over the different items, obtained over 100 replications and a horizon of T = 240 months. Once again, the MARL-based agents present the best performance, both in terms of cost savings and avoidance of shortages. Observe that for cluster N1, the MARL agents allow an overall cost reduction of 75% compared to the standard (s,S) strategy, 85% for cluster N2 and about 78% for cluster N3.