Inventory management covers controlling and overseeing purchases from suppliers as well as by customers, maintaining the storage of stock, controlling the amount of product available for sale, and fulfilling orders. A decision maker (the learning agent) observes the stochastic demand together with local inventory information, such as the inventory level, as its input, and chooses the next order quantity as its action. Because on-hand inventory (the stock currently available), unmet demand (backorders), and the act of placing an order all incur costs, the optimization problem is to minimize the overall cumulative cost.
The objective is therefore to minimize the long-run cost, which plays the role of the cumulative reward in the learning formulation and consists of a linear holding cost, a linear backorder penalty, and a fixed ordering cost. Most inventory management policies rely on basic heuristics, which cannot always account for the complexity of the system and the stochasticity of demand.
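As a minimal sketch of the cost structure described above, the per-period cost could be written as follows. The function name, the parameter names `h`, `p`, and `K`, and their default values are illustrative assumptions, not values taken from the text; the net inventory level is assumed to be negative when demand is backordered.

```python
def period_cost(inventory_level, order_qty, h=1.0, p=4.0, K=10.0):
    """Cost of a single period (illustrative sketch).

    inventory_level: net inventory after demand; negative means backorders.
    order_qty:       quantity ordered in this period.
    h, p, K:         assumed unit holding cost, unit backorder penalty,
                     and fixed cost per order (hypothetical values).
    """
    holding = h * max(inventory_level, 0)    # linear in on-hand stock
    penalty = p * max(-inventory_level, 0)   # linear in unmet demand
    ordering = K if order_qty > 0 else 0.0   # fixed, charged only when ordering
    return holding + penalty + ordering
```

Summing this quantity over periods gives the cumulative cost that the agent tries to keep small; a learning agent would typically receive its negative as the reward, although the exact reward definition is an assumption here.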
Heuristic policies can fail in two ways: over-ordering, which incurs unnecessary holding costs, and under-ordering, which leaves demand unsatisfied.
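The two failure modes can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not stated in the text: a fixed order quantity every period, orders that arrive immediately (zero lead time), full backordering of unmet demand, and hypothetical cost parameters `h`, `p`, and `K`. The demand sequence is made up for the example.

```python
def simulate_constant_order(order_qty, demands, h=1.0, p=4.0, K=10.0):
    """Total cost of ordering the same quantity in every period.

    Assumes zero lead time and full backordering; h, p, K are the
    hypothetical holding, backorder, and fixed ordering costs.
    """
    level = 0      # net inventory; negative values are backorders
    total = 0.0
    for demand in demands:
        level += order_qty               # order arrives immediately (assumption)
        level -= demand                  # demand is served or backordered
        total += h * max(level, 0)       # holding cost on leftover stock
        total += p * max(-level, 0)      # penalty on unmet demand
        total += K if order_qty > 0 else 0.0  # fixed cost of placing an order
    return total


demands = [4, 6, 5, 5]                            # made-up demand sequence
over = simulate_constant_order(8, demands)        # orders too much
under = simulate_constant_order(2, demands)       # orders too little
matched = simulate_constant_order(5, demands)     # roughly matches mean demand
```

With these illustrative numbers, both the over-ordering and the under-ordering policy cost more than the policy that roughly matches average demand: stock piles up in the first case and backorders accumulate in the second.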