Defining a reward function

One of the most important steps in reinforcement learning is the definition of the reward function. This example shows how to define a custom reward function in StableRLS.

[1]:
# this contains the environment class
import stablerls.gymFMU as gymFMU
# this will read our config file
import stablerls.configreader as cfg_reader

import numpy as np
import logging
[2]:
class my_env(gymFMU.StableRLS):
    def get_reward(self, action, observation):
        """This is my custom reward function"""
        info = {}
        reward = observation**2
        terminated = False
        truncated = False
        return reward, terminated, truncated, info
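
The reward above is the squared observation, so it has the same array shape as the observation (for example `[9.]`). Gymnasium documents the reward returned by `step` as a float, so it can be convenient to reduce the array to a scalar. The following is a minimal standalone sketch, not part of StableRLS: the function name `reward_with_limit` and the `limit` parameter are hypothetical, and it only mirrors the return signature of `get_reward` shown above.

```python
import numpy as np


def reward_with_limit(observation, limit=100.0):
    """Hypothetical helper mirroring the return values of get_reward."""
    obs = np.asarray(observation, dtype=float)

    # Reduce the squared observation to a plain Python float.
    reward = float(np.sum(obs**2))

    # Assumption for illustration: end the episode once any
    # observation entry exceeds the (hypothetical) limit.
    terminated = bool(np.any(np.abs(obs) > limit))
    truncated = False
    info = {"max_abs_observation": float(np.max(np.abs(obs)))}
    return reward, terminated, truncated, info


reward, terminated, truncated, info = reward_with_limit(np.array([3.0]))
print(reward, terminated)
```

Inside a subclass, the same logic could be placed in the body of `get_reward(self, action, observation)`.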

For simplicity, the compiled FMU models for Linux and Windows are already included. If you own MATLAB, you can instead compile the *.slx models yourself; in that case, keep the default FMU_path in the config file. Otherwise, change FMU_path to 00-Simulink_Windows.fmu or 00-Simulink_Linux.fmu, depending on your operating system.

[3]:
# First, read the config file
config = cfg_reader.configreader('00-config.cfg')

# Optionally compile the Simulink model (set the condition below to True).
# MATLAB and the MATLAB Engine for Python are required!
if False:
    import stablerls.createFMU as createFMU
    createFMU.createFMU(config,'SimulinkExample00.slx')

The FMU is available now and the default options of the StableRLS gymnasium environment are sufficient to run the first simulation.

[4]:
# create instance of the model
env = my_env(config)

# default reset call before the simulation starts
obs = env.reset()

# initial action; it is doubled after every step
action = np.array([1,2,3,4])

terminated = False
truncated = False
while not (terminated or truncated):

    observation, reward, terminated, truncated, info = env.step(action)
    print(f'Action: {action}\nObservation: {observation}\nReward: {reward}\n')
    action = action * 2

env.close()
Action: [1 2 3 4]
Observation: [3.]
Reward: [9.]

Action: [2 4 6 8]
Observation: [6.]
Reward: [36.]

If the reward should depend on previous results, you can access the inputs and outputs through env.inputs and env.outputs (self.inputs and self.outputs inside the class) for more complex reward calculations.
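
As a sketch of such a history-based reward, the function below penalizes both the distance to a target and the change since the previous step. It is an illustration under stated assumptions: the name `history_reward` and the parameters `target` and `change_penalty` are hypothetical, and the history is assumed to be array-like with one row per time step. The actual layout of `self.outputs` in StableRLS may differ, so check it before adapting this.

```python
import numpy as np


def history_reward(output_history, target=0.0, change_penalty=0.1):
    """Hypothetical reward using the latest and the previous output."""
    # Assumed layout: one row per time step, one column per output.
    history = np.asarray(output_history, dtype=float)
    history = history.reshape(len(output_history), -1)

    latest = history[-1]
    # Penalize the squared distance of the latest output to the target.
    reward = -float(np.sum((latest - target) ** 2))

    # With at least two steps available, also penalize abrupt changes.
    if history.shape[0] >= 2:
        previous = history[-2]
        reward -= change_penalty * float(np.sum((latest - previous) ** 2))
    return reward


# Outputs of the two steps printed above, used here as example data
print(history_reward([[3.0], [6.0]]))
```

Inside `get_reward`, a call such as `history_reward(self.outputs)` would only work if `self.outputs` matches the layout assumed here.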