Professional Usage¶
The previous page demonstrated how to run an algorithm directly by calling the runner. To help users better understand the internal implementation of “XuanCe”, and to support further algorithm development and the implementation of their own reinforcement learning tasks, this section takes training the PPO algorithm on a MuJoCo environment task as an example and gives a detailed introduction to calling the low-level APIs to train a reinforcement learning model.
Step 1: Create config file¶
A config file should contain the necessary arguments of a PPO agent, and should be a YAML file. Here we show a config file named “mujoco.yaml” for the MuJoCo environments in gym.
dl_toolbox: "torch" # The deep learning toolbox. Choices: "torch", "mindspore", "tensorflow"
project_name: "XuanCe_Benchmark"
logger: "tensorboard" # Choices: tensorboard, wandb.
wandb_user_name: "your_user_name"
render: False
render_mode: 'rgb_array' # Choices: 'human', 'rgb_array'.
test_mode: False
device: "cuda:0"
agent: "PPO_Clip" # choice: PPO_Clip, PPO_KL
env_name: "MuJoCo"
vectorize: "Dummy_Gym"
runner: "DRL"
representation_hidden_size: [256,]
actor_hidden_size: [256,]
critic_hidden_size: [256,]
activation: "LeakyReLU"
seed: 79811
parallels: 16
running_steps: 1000000
n_steps: 256
n_epoch: 16
n_minibatch: 8
learning_rate: 0.0004
use_grad_clip: True
vf_coef: 0.25
ent_coef: 0.0
target_kl: 0.001 # for PPO_KL agent
clip_range: 0.2 # for PPO_Clip agent
clip_grad_norm: 0.5
gamma: 0.99
use_gae: True
gae_lambda: 0.95
use_advnorm: True
use_obsnorm: True
use_rewnorm: True
obsnorm_range: 5
rewnorm_range: 5
test_steps: 10000
eval_interval: 5000
test_episode: 5
log_dir: "./logs/ppo/"
model_dir: "./models/ppo/"
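Each key in this file becomes an attribute of the arguments object that XuanCe builds in Step 2. As a rough illustration of that mapping (a minimal sketch only; in practice XuanCe's get_arguments() performs the loading and merging, as described below), such a YAML file could be loaded like this:

import yaml
from types import SimpleNamespace

# Minimal sketch: read the Step 1 config file into an attribute-style
# namespace. XuanCe's get_arguments() does this (and more) internally.
with open("mujoco.yaml") as f:
    config = yaml.safe_load(f)  # parse the YAML keys into a Python dict

args = SimpleNamespace(**config)  # expose each key as an attribute
print(args.agent, args.learning_rate)  # -> PPO_Clip 0.0004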
Step 2: Get the attributes of the example¶
This section mainly includes parameter reading, environment creation, model creation, and model training. First, create a ppo_mujoco.py file. The code writing process can be divided into the following steps:
Step 2.1 Get the hyper-parameters from the console command

Define the following function parse_args(), which uses the Python package argparse to read the command-line instructions and obtain the instruction parameters.
import argparse


def parse_args():
    parser = argparse.ArgumentParser("Example of XuanCe.")
    parser.add_argument("--method", type=str, default="ppo")
    parser.add_argument("--env", type=str, default="mujoco")
    parser.add_argument("--env-id", type=str, default="InvertedPendulum-v4")
    parser.add_argument("--test", type=int, default=0)
    parser.add_argument("--device", type=str, default="cuda:0")
    parser.add_argument("--benchmark", type=int, default=1)
    parser.add_argument("--config", type=str, default="./ppo_mujoco_config.yaml")

    return parser.parse_args()
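With these defaults in place, the script can later be launched without any arguments, or individual options can be overridden on the command line, for example:

python ppo_mujoco.py --env-id HalfCheetah-v4 --device cpu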
Step 2.2 Get all attributes of the example

First, the parse_args() function from Step 2.1 is called to read the command-line arguments, and then the configuration parameters from Step 1 are obtained.
from xuance import get_arguments

if __name__ == "__main__":
    parser = parse_args()
    args = get_arguments(method=parser.method,
                         env=parser.env,
                         env_id=parser.env_id,
                         config_path=parser.config,
                         parser_args=parser)
    run(args)
In this step, the get_arguments() function from “XuanCe” is called. This function first searches for readable default parameters in the xuance/configs/ directory, based on the combination of the env and env_id variables; if default parameters exist there, they are all read in. Next, it locates the configuration file from Step 1 via the config_path argument and reads all the parameters from that .yaml file. Finally, it reads all the parameters from the parser. If the same variable name appears in more than one of these three reading passes, the later value overwrites the earlier one. Ultimately, the get_arguments() function returns the args variable, which contains all the parameter information and is passed to the run() function.
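The overriding rule can be pictured with a small sketch (illustrative only; merge_arguments() is a hypothetical helper, not XuanCe's actual implementation):

from types import SimpleNamespace

def merge_arguments(defaults: dict, yaml_config: dict, parser_args: dict):
    # Illustrative three-stage merge: later sources win over earlier ones.
    merged = dict(defaults)       # 1. built-in defaults from xuance/configs/
    merged.update(yaml_config)    # 2. values from the Step 1 .yaml file
    merged.update(parser_args)    # 3. command-line values are read last
    return SimpleNamespace(**merged)

args = merge_arguments({"device": "cpu", "seed": 1},
                       {"device": "cuda:0", "seed": 79811},
                       {"device": "cuda:0"})
print(args.device, args.seed)  # -> cuda:0 79811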
Step 3: Define run(), create and run the model¶
Define the run() function, taking as input the args variable obtained in Step 2. In this function, the environments are created, and modules such as the representation, policy, and agent are instantiated to perform the training.
Here is an example definition of the run() function with comments:
import os
from copy import deepcopy
import numpy as np
import torch.optim

from xuance.common import space2shape
from xuance.environment import make_envs
from xuance.torch.utils.operations import set_seed
from xuance.torch.utils import ActivationFunctions


def run(args):
    agent_name = args.agent  # get the name of the agent.
    set_seed(args.seed)  # set the random seed.

    # prepare directories for results
    args.model_dir = os.path.join(os.getcwd(), args.model_dir, args.env_id)  # the path for the saved model.
    args.log_dir = os.path.join(args.log_dir, args.env_id)  # the path for the logger files.

    # build environments
    envs = make_envs(args)  # create the simulation environments
    args.observation_space = envs.observation_space  # get the observation space
    args.action_space = envs.action_space  # get the action space
    n_envs = envs.num_envs  # get the number of vectorized environments.

    # prepare representation
    from xuance.torch.representations import Basic_MLP
    representation = Basic_MLP(input_shape=space2shape(args.observation_space),
                               hidden_sizes=args.representation_hidden_size,
                               normalize=None,
                               initialize=torch.nn.init.orthogonal_,
                               activation=ActivationFunctions[args.activation],
                               device=args.device)  # create the representation

    # prepare policy
    from xuance.torch.policies import Gaussian_AC_Policy
    policy = Gaussian_AC_Policy(action_space=args.action_space,
                                representation=representation,
                                actor_hidden_size=args.actor_hidden_size,
                                critic_hidden_size=args.critic_hidden_size,
                                normalize=None,
                                initialize=torch.nn.init.orthogonal_,
                                activation=ActivationFunctions[args.activation],
                                device=args.device)  # create the Gaussian policy

    # prepare agent
    from xuance.torch.agents import PPOCLIP_Agent, get_total_iters
    optimizer = torch.optim.Adam(policy.parameters(), args.learning_rate, eps=1e-5)  # create the optimizer
    lr_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.0,
                                                     total_iters=get_total_iters(agent_name, args))  # for learning rate decay
    agent = PPOCLIP_Agent(config=args,
                          envs=envs,
                          policy=policy,
                          optimizer=optimizer,
                          scheduler=lr_scheduler,
                          device=args.device)  # create a PPO agent

    # start running
    envs.reset()  # reset the environments
    if args.benchmark:  # run benchmark
        def env_fn():  # for creating the testing environments
            args_test = deepcopy(args)
            args_test.parallels = args_test.test_episode  # set the number of testing environments.
            return make_envs(args_test)  # make the testing environments.

        train_steps = args.running_steps // n_envs  # calculate the total number of training iterations.
        eval_interval = args.eval_interval // n_envs  # calculate the number of training iterations per epoch.
        test_episode = args.test_episode  # get the number of testing episodes.
        num_epoch = int(train_steps / eval_interval)  # calculate the number of epochs.
        test_scores = agent.test(env_fn, test_episode)  # first test
        best_scores_info = {"mean": np.mean(test_scores),  # average episode score.
                            "std": np.std(test_scores),  # the standard deviation of the episode scores.
                            "step": agent.current_step}  # current step
        for i_epoch in range(num_epoch):  # begin benchmarking
            print("Epoch: %d/%d:" % (i_epoch, num_epoch))
            agent.train(eval_interval)  # train the model for some steps
            test_scores = agent.test(env_fn, test_episode)  # test the model for some episodes
            if np.mean(test_scores) > best_scores_info["mean"]:  # if the current score beats the historical best
                best_scores_info = {"mean": np.mean(test_scores),
                                    "std": np.std(test_scores),
                                    "step": agent.current_step}
                agent.save_model(model_name="best_model.pth")  # save the best model
        # end benchmarking
        print("Best Model Score: %.2f, std=%.2f" % (best_scores_info["mean"], best_scores_info["std"]))
    else:
        if not args.test:  # train the model without testing
            n_train_steps = args.running_steps // n_envs  # calculate the total number of training iterations
            agent.train(n_train_steps)  # train the model directly.
            agent.save_model("final_train_model.pth")  # save the final model file.
            print("Finish training!")
        else:  # test a trained model
            def env_fn():
                args_test = deepcopy(args)
                args_test.parallels = 1
                return make_envs(args_test)

            agent.render = True
            agent.load_model(agent.model_dir_load, args.seed)  # load the model file
            scores = agent.test(env_fn, args.test_episode)  # test the model
            print(f"Mean Score: {np.mean(scores)}, Std: {np.std(scores)}")
            print("Finish testing.")

    # the end.
    envs.close()  # close the environments
    agent.finish()  # finish the example
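As a concrete check of the benchmark arithmetic in run(), plugging in the values from the Step 1 config file gives:

# Worked example of the benchmark loop arithmetic, using the
# hyper-parameters from the Step 1 config file.
running_steps = 1_000_000   # total environment steps
parallels = 16              # number of vectorized environments
eval_interval_cfg = 5_000   # environment steps between evaluations

train_steps = running_steps // parallels        # 62500 training iterations
eval_interval = eval_interval_cfg // parallels  # 312 iterations per epoch
num_epoch = int(train_steps / eval_interval)    # 200 train/test rounds
print(train_steps, eval_interval, num_epoch)    # -> 62500 312 200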
After finishing the above three steps, you can run the ppo_mujoco.py file in the console and train the model:
python ppo_mujoco.py --method ppo --env mujoco --env-id Ant-v4
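To evaluate a saved model instead of running the benchmark, the same script can be invoked with the flags defined in Step 2.1 (benchmark off, test on), for example:

python ppo_mujoco.py --method ppo --env mujoco --env-id Ant-v4 --benchmark 0 --test 1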
The source code of this example can be found at the following link:
https://github.com/agi-brain/xuance/examples/ppo/ppo_mujoco.py