Agents
Reinforcement learning agents are autonomous units that interact with an environment and are capable of making decisions and learning on their own. In each interaction, an agent receives an observation, computes an action from that observation, and executes the action, driving the environment into its next state. By interacting with the environment repeatedly, the agent collects experience data and trains its model on that experience, thereby obtaining a better policy; the sketch below illustrates this loop.
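The following is a minimal sketch of that observe-act-learn loop. It assumes the gymnasium API with a random policy standing in for a trained agent; the `CartPole-v1` environment and the `ReplayBuffer` helper are illustrative assumptions, not XuanCe's API.

```python
import random
from collections import deque

import gymnasium as gym  # assumption: gymnasium stands in for any RL environment API


class ReplayBuffer:
    """Hypothetical helper that stores (obs, action, reward, next_obs, done) transitions."""

    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.data), min(batch_size, len(self.data)))


env = gym.make("CartPole-v1")  # illustrative environment choice
buffer = ReplayBuffer()

obs, info = env.reset(seed=0)
for step in range(1_000):
    # 1. Compute an action from the current observation. A trained agent
    #    would query its policy; a random action stands in here.
    action = env.action_space.sample()

    # 2. Executing the action drives the environment into its next state.
    next_obs, reward, terminated, truncated, info = env.step(action)

    # 3. Collect the transition as experience data for later training.
    buffer.add((obs, action, reward, next_obs, terminated or truncated))

    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()

# A learner such as DQN would now update its model on sampled experience.
batch = buffer.sample(batch_size=32)
```

The tables below list the single- and multi-agent reinforcement learning agents included in the "XuanCe" (玄策) platform, together with the deep learning frameworks each one supports.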
| Agent | PyTorch | TensorFlow | MindSpore |
|---|---|---|---|
| DQN: Deep Q-Networks | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| C51DQN: Distributional Reinforcement Learning | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Double DQN: DQN with Double Q-learning | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Dueling DQN: DQN with Dueling network | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Noisy DQN: DQN with Parameter Space Noise | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| PERDQN: DQN with Prioritized Experience Replay | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| QRDQN: DQN with Quantile Regression | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| VPG: Vanilla Policy Gradient | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| PPG: Phasic Policy Gradient | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| PPO: Proximal Policy Optimization | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| PDQN: Parameterised DQN | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| SPDQN: Split PDQN | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| MPDQN: Multi-pass PDQN | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| A2C: Advantage Actor Critic | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| SAC: Soft Actor-Critic | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| SAC-Dis: SAC for Discrete Actions | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| DDPG: Deep Deterministic Policy Gradient | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| TD3: Twin Delayed DDPG | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Multi-Agent | PyTorch | TensorFlow | MindSpore |
|---|---|---|---|
| IQL: Independent Q-Learning | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| VDN: Value-Decomposition Networks | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| QMIX: VDN with Q-Mixer | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| WQMIX: Weighted QMIX | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| QTRAN: Q-Transformation | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| DCG: Deep Coordination Graph | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| IDDPG: Independent DDPG | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| MADDPG: Multi-Agent DDPG | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| ISAC: Independent SAC | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| MASAC: Multi-Agent SAC | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| IPPO: Independent PPO | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| MAPPO: Multi-Agent PPO | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| MATD3: Multi-Agent TD3 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| VDAC: Value-Decomposition Actor-Critic | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| COMA: Counterfactual Multi-Agent PG | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| MFQ: Mean-Field Q-Learning | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| MFAC: Mean-Field Actor-Critic | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |