Multiagent Reinforcement Learning for Joint Spectrum and Energy Optimization in CR-NOMA Enabled Internet of Unmanned Agents

Saleha Ahmed; Muhammad Uzair; Syed Asad Ullah

doi:10.1109/JIOT.2025.3624882

Islamabad, Pakistan

Muhammad Uzair. Reinforcement learning researcher working on multi-agent DRL and human-guided policy optimization.

Two IEEE publications on cooperative multi-agent RL for wireless systems. Mitacs Globalink researcher at McMaster University (2025), advised by Dr. Istvan David, on integrating human advisory signals into PPO via subjective logic. Open to research-focused Master's opportunities and available to start in any upcoming term.


2 IEEE publications

Mitacs Globalink 2025

3.61 / 4.00 CGPA

NUST B.E. SE, 2025

01 / Research

Research interests.

I am looking for research-focused Master's opportunities in machine learning and can start in any upcoming term. My focus is reinforcement learning under realistic constraints: partial observability, multi-agent coordination, and learning from limited or noisy human feedback.

01 area

Multi-agent reinforcement learning

Cooperative and competitive policy learning under partial observability. Two IEEE publications on decentralized MARL for CR-NOMA wireless systems.

MARL · cooperative DRL · partial observability

02 area

Human-guided policy optimization

Integrating advisory signals from humans into on-policy training. Mitacs Globalink work with Dr. Istvan David at McMaster on subjective-logic belief modeling layered onto PPO.

PPO · subjective logic · human-in-the-loop

03 area

RL for wireless and communications

Sample-efficient continuous-action policies for transmit power control, spectrum access, and energy harvesting under stochastic fading. Benchmarks on DDPG, TD3, PPO.

continuous control · NOMA · energy harvesting

04 area

LLM agents and RLHF

Bridging classical RL with language-model post-training. Industry exposure to retrieval, tool use, and multi-agent orchestration informs research questions on alignment and reward modeling.

RLHF · reward modeling · agentic LLMs

02 / Background

About.

I graduated from NUST in 2025 with a B.E. in software engineering and a 3.61 / 4.00 CGPA. My research track started in my third year at the Information Processing and Transmission Lab under Prof. Dr. Syed Ali Hassan, where I co-authored two IEEE papers on multi-agent DRL for wireless systems.

In summer 2025 I was a fully funded Mitacs Globalink research intern at McMaster University with Dr. Istvan David at the SSM Lab. I worked on guided policy optimization: integrating human advisory signals into PPO via subjective-logic belief modeling on Gymnasium PacMan, with measurable convergence gains over baseline PPO.

Alongside research, I work as an AI engineer at Adept Tech Solutions on voice and email agents. This pays the bills and keeps me close to production LLM systems, which is informing my interest in RLHF and reward modeling research.

03 / Peer reviewed

Publications.


[01]

Multiagent Reinforcement Learning for Joint Spectrum and Energy Optimization in CR-NOMA Enabled Internet of Unmanned Agents

Saleha Ahmed, Muhammad Uzair, Syed Asad Ullah, et al.

IEEE Internet of Things Journal · 2025

A cooperative multi-agent DRL framework for CR-NOMA IoT, where distributed agents jointly learn spectrum access and power control policies under partial observability.

[02]

Energy Efficient Uplink Communications for Wireless Powered Networks with EH Diversity: A DRL-driven Strategy

Saleha Ahmed, Muhammad Uzair, Syed Asad Ullah, et al.

IEEE International Conference on Communications (ICC) · 2025

DRL-driven transmit power control for energy-harvesting uplink nodes, evaluated against MRC, SC, and EGC diversity combining schemes under Rayleigh fading.


04 / Selected work

Projects.

01

Human-Guided Policy Gradient on Atari Pacman

A REINFORCE policy-gradient agent on ALE Pacman with human advisory signals fused into the policy via subjective-logic belief modeling. Achieves measurable convergence speedup over baseline PPO.
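
To make the subjective-logic idea concrete, here is a minimal sketch, not the project code. The opinion construction follows the standard binomial-opinion mapping from evidence counts (belief, disbelief, uncertainty, base rate). The blending rule in `blend_with_advice` is purely a hypothetical illustration of how a trusted advisory signal could shift an action distribution; all names and parameters are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial subjective-logic opinion about an advisor being right."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected_probability(self) -> float:
        # Standard projection: belief plus the base-rate share of uncertainty.
        return self.belief + self.base_rate * self.uncertainty


def opinion_from_evidence(positive, negative, prior_weight=2.0, base_rate=0.5):
    """Map counts of helpful / unhelpful advice to an opinion."""
    total = positive + negative + prior_weight
    return Opinion(positive / total, negative / total,
                   prior_weight / total, base_rate)


def blend_with_advice(policy_probs, advised_action, opinion):
    """Hypothetical rule (an assumption, not the project's method):
    move probability mass toward the advised action in proportion
    to the belief component of the opinion."""
    trust = opinion.belief
    mixed = [(1.0 - trust) * p for p in policy_probs]
    mixed[advised_action] += trust
    return mixed


advisor = opinion_from_evidence(positive=6, negative=2)
action_probs = blend_with_advice([0.25, 0.25, 0.25, 0.25], 2, advisor)
```

With little evidence the uncertainty term dominates and the policy is left almost unchanged, which is the property that makes an opinion-based weight attractive compared with a fixed mixing coefficient.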

RL · REINFORCE · PPO · Subjective Logic · Human-in-the-loop · PyTorch · ALE

02

Multi-Agent DRL for CR-NOMA Secondary Device Networks

Implementation backing two IEEE publications on cooperative MARL for wireless powered networks. From-scratch MADDPG, MATD3, MASAC, and MAPPO over a CR-NOMA environment with energy harvesting and three diversity-combining receiver variants (SC, MRC, EGC).
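
For intuition on the three receiver variants, here is a minimal Monte Carlo sketch, not the code behind the papers. It assumes independent Rayleigh-faded branches with equal mean SNR and equal noise power, so each branch power gain is exponentially distributed with unit mean. The function name `simulate_combining` and its parameters are illustrative assumptions.

```python
import math
import random


def simulate_combining(num_branches=3, mean_snr=1.0, trials=100_000, seed=0):
    """Estimate the average post-combining SNR for SC, EGC, and MRC."""
    rng = random.Random(seed)
    totals = {"SC": 0.0, "EGC": 0.0, "MRC": 0.0}
    for _ in range(trials):
        # Rayleigh amplitude => power gain |h|^2 is exponential, mean 1.
        gains = [rng.expovariate(1.0) for _ in range(num_branches)]
        # Selection combining: keep only the strongest branch.
        totals["SC"] += mean_snr * max(gains)
        # Maximal-ratio combining: branch SNRs add up.
        totals["MRC"] += mean_snr * sum(gains)
        # Equal-gain combining: co-phase and add amplitudes with unit weights.
        amplitude_sum = sum(math.sqrt(g) for g in gains)
        totals["EGC"] += mean_snr * amplitude_sum ** 2 / num_branches
    return {name: total / trials for name, total in totals.items()}


average_snr = simulate_combining()
```

Under these assumptions the averages should approach the textbook values for L branches with mean SNR γ̄: L·γ̄ for MRC, γ̄·(1 + 1/2 + ... + 1/L) for SC, and γ̄·(1 + (L − 1)·π/4) for EGC, which gives the usual ordering SC < EGC < MRC.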

MARL · MADDPG · MATD3 · MASAC · MAPPO · PyTorch · Wireless / NOMA

03

PPO on Gymnasium Inverted Pendulum

Proximal Policy Optimization on the MuJoCo inverted pendulum with a hand-shaped reward and Optuna-driven hyperparameter search. Groundwork for my Mitacs research on guided policy optimization.
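
As a reminder of what PPO optimizes, here is a minimal pure-Python sketch of the clipped surrogate objective. It is not taken from the project, which uses Stable-Baselines3; the function name and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    ratios: new-policy / old-policy probability ratios per sample.
    advantages: advantage estimates per sample.
    """
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two surrogate terms.
        total += min(ratio * advantage, clipped * advantage)
    return total / len(ratios)


objective = ppo_clipped_objective([1.5, 0.5], [1.0, -1.0])
```

The clipping range is one of the hyperparameters a search tool such as Optuna would typically tune, alongside the learning rate and entropy coefficient.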

RL · PPO · MuJoCo · Optuna · Stable-Baselines3

04

Distributed Learning-to-Rank on ANTIQUE

Comparative study of single-device vs multi-GPU training for a learning-to-rank DNN on the ANTIQUE QA dataset, built on TensorFlow Ranking. Same ranking quality across both strategies, with a discussion of when MirroredStrategy actually buys you speed.

Learning to Rank · TensorFlow Ranking · Distributed Training · Information Retrieval

05

Air Quality Monitoring on Azure IoT

End-to-end IoT and ML pipeline built around a custom dataset gathered from ESP32 sensors deployed across campus. 75% AQI regression accuracy on held-out days.

IoT · Azure · ESP32 · ML · Time Series

06

RAG Banking Assistant

Domain-specific LLM assistant for banking queries on LLaMA 3.2 3B Instruct with LoRA fine-tuning, FAISS retrieval, and live document ingestion. Testing how far a small open-weights model can go on a narrow domain when retrieval is well tuned.

LLM · RAG · PEFT / LoRA · FastAPI · Sentence Transformers


05 / Timeline

Experience.

research
Jun 2025 to Aug 2025

Research Intern, Mitacs Globalink · McMaster University

Hamilton, ON, Canada

Advised by Dr. Istvan David · SSM Lab
- Fully funded Mitacs Globalink internship on guided policy optimization in sequential decision making under partial observability.
- Benchmarked REINFORCE and PPO on Gymnasium PacMan; tuned reward shaping and entropy regularization for stable convergence.
- Designed a subjective-logic belief model that fuses human advisory signals with the policy gradient, achieving measurable convergence speedup over baseline PPO.
research
Jun 2024 to Sep 2025

Research Collaborator · Information Processing and Transmission Lab, NUST

Islamabad, PK

Advised by Prof. Dr. Syed Ali Hassan
- Co-authored two IEEE publications on multi-agent DRL for cognitive-radio NOMA and wireless powered networks.
- Developed a cooperative MARL framework for joint spectrum access and power control in CR-NOMA IoT under partial observability.
- Benchmarked DDPG, TD3, and PPO for continuous-action transmit power control under stochastic Rayleigh fading.
- Analyzed MRC, SC, and EGC diversity combining schemes for energy harvesting uplink nodes.
industry
Nov 2025 to Present

AI Engineer · Adept Tech Solutions

Islamabad, PK
- Engineering production LLM systems: voice agents on VAPI and Deepgram with sub-400 ms transcription latency.
- Multi-agent LLM orchestration over FastAPI microservices, plus a RAG retrieval layer on pgvector with 768-dimensional MPNet embeddings.
- Operational context that keeps me close to alignment, reward modeling, and inference-time control as research questions.

06 / Academic

Education.

education
Nov 2021 to Jun 2025

B.E. Software Engineering · National University of Sciences and Technology

Islamabad, PK
- CGPA 3.61 / 4.00 over 133 credit hours. Degree conferred 13 June 2025.
- School of Electrical Engineering and Computer Science.
- 4x FBISE HSSC merit scholarship recipient.
Relevant coursework

CS-368 · Reinforcement Learning · A
CS-471 · Machine Learning · A
MATH-361 · Probability & Statistics · A
MATH-121 · Linear Algebra & ODEs · A
MATH-352 · Numerical Methods · A
MATH-232 · Complex Variables & Transforms · A
CS-416 · Large Language Models · B+
CS-250 · Data Structures & Algorithms · B+
CS-251 · Design & Analysis of Algorithms · B+
MATH-101 · Calculus & Analytical Geometry · B+

07 / Get in touch

Contact.

available

Open to research-focused Master's opportunities in machine learning and available to start in any upcoming term (Winter, Spring, or Fall 2026 / 2027). My research focus is reinforcement learning, multi-agent systems, and learning from human feedback.

If you are a professor or graduate admissions reviewer, email is the fastest way to reach me and I respond within a day. My CV is linked below.

Download CV
