Say «ناوليني سيروم الشعر» ("hand me the hair serum") in any Arabic dialect, and the arm sees what's on the table, localizes the item, grasps it with a learned policy, retries if it misses, delivers it to a fixed delivery zone, and replies aloud in Arabic. If the item isn't there, it doesn't guess: it says aloud what is on the table.
Reasoning: understands dialect, decides, disambiguates, and refuses safely.
Vision: reads labels to identify which serum is which and to localize it.
Speech: hears the user and replies in Arabic (voice Noor).
Arm: grasps and delivers, with closed-loop retry.
Fanar reasons over four tools; deliver() runs a hybrid perception-guided grasp.
normalize.py corrects diacritics and misheard brand names. Fanar is a task-level vision-language-to-action controller (its actions are tool calls); SmolVLA is the low-level motor controller. There is no hardcoded if/else: every semantic decision is Fanar's.
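The kind of cleanup attributed to normalize.py could be sketched as below. The real file is not shown here, so the function names and the single alias entry are assumptions made for illustration. The sketch removes Arabic vowel marks and the tatweel elongation character, then replaces known misheard words token by token. It deliberately leaves hamza-bearing letters such as أ untouched.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character; it carries no meaning

# Hypothetical alias table: misheard spelling -> canonical word.
# The single entry is illustrative only, not taken from the project.
ALIASES = {"سيرم": "سيروم"}


def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks (Unicode category Mn) and tatweel."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def normalize(text: str) -> str:
    """Strip diacritics, then replace known misheard words token by token."""
    tokens = strip_diacritics(text).split()
    return " ".join(ALIASES.get(tok, tok) for tok in tokens)


print(normalize("ناوليني سيرم الشعر"))
```

Matching whole tokens rather than substrings keeps the alias step from rewriting the inside of longer words. A real alias table would presumably be built from observed speech-to-text errors.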
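SYSTEM: أنت بَصير … اعرض المشهد قبل التسليم؛ تكلّم دائماً؛ وضّح عند الالتباس ("You are Baseer … review the scene before delivering; always speak; clarify when ambiguous")
USER: ناوليني السيروم ("hand me the serum")
Fanar ▶ {"action":"perceive_scene"}
OBS: {"items":["سيروم الشعر","سيروم الوجه"]} (items: "hair serum", "face serum")
Fanar ▶ {"action":"ask","args":{"text_ar":"عندي سيروم الوجه وسيروم الشعر، أي واحد تبي؟"}} ("I have the face serum and the hair serum, which one do you want?")
USER: سيروم الشعر ("the hair serum")
Fanar ▶ {"action":"deliver","args":{"item":"سيروم الشعر"}}
Fanar ▶ {"action":"say","args":{"text_ar":"تفضّل، سيروم الشعر أمامك"}} ("Here you go, the hair serum is in front of you")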
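A minimal sketch of how JSON actions like those in the trace could be routed to tools follows. The four action names and their argument keys come from the trace. The stub bodies, the `dispatch` helper, and the error shape are assumptions for illustration, not the project's actual code. The dispatcher makes no semantic choice of its own: it only runs the tool the model named and returns the observation.

```python
import json
from typing import Any, Callable, Dict, List

spoken: List[str] = []  # stands in for text-to-speech output


# Stub tools; real ones would call the camera, the arm, and the speaker.
def perceive_scene() -> Dict[str, Any]:
    return {"items": ["سيروم الشعر", "سيروم الوجه"]}


def ask(text_ar: str) -> Dict[str, Any]:
    spoken.append(text_ar)
    return {"asked": True}


def deliver(item: str) -> Dict[str, Any]:
    return {"delivered": item}


def say(text_ar: str) -> Dict[str, Any]:
    spoken.append(text_ar)
    return {"said": True}


TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "perceive_scene": perceive_scene,
    "ask": ask,
    "deliver": deliver,
    "say": say,
}


def dispatch(raw: str) -> Dict[str, Any]:
    """Parse one model message and run the named tool with its args."""
    call = json.loads(raw)
    name = call["action"]
    if name not in TOOLS:
        # Report unknown actions back instead of guessing the intent.
        return {"error": f"unknown action: {name}"}
    return TOOLS[name](**call.get("args", {}))


obs = dispatch('{"action":"perceive_scene"}')
result = dispatch('{"action":"deliver","args":{"item":"سيروم الشعر"}}')
```

Returning an error observation for an unknown action, rather than raising, lets the model see its mistake on the next turn and correct it.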
28 episodes (13 hair + 15 face serum) · single front camera 640×480@30 · LeRobot format · two language tasks · varied object placement · quality-checked firm grasps.
SmolVLA (SmolVLM2-500M backbone, ~450M params), language-conditioned · 20k steps, batch 32, loss 0.012 · deploys on MPS or CUDA · swappable with ACT / π0 / GR00T.
| Capability | Model | Role |
|---|---|---|
| Reasoning | Fanar-C-2-27B | dialect understanding, planning, tool selection, refusal |
| Speech-to-text | Fanar-Aura-STT-1 | Arabic voice → text |
| Text-to-speech | Fanar-Aura-TTS-2 | spoken Arabic replies (Noor) |
| Vision | Fanar-Oryx-IVU-2 | reads labels → identifies & localizes the item |
Dialect handling is the centerpiece: Gulf, Egyptian, and MSA input maps directly to actions, with no English translation step. Oryx keeps perception Fanar-native and supplies the localization that guides the arm.
    # voice / agent demo (laptop)
    pip install -r requirements.txt
    cp .env.example .env   # FANAR_API_KEY=...
    cd backend && BASEER_PERCEIVE=oryx python server.py   # → http://localhost:8080

    # deploy the grasp (laptop with the SO-100)
    python backend/robot/calibrate_grasp.py --port <follower> --id follower_so100
    python backend/robot/save_delivery_pose.py --port <follower> --id follower_so100
    python backend/robot/calibrate_localization.py --port <follower> --id follower_so100
    python backend/agent/agent4_grasp.py --policy ~/baseer/policy_vla/pretrained_model \
      --task "Pick up the hair serum and place it in the delivery zone" \
      --item "سيروم الشعر" --attempts 3
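The `--attempts 3` flag and the closed-loop retry described earlier suggest a bounded loop that looks at the scene again before each grasp. The sketch below shows one way such a loop could be written. It is not the contents of agent4_grasp.py: `deliver_with_retry`, `DeliveryResult`, and the `localize` and `grasp` callbacks are hypothetical names standing in for the perception and policy calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Position = Tuple[float, float]  # assumed 2D target on the table


@dataclass
class DeliveryResult:
    success: bool
    attempts_used: int  # number of grasps actually tried
    reason: str


def deliver_with_retry(
    localize: Callable[[], Optional[Position]],
    grasp: Callable[[Position], bool],
    attempts: int = 3,
) -> DeliveryResult:
    """Re-localize before every grasp; stop on success or when out of tries."""
    for n in range(1, attempts + 1):
        target = localize()  # fresh observation each round (closed loop)
        if target is None:
            # Nothing to grasp: report it rather than guessing a position.
            return DeliveryResult(False, n - 1, "item not visible")
        if grasp(target):
            return DeliveryResult(True, n, "delivered")
    return DeliveryResult(False, attempts, "grasp failed")


# Simulated policy that misses twice and then succeeds.
outcomes = iter([False, False, True])
result = deliver_with_retry(lambda: (0.10, 0.25), lambda target: next(outcomes))
print(result)
```

Localizing inside the loop matters because a failed grasp can knock the item to a new position. Reusing the first estimate would repeat the same miss.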