بَصير · Baseer — Arabic voice-controlled assistive arm

01 What it does

Say, in any Arabic dialect, «ناوليني سيروم الشعر» ("hand me the hair serum"), and the arm sees what's on the table, localizes the item, grasps it with a learned policy, retries if it misses, delivers it to a fixed hand-off zone, and speaks back in Arabic. If the item isn't there, it doesn't guess: it says aloud which items are on the table.

Brain

Fanar-C-2-27B

Understands dialect, decides, disambiguates, refuses safely.

Eyes

Fanar-Oryx

Reads labels — identifies which serum and localizes where.

Voice

Aura ASR + TTS

Hears the user, replies in Arabic (voice Noor).

Muscle

SmolVLA · SO-100

Grasps & delivers, with closed-loop retry.

02 The full pipeline

Fanar reasons over four tools; deliver() runs a hybrid perception-guided grasp.

Listen — phone tap-to-talk → Aura ASR → normalize.py normalizes diacritics and corrects misheard brand names.
Reason — Fanar agent loop (ReAct) emits JSON actions: perceive_scene → deliver → say / ask.
See — Fanar-Oryx reads the labels and reports which serums are on the table.
Localize — Oryx returns the target's pixel; a calibrated map → arm hover pose above it.
Grasp — SmolVLA policy does the final descent + close from the hover.
Verify — gripper torque (motor current) and jaw width classify the grasp as HELD or EMPTY.
Retry — on a miss: open, re-home, re-localize, try again.
Deliver & speak — scripted waypoints to the hand-off zone, release, confirm in Arabic via Aura.
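
The Grasp → Verify → Retry steps above can be sketched as a small closed loop. This is a minimal illustration, not the project's implementation: the threshold values, the `GripperReading` type, and the four injected callables (`localize`, `grasp`, `read_gripper`, `recover`) are all hypothetical names introduced here, and the actual check may combine current and width differently.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical thresholds in arbitrary units; real values would come from calibration.
CURRENT_HELD_MIN = 120.0  # motor current seen while squeezing an object
WIDTH_HELD_MIN = 8.0      # jaw opening that means something sits between the jaws

Pose = Tuple[float, float]


@dataclass
class GripperReading:
    current: float  # torque proxy measured after the close command
    width: float    # jaw opening measured after the close command


def classify_grasp(reading: GripperReading) -> str:
    """Return HELD only if the motor is loaded AND the jaws stopped short of closing."""
    if reading.current >= CURRENT_HELD_MIN and reading.width >= WIDTH_HELD_MIN:
        return "HELD"
    return "EMPTY"


def grasp_with_retry(
    localize: Callable[[], Pose],
    grasp: Callable[[Pose], None],
    read_gripper: Callable[[], GripperReading],
    recover: Callable[[], None],
    attempts: int = 3,
) -> bool:
    """Re-localize and re-grasp until the gripper reports HELD or attempts run out."""
    for _ in range(attempts):
        hover_pose = localize()   # fresh pixel -> hover pose on every attempt
        grasp(hover_pose)         # policy descends from the hover and closes
        if classify_grasp(read_gripper()) == "HELD":
            return True
        recover()                 # on a miss: open the gripper and re-home
    return False
```

Injecting the hardware calls as functions keeps the loop testable without the arm attached; `attempts` mirrors the `--attempts` flag shown in the run commands.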

03 Hierarchical VLA

Fanar is a task-level Vision-Language→Action controller (actions = tool calls); SmolVLA is the low-level motor controller. No hardcoded if/else — every semantic decision is Fanar's.

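SYSTEM:  أنت بَصير … اعرض المشهد قبل التسليم؛ تكلّم دائماً؛ وضّح عند الالتباس
         (You are Baseer … survey the scene before delivering; always speak; clarify when ambiguous)
USER:    ناوليني السيروم
         (Hand me the serum)
  Fanar ▶ {"action":"perceive_scene"}
OBS:     {"items":["سيروم الشعر","سيروم الوجه"]}
         (items: hair serum, face serum)
  Fanar ▶ {"action":"ask","args":{"text_ar":"عندي سيروم الوجه وسيروم الشعر، أي واحد تبي؟"}}
         (I have the face serum and the hair serum; which one do you want?)
USER:    سيروم الشعر
         (The hair serum)
  Fanar ▶ {"action":"deliver","args":{"item":"سيروم الشعر"}}
  Fanar ▶ {"action":"say","args":{"text_ar":"تفضّل، سيروم الشعر أمامك"}}
         (Here you go, the hair serum is in front of you)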
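
A trace like the one above could be driven by a loop along these lines. This is a rough sketch under stated assumptions, not the project's agent code: `run_agent`, `next_action`, and the `tools` mapping are hypothetical names, the model call is replaced by a scripted stand-in, and the rule that a turn ends on `say` or `ask` is inferred from the trace rather than documented.

```python
import json
from typing import Callable, Dict, List

Tool = Callable[[dict], dict]


def run_agent(
    next_action: Callable[[List[dict]], str],
    tools: Dict[str, Tool],
    max_steps: int = 8,
) -> List[dict]:
    """Ask the model for one JSON action at a time, run the tool, feed back the result."""
    history: List[dict] = []
    for _ in range(max_steps):
        raw = next_action(history)  # in the real system this would be a Fanar call
        try:
            step = json.loads(raw)
            name = step["action"]
        except (json.JSONDecodeError, KeyError, TypeError):
            history.append({"action": None, "args": {}, "obs": {"error": "invalid action JSON"}})
            continue
        args = step.get("args", {})
        if name in tools:
            obs = tools[name](args)
        else:
            obs = {"error": f"unknown action: {name}"}
        history.append({"action": name, "args": args, "obs": obs})
        if name in ("say", "ask"):
            break  # assumed: speaking to the user ends the agent's turn
    return history


if __name__ == "__main__":
    # Scripted stand-in for the model: replays the first two actions of the trace.
    script = iter([
        '{"action":"perceive_scene"}',
        '{"action":"ask","args":{"text_ar":"أي واحد تبي؟"}}',
    ])
    demo_tools: Dict[str, Tool] = {
        "perceive_scene": lambda args: {"items": ["سيروم الشعر", "سيروم الوجه"]},
        "ask": lambda args: {"spoken": args["text_ar"]},
    }
    for entry in run_agent(lambda history: next(script), demo_tools):
        print(entry["action"], entry["obs"])
```

The loop itself contains no semantic branching: which tool runs next is decided entirely by the JSON the model emits, which is the division of labor this section describes.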

04 Dataset & model

📦 Dataset — baseer_serums

28 episodes (13 hair + 15 face serum) · single front camera, 640×480 at 30 fps · LeRobot format · two language tasks · varied object placement · quality-checked firm grasps.

🤖 Model — baseer-smolvla-serums

SmolVLA (SmolVLM2-500M backbone, ~450M params), language-conditioned · 20k steps, batch 32, loss 0.012 · deploys on MPS or CUDA · swappable with ACT / π0 / GR00T.

28

episodes

0.012

final loss

4

Fanar models used

~20 Hz

control on MPS

05 Use of Fanar

Capability	Model	Role
Reasoning	Fanar-C-2-27B	dialect understanding, planning, tool selection, refusal
Speech-to-text	Fanar-Aura-STT-1	Arabic voice → text
Text-to-speech	Fanar-Aura-TTS-2	spoken Arabic replies (Noor)
Vision	Fanar-Oryx-IVU-2	reads labels → identifies & localizes the item

Dialect handling is the centerpiece: Gulf, Egyptian, and Modern Standard Arabic (MSA) requests map directly to actions, with no English translation step. Oryx keeps perception Fanar-native and supplies the localization that guides the arm.
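
To illustrate how a pixel reported by Oryx could become an arm target, here is a minimal sketch of a calibrated pixel-to-table map. It assumes a flat table and an affine model fitted from exactly three calibration points; the project's actual calibration may use a different model (for example a homography or a lookup grid). `fit_affine`, `pixel_to_table`, and the sample coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Row = Tuple[float, float, float]


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def _solve3(rows: Sequence[Row], rhs: Sequence[float]) -> Row:
    """Solve a 3x3 linear system with Cramer's rule."""
    det = _det3(rows)
    if abs(det) < 1e-9:
        raise ValueError("calibration pixels are collinear; pick three spread-out points")
    solution: List[float] = []
    for col in range(3):
        m = [list(r) for r in rows]
        for k in range(3):
            m[k][col] = rhs[k]  # replace one column with the right-hand side
        solution.append(_det3(m) / det)
    return (solution[0], solution[1], solution[2])


def fit_affine(pixels: Sequence[Point], table: Sequence[Point]) -> Tuple[Row, Row]:
    """Fit x = a*u + b*v + c and y = d*u + e*v + f from three pixel/table pairs."""
    if len(pixels) != 3 or len(table) != 3:
        raise ValueError("this sketch needs exactly three calibration pairs")
    rows = [(u, v, 1.0) for u, v in pixels]
    x_coeffs = _solve3(rows, [x for x, _ in table])
    y_coeffs = _solve3(rows, [y for _, y in table])
    return x_coeffs, y_coeffs


def pixel_to_table(model: Tuple[Row, Row], pixel: Point) -> Point:
    """Map an image pixel (u, v) to table coordinates (x, y)."""
    (a, b, c), (d, e, f) = model
    u, v = pixel
    return (a * u + b * v + c, d * u + e * v + f)
```

With more than three calibration points, a least-squares fit would average out click noise; three points are used here only to keep the sketch dependency-free.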

06 Run it

# voice / agent demo (laptop)
pip install -r requirements.txt
cp .env.example .env          # FANAR_API_KEY=...
cd backend && BASEER_PERCEIVE=oryx python server.py   # then open http://localhost:8080

# deploy the grasp (laptop with the SO-100)
python backend/robot/calibrate_grasp.py        --port <follower> --id follower_so100
python backend/robot/save_delivery_pose.py     --port <follower> --id follower_so100
python backend/robot/calibrate_localization.py --port <follower> --id follower_so100
python backend/agent/agent4_grasp.py --policy ~/baseer/policy_vla/pretrained_model \
  --task "Pick up the hair serum and place it in the delivery zone" \
  --item "سيروم الشعر" --attempts 3