Training nanochat [GPT-2] LLM on cheap hardware - ‘robust’ skill | Some Thoughts on AI, LLMs and Tech

I managed to train nanochat on cheap hardware rented from runpod.io. After several runs, I saw lots of errors and slow start times, so I decided to pause and make the process more robust.

Simply instructed the LLM:

Please test different machines in runpod; install nanochat, do few minutes of training.
Fix any errors;
Create either Python/Bash file for detecting and fixing errors, or .md file for future runs by LLMs.

Of course, Python/Bash scripts are better since they are deterministic and faster. But an .md file for LLMs works as well, even though it is slower.

And then, added some more instructions:

Please save compiled FA2 [Flash Attention 2] as a wheel for next runs;
Please, after running torch.compile(), save the compiled version.

The nanochat project has a parameter to control network depth; some other variables and parts of the structure are derived from that depth. torch.compile() takes about 20 minutes on cheap machines from runpod.io, which was “too expensive”, so a cache was added, keyed by an ID. The torch.compile cache depends on the GPU architecture and the PT version, e.g. sm86_pt24; the compiled model itself also depends on the depth, n. The Flash Attention 2 cache depends on the Python and CUDA versions and is stored as a wheel, e.g. fa2_cp311_cu124.whl.
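As a rough illustration of how such cache IDs could be derived, here is a minimal sketch. The function names are hypothetical and not taken from the actual scripts; only the resulting key shapes (sm86_pt24, fa2_cp311_cu124.whl) come from the run described here.

```python
def compile_cache_key(capability, torch_version):
    """Build a key such as 'sm86_pt24' from (8, 6) and '2.4.0+cu124'."""
    major, minor = capability
    # Drop a local build suffix like '+cu124' before splitting the version.
    pt_major, pt_minor = torch_version.split("+")[0].split(".")[:2]
    return f"sm{major}{minor}_pt{pt_major}{pt_minor}"


def fa2_wheel_cache_name(python_version, cuda_version):
    """Build a cache label such as 'fa2_cp311_cu124.whl' from (3, 11) and '12.4'."""
    py_major, py_minor = python_version[:2]
    cu_major, cu_minor = cuda_version.split(".")[:2]
    return f"fa2_cp{py_major}{py_minor}_cu{cu_major}{cu_minor}.whl"


if __name__ == "__main__":
    # On a real pod the inputs would come from torch.cuda.get_device_capability(),
    # torch.__version__, sys.version_info and torch.version.cuda.
    print(compile_cache_key((8, 6), "2.4.0+cu124"))
    print(fa2_wheel_cache_name((3, 11), "12.4"))
```

Keeping the key builders free of any torch import makes them easy to test on a laptop before spending money on a pod.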

The goal is - to make sure

Here's the full error list from this session:

Environment / image errors

1. PT 2.1/2.2 crash — TypeError: init_process_group() got an unexpected keyword argument 'device_id'. nanochat requires PT ≥ 2.3. Dropped PT 2.1/2.2 entirely.
2. Ubuntu 20.04 images SSH timeout — those image names don't exist on RunPod. Disabled all u20 combos.
3. PT 2.8 SSH timeout — image not yet available on RunPod. Commented out.

FA2 wheel errors
4. fa2_cached.whl is not a valid wheel filename — pip validates wheel filenames against {name}-{version}-{pytag}-{abitag}-{platform}.whl. We were copying as a flat name. Fixed by extracting the proper filename from the zip's .dist-info/WHEEL metadata before installing.
5. fa2_preinstalled = True set even on failed install — didn't check returncode, so FA2 appeared installed but wasn't. Fixed to only set the flag on rc == 0, with fallback to background build.
6. FA2 background && chain SSH timeout — cmd1 && cmd2 & doesn't detach properly over SSH; only cmd2 runs in the background. Fixed to (cmd1 && cmd2) &.
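The wheel filename fix can be sketched roughly as follows. This is an illustration under assumptions, not the code from the run: the function name is hypothetical, it takes the name and version from the .dist-info directory name, and it uses only the first Tag line from the WHEEL metadata.

```python
import zipfile


def proper_wheel_filename(wheel_path):
    """Rebuild '{name}-{version}-{pytag}-{abitag}-{platform}.whl' from the archive."""
    with zipfile.ZipFile(wheel_path) as zf:
        # The metadata file lives at '<name>-<version>.dist-info/WHEEL'.
        wheel_member = next(
            n for n in zf.namelist()
            if n.count("/") == 1 and n.endswith(".dist-info/WHEEL")
        )
        name_version = wheel_member.split("/")[0][: -len(".dist-info")]
        text = zf.read(wheel_member).decode("utf-8")

    # Each 'Tag:' line holds one 'pytag-abitag-platform' triple.
    tags = [
        line.split(":", 1)[1].strip()
        for line in text.splitlines()
        if line.startswith("Tag:")
    ]
    if not tags:
        raise ValueError("no Tag: line found in WHEEL metadata")
    # Simplification: use the first tag only. Wheels with several tags use a
    # compressed form (e.g. 'py2.py3-none-any'), and optional build tags are
    # ignored; neither case is handled here.
    return f"{name_version}-{tags[0]}.whl"
```

The cached file can then be copied to the returned name before calling pip, and the "installed" flag set only if pip's return code is 0.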

Compile cache errors
7. Wrong torchinductor path — was injecting/saving to ~/.cache/torch/inductor/ instead of the actual runtime path /tmp/torchinductor_root/. Cache was silently doing nothing.
8. gpu_arch detection quoting bug — Python f-string inside a double-quoted SSH command; the inner quotes broke on the remote shell and returned empty string → cache key was sm_pt24 instead of sm86_pt24. Fixed to single-quoted Python string inside the SSH command.
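One way to avoid this class of quoting bug is to let shlex.quote do the escaping instead of nesting quotes by hand. This is an alternative to the single-quote fix described above, not what the run actually used; build_ssh_argv and the python3 binary name on the pod are assumptions.

```python
import shlex

# One-liner to run on the pod; prints e.g. "sm86" on an Ampere card.
# This is a plain string, so the braces reach the remote Python untouched.
REMOTE_SNIPPET = (
    "import torch; "
    "major, minor = torch.cuda.get_device_capability(); "
    "print(f'sm{major}{minor}')"
)


def build_ssh_argv(host, snippet=REMOTE_SNIPPET):
    """Return an argv list for subprocess.run that survives the remote shell."""
    # shlex.quote wraps the snippet so the remote shell sees it as one argument,
    # even though it contains quotes, semicolons and parentheses.
    remote_cmd = f"python3 -c {shlex.quote(snippet)}"
    return ["ssh", host, remote_cmd]
```

Passing the list to subprocess.run without shell=True means only the remote shell parses the command, so there is a single layer of quoting to reason about. It is also worth failing loudly if the returned architecture string is empty, rather than building a key like sm_pt24.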

RunPod API errors
9. SUPPLY_CONSTRAINT — no A100/4090 instances available. Added 5-retry loop with 120s wait.
10. INTERNAL_SERVER_ERROR — transient RunPod API failure. Added retry with 30s wait.
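A retry wrapper along these lines could cover both cases. It is a sketch under assumptions: PodCreateError and create_pod are hypothetical stand-ins for whatever the RunPod client raises and calls, "5-retry" is read as at most 5 attempts, and the same cap is assumed for INTERNAL_SERVER_ERROR.

```python
import time

# Wait times per error code, taken from the two items above.
RETRY_WAIT_SECS = {"SUPPLY_CONSTRAINT": 120, "INTERNAL_SERVER_ERROR": 30}
MAX_ATTEMPTS = 5  # assumption: the same cap applies to both error codes


class PodCreateError(Exception):
    """Hypothetical error type carrying the API's error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def create_with_retry(create_pod, sleep=time.sleep):
    """Call create_pod() until it succeeds or MAX_ATTEMPTS is reached."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return create_pod()
        except PodCreateError as err:
            wait = RETRY_WAIT_SECS.get(err.code)
            if wait is None or attempt == MAX_ATTEMPTS:
                raise  # unknown error code, or out of attempts
            sleep(wait)
```

Injecting sleep as a parameter lets the loop be tested without actually waiting two minutes per attempt.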

Memory errors
11. FA2 OOM on A40 and L40S — both show only 44.4 GiB accessible VRAM despite 48 GiB spec, because ECC eats ~4 GiB. FA2's workspace buffers + the lm_head backward (allocates a 4 GiB tensor) exceed 44.4 GiB. Fixed by checking VRAM at setup time and skipping FA2 on < 47 GiB.
12. expandable_segments:True not supported — tried this as a fragmentation fix, but the CUDA version on those pods doesn't support the feature (the setting is ignored with only a warning). Confirmed the OOM is genuine capacity pressure, not fragmentation.
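The setup-time VRAM check might look roughly like this. The function names are hypothetical; the 47 GiB threshold comes from the fix above, and reading the total from torch.cuda.get_device_properties is an assumption about how the check was done.

```python
GIB = 1024 ** 3
MIN_FA2_VRAM_GIB = 47  # threshold from the fix above


def should_use_fa2(total_vram_bytes, min_gib=MIN_FA2_VRAM_GIB):
    """Return True only when the GPU exposes at least min_gib GiB of VRAM."""
    return total_vram_bytes / GIB >= min_gib


def visible_vram_bytes(device_index=0):
    """Read the VRAM the driver actually exposes (needs torch and a GPU)."""
    import torch  # imported lazily so the pure check above works anywhere

    return torch.cuda.get_device_properties(device_index).total_memory
```

With this split, an A40 or L40S reporting 44.4 GiB falls below the threshold and would fall back to another attention backend, while an 80 GiB card would keep FA2.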

Logging / observability errors
13. Logs lost on crash — only saved results at end of full run; a crash midway lost all per-combo logs. Fixed to save each combo's log immediately after it finishes.
14. FA backend always showing ? — _parse_log scanned for 'fa2'/'fa3'/'sdpa' in training output but the patched flash_attention.py never printed anything. Fixed by adding print(f"[nanochat] flash_attn backend: {_BACKEND.upper()}") to the patch.
15. Post-compile training time capped short — was min(POST_COMPILE_SECS, wait_secs - elapsed), which reduced training to near-zero if compile was slow. Fixed to always run exactly 600s post-compile.
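The "save each combo's log immediately" fix can be sketched as below. run_matrix and run_combo are hypothetical names, and catching the exception to continue with the next combo is my assumption rather than something the run description states.

```python
from pathlib import Path


def run_matrix(combos, run_combo, out_dir):
    """Run each combo and write its log to disk as soon as it finishes."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for name in combos:
        try:
            log_text, ok = run_combo(name), True
        except Exception as err:  # one bad combo should not lose the rest
            log_text, ok = f"CRASHED: {err!r}\n", False
        # Written before the next combo starts, so a later crash cannot lose it.
        (out / f"{name}.log").write_text(log_text)
        results[name] = ok
    return results
```

Even if the whole script dies midway, every combo that finished before that point already has its log on disk.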

nanochat robustness matrix run