About
I'm Chethan. I'm doing my MTech in AI at IISc — lived in Bengaluru my whole life, except for undergrad at NIT Surathkal.
I care about deep learning from first principles and hardware-software co-design. Specifically: can superintelligence be efficient enough that everyone runs their own, on their own hardware? Local LLMs, open systems, unix philosophy.
What I've been up to
Zero-order methods like SPSA estimate gradients with just two forward passes — but they pay a steep price: variance scales with dimension d, so per-coordinate SNR is roughly 1/d. Catastrophic at a trillion parameters.
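A minimal sketch of the two-forward-pass estimate described above, in plain Python. The names (spsa_gradient, quadratic_loss, mean_squared_error), the step size c and the toy quadratic loss are all illustrative assumptions, not code from the experiments mentioned here.

```python
import random


def spsa_gradient(loss, w, c=1e-3, rng=random):
    """One SPSA estimate: exactly two loss evaluations, whatever len(w) is."""
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = [rng.choice((-1.0, 1.0)) for _ in w]
    w_plus = [wi + c * di for wi, di in zip(w, delta)]
    w_minus = [wi - c * di for wi, di in zip(w, delta)]
    diff = (loss(w_plus) - loss(w_minus)) / (2.0 * c)
    # Dividing by delta_i is the same as multiplying by it, since delta_i is +/-1.
    return [diff * di for di in delta]


def quadratic_loss(w):
    """Toy loss (an assumption for illustration); its true gradient is w itself."""
    return 0.5 * sum(wi * wi for wi in w)


def mean_squared_error(d, trials=200, seed=0):
    """Average per-coordinate squared error of the estimate at w = (1, ..., 1)."""
    rng = random.Random(seed)
    w = [1.0] * d
    total = 0.0
    for _ in range(trials):
        g = spsa_gradient(quadratic_loss, w, rng=rng)
        total += sum((gi - wi) ** 2 for gi, wi in zip(g, w)) / d
    return total / trials


if __name__ == "__main__":
    for d in (4, 64, 1024):
        print(d, mean_squared_error(d))
```

On this toy quadratic every true gradient coordinate is 1, while the expected per-coordinate squared error works out to d - 1, so the printed numbers should grow roughly linearly with d. That is the 1/d signal-to-noise problem in miniature.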
What rescues this is the NTK regime. In massively overparameterised networks, the loss becomes nearly quadratic in each individual weight — higher-order terms average away across the rest of the network. A centred finite difference gives:
L(w+1) − L(w−1) = 2 · ∂L/∂w + (1/3) · ∂³L/∂w³ + …
error ≈ O(1/N) as parameters N → ∞ [dimensionality blessing]
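The expansion above can be checked numerically on a cubic, where the series stops after the third-derivative term and the identity is exact. The function names and the particular cubic are hypothetical, chosen only for the check.

```python
def centred_difference(loss, w, step=1.0):
    """L(w + step) - L(w - step), with step = 1 as in the ternary setting."""
    return loss(w + step) - loss(w - step)


def cubic(w):
    # Toy one-weight "loss" (an assumption): L(w) = w^3 - 2w.
    return w ** 3 - 2.0 * w


def cubic_first_derivative(w):
    return 3.0 * w ** 2 - 2.0


def cubic_third_derivative(w):
    return 6.0


if __name__ == "__main__":
    w = 0.5
    lhs = centred_difference(cubic, w)
    rhs = 2.0 * cubic_first_derivative(w) + cubic_third_derivative(w) / 3.0
    print(lhs, rhs)  # both are -0.5
```

The (1/3) · ∂³L/∂w³ term is the whole error here. The argument in the text is that this term, and everything after it, shrinks as the network grows.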
Sparsity is the other lever. If the gradient is s-sparse (only s of its d coordinates are nonzero, with s ≪ d), the effective dimension collapses and sample complexity drops from Ω(d/ε²) to Ω(s/ε²). BitNet suggested that ternary weights {−1, 0, +1} hit a sweet spot: natural sparsity makes zero-order practical, and the finite-difference step Δ = 1 lands exactly on valid weight values, so no rounding is required.
I've been doing toy work on a BitNet-style architecture trained on FineWeb — currently ~100M params, ~2B tokens. Would love to scale it and see where the zero-order + ternary story actually breaks down.
The memory wall is the real bottleneck in inference — not compute. In-memory computing (IMC) tries to fix this by doing multiply-accumulate directly inside the SRAM array, eliminating the constant weight-fetching that dominates power and latency.
Ternary weights make the hardware story almost embarrassingly clean. Each cell is 2 bits; the multiply collapses to a conditional add; peripheral logic simplifies drastically. A digital ternary IMC array at 7nm sits at ~0.15 µm²/cell. If you dedicate 80–90% of die to SRAM — which, for an inference-only chip, is entirely reasonable — you land at roughly 1–3 billion ternary weights on a single die, with weights never leaving the chip.
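The back-of-envelope arithmetic behind that range, as a small sketch. The 0.15 µm² cell area and the 80–90% SRAM fraction come from the text. The die areas (200 and 500 mm²) are my own assumptions for illustration, as is treating one cell as one ternary weight and ignoring peripheral overhead.

```python
CELL_AREA_UM2 = 0.15      # per ternary cell, figure quoted in the text
UM2_PER_MM2 = 1_000_000   # 1 mm^2 = 10^6 um^2


def weights_on_die(die_area_mm2, sram_fraction, cell_area_um2=CELL_AREA_UM2):
    """Ternary weights that fit, assuming one cell stores one weight."""
    sram_area_um2 = die_area_mm2 * sram_fraction * UM2_PER_MM2
    return sram_area_um2 / cell_area_um2


if __name__ == "__main__":
    # Die areas below are assumed, not taken from the text.
    for die_mm2, fraction in ((200, 0.80), (500, 0.90)):
        billions = weights_on_die(die_mm2, fraction) / 1e9
        print(f"{die_mm2} mm^2 at {fraction:.0%} SRAM: {billions:.2f}B weights")
```

With these assumed die sizes the result is about 1.07 billion weights at the small end and 3.0 billion at the large end, which matches the 1–3 billion range.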
I've been writing toy Verilog for ternary mat-mul IMC macros: bit-serial, popcount-based, sign-separated {+1, −1} paths. Looking forward to scaling this and thinking properly about what a full inference chip looks like.
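A software model of the same idea, in Python rather than Verilog. This is a sketch of one plausible bit-serial, popcount-based, sign-separated dot product, not the actual macro. It assumes unsigned activations of act_bits bits, and all function names are hypothetical.

```python
def pack_ternary(weights):
    """Split ternary weights into two bitmasks: one for +1, one for -1."""
    plus = minus = 0
    for i, w in enumerate(weights):
        if w == 1:
            plus |= 1 << i
        elif w == -1:
            minus |= 1 << i
        elif w != 0:
            raise ValueError(f"weight {w!r} is not ternary")
    return plus, minus


def bit_plane(activations, b):
    """Bitmask holding bit b of every activation (one bit-serial step)."""
    plane = 0
    for i, a in enumerate(activations):
        if (a >> b) & 1:
            plane |= 1 << i
    return plane


def popcount(x):
    return bin(x).count("1")


def ternary_dot(weights, activations, act_bits=8):
    """Dot product using only AND, popcount, subtract and shift."""
    plus, minus = pack_ternary(weights)
    acc = 0
    for b in range(act_bits):
        plane = bit_plane(activations, b)
        # Sign-separated paths: count hits on +1 weights and on -1 weights.
        partial = popcount(plus & plane) - popcount(minus & plane)
        acc += partial << b  # weight the partial sum by 2^b
    return acc


if __name__ == "__main__":
    w = [1, 0, -1, -1, 1]
    x = [10, 200, 3, 4, 255]
    print(ternary_dot(w, x), sum(wi * xi for wi, xi in zip(w, x)))
```

No multiplier appears anywhere: each bit plane needs two ANDs, two popcounts and one subtraction, which is the "conditional add" simplification in arithmetic form.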
Every deepfake leaves two universal artifacts. Face Inconsistency Artifacts (FIA): seams between the forged region and the real background. Up-Sampling Artifacts (USA): the decoder's spectral fingerprint, unavoidably printed whenever a generator up-samples a latent code back to pixels.
The USA is generator-specific — an SD VAE's fingerprint looks different from StyleGAN's. Which means reading the USA gives you a path to few-shot generalisation: five images from an unseen generator, and you know what to look for.
I'm using prototypical networks for this. The prototype for each generator class should capture that generator's USA signature. I'm working on making the network explicitly learn the USA — contrasting inside-mask (forged, USA-bearing) features against outside-mask (real) features during prototype learning, so the representations are discriminative in the frequency domain where the fingerprint lives.
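For reference, the core of a prototypical network is small: a class prototype is the mean of its support embeddings, and a query goes to the nearest prototype. The sketch below shows only that step, on toy 2-D vectors standing in for embeddings from some trained network. The generator labels and numbers are made up, and the inside-mask versus outside-mask contrast described above is not modelled.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(support):
    """support maps a class label to its few-shot embeddings."""
    return {label: mean_vector(embs) for label, embs in support.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Label of the prototype nearest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


if __name__ == "__main__":
    # Hypothetical embeddings: three support images per generator.
    support = {
        "generator_a": [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]],
        "generator_b": [[-1.0, 0.2], [-0.8, -0.2], [-1.2, 0.0]],
    }
    prototypes = build_prototypes(support)
    print(classify([0.7, 0.1], prototypes))
```

Everything interesting lives in the embedding function that this sketch takes as given. The prototype step only works if that embedding separates generators by their up-sampling fingerprint.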
Slowly, on the side
The unix philosophy — one tool, one job, composable, all text — is quietly the ideal substrate for agentic LLMs. Everything is a file. Everything is inspectable. Pipes compose tools the same way tool calls compose agents. An LLM that can shell out is already a capable agent; unix just makes that surface enormous and coherent.
NixOS takes this one step further: one declarative file describes your entire system state — packages, configs, services, dotfiles. Reproducible. Rollbackable. A future agent managing a NixOS system can reason about it completely because the whole system is just data. No hidden mutable state, no "works on my machine." One file to rule it all.
I'm moving toward this slowly. Currently on Arch, watching NixOS from a distance.
I came across the Stanford seminar on hyperdimensional computing and went deep. The core idea: work in very high-dimensional binary or bipolar vectors where random vectors are nearly orthogonal with overwhelming probability. Encode, bind, and bundle information with XOR, permutation, and addition. The math works out, and no gradient is required.
In a small experiment, trigram features are generated via permutations, and "learning" is just summing the encoded vectors of each class's examples into a single class vector. That's it. No loss function. No backprop. No Taylor series. No optimizer. And it works reasonably well.
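A minimal sketch of what such an experiment could look like, assuming bipolar (+1/−1) hypervectors, cyclic rotation as the permutation, and element-wise multiplication as binding (the bipolar counterpart of XOR). The dimension, function names and toy sentences are all hypothetical, not the actual setup.

```python
import math
import random

DIM = 2000  # hypervector dimension (assumed; larger gives cleaner separation)


def random_hypervector(rng):
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def lookup(ch, item_memory, rng):
    """Fixed random hypervector per character, created on first use."""
    if ch not in item_memory:
        item_memory[ch] = random_hypervector(rng)
    return item_memory[ch]


def rotate(v, k):
    """Cyclic shift by k positions: the permutation that encodes order."""
    k %= len(v)
    return v[-k:] + v[:-k] if k else list(v)


def encode_text(text, item_memory, rng):
    """Bundle (sum) the bound trigram vectors; text needs at least 3 chars."""
    total = [0] * DIM
    for i in range(len(text) - 2):
        a, b, c = (lookup(ch, item_memory, rng) for ch in text[i:i + 3])
        a2, b1 = rotate(a, 2), rotate(b, 1)
        for j in range(DIM):
            total[j] += a2[j] * b1[j] * c[j]  # bind by multiplication
    return total


def train(examples, item_memory, rng):
    """'Learning' is just adding each example's vector to its class vector."""
    classes = {}
    for label, text in examples:
        vec = encode_text(text, item_memory, rng)
        acc = classes.setdefault(label, [0] * DIM)
        for j in range(DIM):
            acc[j] += vec[j]
    return classes


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def classify(text, classes, item_memory, rng):
    query = encode_text(text, item_memory, rng)
    return max(classes, key=lambda label: cosine(query, classes[label]))


if __name__ == "__main__":
    rng = random.Random(0)
    memory = {}
    examples = [
        ("en", "the cat sat on the mat"),
        ("en", "the dog ate the bone"),
        ("es", "el gato come el pescado"),
        ("es", "la casa es grande"),
    ]
    classes = train(examples, memory, rng)
    print(classify("the cat ate", classes, memory, rng))
```

The only operations are rotation, multiplication, addition and a cosine at query time. Nothing in train looks at an error signal.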
This keeps me awake at night. Not because I think it replaces gradient descent — but because it makes me question what "learning" actually means, and whether the objective-function framing is as fundamental as we treat it.
Uses
Hardware
Software — almost all FOSS
Elsewhere