Mengjie Zhao
Infra engineer at LinkedIn working on latency-sensitive realtime ML inference for ads scoring and ranking at internet scale. Currently going deep on LLM inference serving — paged attention, continuous batching, disaggregated serving.
Notes, projects, and writing on infra for ML systems at scale.
CV · Writing · GitHub · LinkedIn · zmj0129@gmail.com
Selected projects
- Differential serving for CPU→GPU ads ranking: Dual-fire architecture that scores each live ad request on a TensorFlow CPU champion and a new PyTorch GPU candidate in parallel, with async overlap, asymmetric resolution, and a three-gate routing decision (experiment, tmax, config kill switch). This lets us A/B a new realtime inference stack without risking the served path.
- Early Stage Ranking infra (Two-Tower at ~10M scale): Co-designed the infra that collapses retrieval and first-pass ranking into a single Two-Tower scoring layer over the active creative corpus, with nearline embedding generation, version alignment between member and creative towers, and dot-product scoring on sharded indexers. The first end-to-end experiment delivered a CTR lift of over 20%.
- Entity cache rewrite for the ads serving fleet: Lead-authored the RFCs and rollout that moved hundreds of ads-serving hosts off direct Oracle reads onto a per-account-partitioned Kafka + Couchbase pipeline, decoupled the in-memory cache from the indexer, and shipped a portable cache library so adjacent ads services could onboard without a rewrite.
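
For a feel of the scoring step in the Early Stage Ranking project, here is a minimal sketch of dot-product scoring with a version-alignment check. It is illustrative only, not the production code: the `Embedding` type, the `version` tag, and `score_creatives` are hypothetical names, and skipping creatives whose embedding version differs from the member's is an assumed way to enforce alignment between the two towers.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Embedding:
    """An embedding plus the (hypothetical) model version that produced it."""
    version: str
    vector: Sequence[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product; raises if the two vectors differ in length."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions do not match")
    return sum(x * y for x, y in zip(a, b))


def score_creatives(
    member: Embedding,
    creatives: Dict[str, Embedding],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score creatives against one member embedding and return the top_k.

    Assumption: a creative is only comparable to the member if both
    embeddings came from the same model version, so mismatched
    versions are skipped rather than scored.
    """
    scored = [
        (creative_id, dot(member.vector, emb.vector))
        for creative_id, emb in creatives.items()
        if emb.version == member.version
    ]
    # Highest score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    member = Embedding("v2", [1.0, 0.0, 2.0])
    creatives = {
        "a": Embedding("v2", [0.5, 1.0, 0.5]),
        "b": Embedding("v2", [1.0, 1.0, 1.0]),
        "c": Embedding("v1", [9.0, 9.0, 9.0]),  # stale version, skipped
    }
    print(score_creatives(member, creatives, top_k=2))
```

In a sharded setup, each indexer shard would presumably run something like `score_creatives` over its own slice of the corpus, with the per-shard top-k lists merged afterwards; that merge step is left out here.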