The Discovery of Apache ZooKeeper’s Poison Packet

https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/

  • PagerDuty uses IPSec to encrypt IP payloads
  • ZooKeeper was reading a scheme_len value off the wire with no bounds check. Occasionally this value came through as more than 1 GB, and allocating a buffer of that size caused a Java OOM. The only plausible explanation was packet corruption.
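  • The Linux kernel ignores TCP/UDP checksums for IPSec NAT-T traffic in transport mode, assuming that the IPSec (ESP) integrity check is sufficient - this is supported by the RFC. That assumption doesn't entirely hold: it didn't cause the corruption, but it is why the corrupted packets PagerDuty was seeing reached ZooKeeper instead of being dropped. The relevant kernel code: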
/*
 * 2) ignore UDP/TCP checksums in case
 *    of NAT-T in Transport Mode, or
 *    perform other post-processing fixes
 *    as per draft-ietf-ipsec-udp-encaps-06,
 *    section 3.1.2
 */
if (x->props.mode == XFRM_MODE_TRANSPORT)
  skb->ip_summed = CHECKSUM_UNNECESSARY;
  • Intel’s AES kernel module (loaded for IPSec encryption) was ultimately responsible for the corruption itself.
  • Regular SSL uses AES as well, but TCP checksums aren’t skipped there!
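
The missing bounds check on scheme_len can be sketched as follows. This is a minimal illustration in C (to match the kernel excerpt above), not ZooKeeper's actual code, which is Java. It assumes a 4-byte big-endian length prefix, and the names read_scheme, read_be32 and MAX_SCHEME_LEN are hypothetical, as is the 1024-byte cap. The point is only that the declared length gets validated before anything is allocated or copied.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cap chosen for illustration; not a ZooKeeper constant. */
#define MAX_SCHEME_LEN 1024u

/* Decode a 4-byte big-endian length prefix (assumed wire format). */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Parse a length-prefixed field from buf (buf_len bytes) into out.
 * Returns the field length, or -1 if the declared length is
 * implausible, exceeds the bytes actually received, or does not
 * fit in the output buffer.
 */
static long read_scheme(const uint8_t *buf, size_t buf_len,
                        uint8_t *out, size_t out_cap)
{
    if (buf_len < 4)
        return -1;                  /* not even a full length prefix */

    uint32_t len = read_be32(buf);

    if (len > MAX_SCHEME_LEN)
        return -1;                  /* e.g. a corrupted 1+ GB value */
    if (len > buf_len - 4)
        return -1;                  /* claims more data than we have */
    if (len > out_cap)
        return -1;                  /* would overflow the caller's buffer */

    memcpy(out, buf + 4, len);
    return (long)len;
}
```

With this check, a length of 6 whose top byte is corrupted from 0x00 to 0x40 (declared length 0x40000006, just over 1 GB) is rejected up front instead of triggering a huge allocation.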