Direct Server Return

More Information

For a more detailed explanation of how to use Direct Server Return (DSR) to build a highly scalable and available ingress for Kubernetes, see the following blog post

What is DSR?

When enabled, DSR allows the service endpoint to respond directly to the client request, bypassing the service proxy. To achieve this, kube-router uses LVS's tunneling mode (see More Details About DSR for how this works).

Quick Start

You can enable DSR functionality on a per service basis.

Requirements:

The service is either a ClusterIP type service with an externalIP set on it, or a LoadBalancer type service
kube-router has been started with --service-external-ip-range configured at least once. This option can be specified multiple times for multiple ranges. The external IPs or LoadBalancer IPs must be included in these ranges.
kube-router must be run in service proxy mode with --run-service-proxy (this option defaults to true if left unspecified)
If you are advertising the service outside the cluster, --advertise-external-ip must be set
If kube-router is deployed as a Kubernetes pod:
hostIPC: true must be set for the pod
hostPID: true must be set for the pod
The container runtime (CRI) socket directory must be mounted into the kube-router pod via a hostPath volume mount. The entire directory must be mounted because the socket file may change when the container runtime restarts.
/etc/iproute2/rt_tables (or similar) must be read/write mounted into the kube-router pod via a hostPath volume mount. NOTE: since v6.5.0 of iproute2 this file has moved from /etc to /usr, in either /usr/lib/iproute2/rt_tables or /usr/share/iproute2/rt_tables, so this mount may need to be updated depending on which version of Linux you're deploying against. kube-router checks all three locations in the order listed above.
A pod network that allows IPIP-encapsulated traffic. The most notable exception is Azure, which does not transit IPIP-encapsulated packets on its network. In that scenario, you may be able to work around the issue by enabling the FoU (--overlay-encap=fou) and full overlay networking (--overlay-type=full) options in kube-router. This has not been well tested, but it should allow the DSR-encapsulated traffic to route correctly.
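
The rt_tables lookup order described in the requirements can be sketched as a small helper for checking which path to mount on a given node. This is only an illustration of the stated order, not kube-router's own code; the function name find_rt_tables and its root parameter are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate locations, in the order given in the requirements above.
RT_TABLES_CANDIDATES = (
    "/etc/iproute2/rt_tables",
    "/usr/lib/iproute2/rt_tables",
    "/usr/share/iproute2/rt_tables",
)


def find_rt_tables(
    candidates: Iterable[str] = RT_TABLES_CANDIDATES,
    root: str = "/",
) -> Optional[str]:
    """Return the first candidate that exists as a regular file, or None.

    `root` is a hypothetical knob that lets the check run against a
    mounted host filesystem or a test directory instead of "/".
    """
    for candidate in candidates:
        if (Path(root) / candidate.lstrip("/")).is_file():
            return candidate
    return None


if __name__ == "__main__":
    print(find_rt_tables() or "no rt_tables file found")
```

Whichever path this reports on the host is the one to use for the hostPath volume and its mountPath.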

To enable DSR, annotate the service with the kube-router.io/service.dsr=tunnel annotation:

kubectl annotate service my-service "kube-router.io/service.dsr=tunnel"
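
Before annotating a service, it can help to check it against the constraints this document describes. The sketch below is a hypothetical pre-flight check, not part of kube-router: it takes a Service object as a plain dict (for example, parsed from kubectl get service -o json) plus the ranges passed to --service-external-ip-range, and reports anything that looks incompatible with DSR. The function name dsr_problems and the message wording are assumptions.

```python
import ipaddress
from typing import Dict, List

DSR_ANNOTATION = "kube-router.io/service.dsr"


def dsr_problems(service: Dict, external_ip_ranges: List[str]) -> List[str]:
    """Return reasons why DSR would not apply to this Service (empty if none)."""
    problems = []
    meta = service.get("metadata") or {}
    spec = service.get("spec") or {}
    status = service.get("status") or {}

    # The annotation value that enables DSR, as given in this document.
    if (meta.get("annotations") or {}).get(DSR_ANNOTATION) != "tunnel":
        problems.append(f"missing annotation {DSR_ANNOTATION}=tunnel")

    # DSR only applies to external IPs or LoadBalancer IPs.
    ips = list(spec.get("externalIPs") or [])
    if spec.get("type") == "LoadBalancer":
        ingress = (status.get("loadBalancer") or {}).get("ingress") or []
        ips += [entry["ip"] for entry in ingress if "ip" in entry]
    if not ips:
        problems.append("no externalIPs or LoadBalancer IPs set")

    # Every such IP must fall inside a configured external IP range.
    networks = [ipaddress.ip_network(r) for r in external_ip_ranges]
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in networks):
            problems.append(f"{ip} is outside --service-external-ip-range")

    # Port remapping is not supported; a named targetPort is also flagged
    # here because this sketch cannot resolve it to a number.
    for port in spec.get("ports") or []:
        target = port.get("targetPort", port.get("port"))
        if target != port.get("port"):
            problems.append(
                f"port {port.get('port')} remaps to targetPort {target}"
            )
    return problems
```

An empty list means none of these particular checks failed; it does not cover the pod-level requirements such as hostIPC, hostPID, or the volume mounts.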

Things To Look Out For

In the current implementation, DSR is only available for external IPs or LoadBalancer IPs.
The current implementation does not support port remapping, so the service must use the same value for its port and target port.
For DSR to work correctly, an ipip tunnel to the pod is used. This reduces the MTU for the packet by 20 bytes. Because of the way DSR works, clients cannot use Path MTU Discovery (PMTUD) to detect this MTU reduction. For TCP-based services, kube-router mitigates this by using iptables to set the TCP MSS value to 20 bytes less than the MTU of kube-router's primary interface. This is not possible for UDP streams, so UDP streams that continuously use large packets may see a performance impact due to packet fragmentation. Additionally, if clients set the DF (Don't Fragment) bit, UDP services may see packet loss.
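
To make the UDP impact concrete, the arithmetic can be sketched as follows. This assumes IPv4 headers without options (20 bytes) and the standard 8-byte UDP header; the function names are hypothetical, and the 1500-byte MTU in the usage note is only an example value.

```python
IPV4_HEADER_LEN = 20  # bytes, IPv4 header without options; also the ipip overhead
UDP_HEADER_LEN = 8    # bytes


def tunnel_mtu(link_mtu: int) -> int:
    """MTU left for the inner packet after ipip encapsulation."""
    return link_mtu - IPV4_HEADER_LEN


def max_udp_payload(link_mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented tunneled packet."""
    return tunnel_mtu(link_mtu) - IPV4_HEADER_LEN - UDP_HEADER_LEN


def udp_will_fragment(payload_len: int, link_mtu: int) -> bool:
    """True if a datagram of this payload size exceeds the tunnel MTU."""
    return payload_len > max_udp_payload(link_mtu)
```

With a 1500-byte link MTU, tunnel_mtu gives 1480 and max_udp_payload gives 1452. A client sending 1472-byte payloads, which fit exactly in 1500 bytes without the tunnel, would therefore have its packets fragmented, or dropped if the DF bit is set.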

Kubernetes Pod Examples

As of kube-router-1.2.X and later, kube-router's DSR mode now works with CRI compliant container runtimes. Officially only containerd has been tested, but this solution should work with cri-o as well.

Most of what was said above also applies to non-docker container runtimes. However, there are some adjustments that you'll need to make:

You'll need to tell kube-router which container runtime socket to use via the --runtime-endpoint CLI parameter.
If you are running kube-router as a Kubernetes deployment, you'll need to make sure that you expose the correct socket via a hostPath volume mount.

Here is an example kube-router daemonset manifest with just the changes needed to enable DSR with containerd (this is not a full manifest, it is just meant to highlight differences):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-router
spec:
  template:
    spec:
      # ...
      volumes:
      - name: containerd-sock
        hostPath:
          path: /run/containerd/
      - name: rt-tables
        hostPath:
          path: /etc/iproute2/rt_tables
      # ...
      containers:
      - name: kube-router
        args:
        - --runtime-endpoint=unix:///run/containerd/containerd.sock
        # ...
        volumeMounts:
        - name: containerd-sock
          mountPath: /run/containerd/
          readOnly: true
        - name: rt-tables
          mountPath: /etc/iproute2/rt_tables
          readOnly: false
# ...

For a complete example, see the kube-router all-features manifest with the DSR requirements for containerd enabled.

More Details About DSR

To facilitate troubleshooting, it is worthwhile to explain how kube-router accomplishes DSR functionality.

kube-router adds iptables rules to the mangle table which mark incoming packets destined for DSR-based services with a unique FW mark. This mark is then used in later stages to identify the packet and route it correctly. Additionally, for TCP streams, there are rules that set the TCP MSS, since the effective MTU changes when the packets traverse an ipip tunnel later on.
kube-router adds the marks to an ip rule (see: ip-rule(8)). This ip rule then forces the incoming DSR service packets to use a specific routing table.
kube-router adds a new ip route table (at the time of this writing the table number is 78) which forces the packet to route to the host even though there are no interfaces on the host that carry the DSR IP address.
kube-router adds an IPVS server configured for the custom FW mark. When packets arrive on the localhost interface because of the above ip rule and ip route, IPVS will intercept them based on their unique FW mark.
When pods selected by the DSR service become ready, kube-router adds endpoints configured for tunnel mode to the above IPVS server. Each endpoint is configured in tunnel mode (as opposed to masquerade mode), which then encapsulates the incoming packet in an ipip packet. It is at this point that the pod's destination IP is placed on the ipip packet header so that a packet can be routed to the pod via the kube-bridge on either this host or the destination host.
kube-router then finds the targeted pod and enters its local network namespace. Once inside the pod's Linux network namespace, it sets up two new interfaces called kube-dummy-if and ipip. kube-dummy-if is configured with the externalIP address of the service.
When the ipip packet arrives inside the pod, the original packet addressed to the externalIP is extracted from the ipip packet via the ipip interface and delivered to the listening application via the kube-dummy-if interface.
When the application sends its response back to the client, it responds to the client's public IP address (since that is what it saw in the request's IP header) and the packet is returned directly to the client (as opposed to traversing the Kubernetes internal network and potentially making multiple intermediate hops).
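
When troubleshooting, it can help to picture the host-side steps above as the command-line invocations that would produce roughly the same state. The sketch below only builds those command lines as lists of strings; it does not run anything. It is an approximation for illustration, not what kube-router executes: the PREROUTING chain, the rr scheduler, the function name, and the fwmark passed in are all assumptions, and the steps performed inside the pod's network namespace are not covered.

```python
from typing import List

DSR_ROUTE_TABLE = 78  # table number quoted in the text; may differ by version


def dsr_host_commands(
    external_ip: str,
    port: int,
    fwmark: int,
    pod_ips: List[str],
    protocol: str = "tcp",
) -> List[List[str]]:
    """Build approximate CLI equivalents of the host-side DSR setup."""
    mark = str(fwmark)
    table = str(DSR_ROUTE_TABLE)
    commands = [
        # Mark packets destined for the DSR service in the mangle table.
        ["iptables", "-t", "mangle", "-A", "PREROUTING",
         "-d", external_ip, "-p", protocol, "--dport", str(port),
         "-j", "MARK", "--set-mark", mark],
        # Send marked packets to the dedicated routing table.
        ["ip", "rule", "add", "fwmark", mark, "table", table],
        # Treat everything in that table as local so the host accepts it.
        ["ip", "route", "add", "local", "default", "dev", "lo",
         "table", table],
        # IPVS virtual service keyed on the FW mark (scheduler is assumed).
        ["ipvsadm", "-A", "-f", mark, "-s", "rr"],
    ]
    # One real server per ready pod, in tunnel (ipip) mode via -i.
    for pod_ip in pod_ips:
        commands.append(
            ["ipvsadm", "-a", "-f", mark, "-r", f"{pod_ip}:{port}", "-i"]
        )
    return commands
```

Comparing this list against the output of iptables -t mangle -S, ip rule, ip route show table 78, and ipvsadm -Ln on a node is one way to see which stage of the DSR path is missing.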