How We Solved a Fascinating Issue with Istio Service Mesh, mTLS and npm
Introduction
We recently solved a fascinating issue with Istio mTLS for one of our clients. It was one of the more difficult issues we have worked on.
We were implementing security features using a service mesh.
In this post, we describe the setup (as closely as we can), the issue we faced, how we fixed it and the lessons we learned.
Setup
The client was a SaaS provider. The application was based on a microservice architecture with REST APIs.
The infrastructure ran on AWS and used services like API Gateway, NLB, EKS, Istio, CloudFront, ACM etc.
The architecture was similar to the following diagram.
The applications were developed in Node.js/TypeScript and shipped as Docker containers. The Docker base image was Node.js on Alpine. The client had dev, staging, pre-prod and prod environments.
They had unit tests and integration tests that ran on a CI/CD runner hosted on Kubernetes. The runner was deployed as shown below.
As part of the security improvements, we decided to implement end-to-end TLS encryption for the API, as there was no TLS after the NLB.
Improvements
To encrypt data end to end, we decided to enable mTLS at the cluster level with the Istio service mesh.
We deployed Istio through automation, using a combination of Helm, Terraform and CI/CD.
We then enabled strict mTLS on the cluster, again through automation, and redeployed all the applications to the dev environment via CI/CD.
The cluster-level Istio policy looked like the following:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT
After the applications were deployed, the pipeline triggered some unit tests and they all passed, so we decided to deploy the changes to the staging environment.
The deployment of the Istio Helm chart, the policies and the applications was successful in the staging environment.
Start of a strange issue
We hit our first hurdle when we ran the integration tests in the staging environment.
Most of the tests failed with a Connection Reset error on the runner pod.
The CI/CD runner was running in the test namespace and ran "npm run integration-tests".
The package.json for the integration tests looked like the following:
{
  ...
  "scripts": {
    "integration-tests": "./scripts/run_integration_tests.sh"
  },
  ...
}
The run_integration_tests.sh script ran some BDD tests against the product and order applications using dummy data.
Debugging
When the tests failed, we immediately suspected that it had to do with the STRICT mode of the Istio policy.
So we reverted our change and enabled "PERMISSIVE" mode so that the tests would pass while we figured out the cause of the issue.
The permissive cluster-level policy looked like this:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: PERMISSIVE
In PERMISSIVE mode, workloads accept both mTLS and plaintext traffic. A namespace-level policy, if present, takes precedence over this mesh-wide setting.
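To make the difference between the two modes concrete, here is a minimal Python sketch of how a receiving sidecar treats an incoming connection. It is a simplified model rather than Istio code, and the function name `accepts_connection` is our own invention for illustration.

```python
def accepts_connection(mode: str, client_uses_mtls: bool) -> bool:
    """Simplified model of a receiving sidecar under a PeerAuthentication mode.

    Only the two modes used in this post are modelled.
    """
    if mode == "STRICT":
        # Only mutual TLS is accepted; plaintext connections are refused.
        return client_uses_mtls
    if mode == "PERMISSIVE":
        # Both mutual TLS and plaintext are accepted.
        return True
    raise ValueError(f"mode not modelled in this sketch: {mode}")


if __name__ == "__main__":
    for mode in ("STRICT", "PERMISSIVE"):
        for mtls in (True, False):
            print(mode, "mTLS" if mtls else "plaintext", accepts_connection(mode, mtls))
```

In this model, a plaintext client only fails once the policy is switched to STRICT, which matches what we saw when the tests started failing.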
We started working on the issue in a separate cluster with STRICT mode enabled at the cluster level, then ran the following manual tests after exec-ing into the CI/CD runner pod.
- Ran "npm run integration-tests" from the shell. The tests failed.
- Ran curl against the product and order microservices. Both returned 200 OK (success).
- Ran node directly using "node test.js". The requests were successful.
- Ran a new version of the runner with an updated base image. The "npm run integration-tests" tests passed.
So we concluded that there was some issue with npm on that particular version of the Docker Node Alpine image.
We had three choices:
- Upgrade the base image, upgrade all applications and upgrade the environments
- Upgrade the base image of the runner but not the applications (the applications and the runner were using the same base Docker image)
- Investigate further to find out why the tests were failing and fix the issue
For strategic reasons we decided to investigate further, so we ran the following checks to narrow down the issue.
- Checked the logs of npm and the Istio proxy on the pod, but they were no help
- Checked iptables and user details on the container
- Checked the strace output of "npm run integration-tests"
From the last two checks we concluded that the traffic was trying to leave the pod as HTTP instead of HTTPS.
At this point we thought it might be caused by some kind of misbehaving npm child process. npm was running as the root user when the pod was launched.
The iptables rules on the pod looked like the following:
root@runner-axdyd-project-090-concurrent-0avrdf:/builds/WRszRSzH/0/products/product/mtls-test# iptables -t nat -L -v
Chain PREROUTING (policy ACCEPT 199 packets, 11940 bytes)
pkts bytes target prot opt in out source destination
199 11940 ISTIO_INBOUND tcp -- any any anywhere anywhere
Chain INPUT (policy ACCEPT 199 packets, 11940 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 3802 packets, 331K bytes)
pkts bytes target prot opt in out source destination
748 44880 ISTIO_OUTPUT tcp -- any any anywhere anywhere
Chain POSTROUTING (policy ACCEPT 4171 packets, 353K bytes)
pkts bytes target prot opt in out source destination
Chain ISTIO_INBOUND (1 references)
pkts bytes target prot opt in out source destination
0 0 RETURN tcp -- any any anywhere anywhere tcp dpt:15008
0 0 RETURN tcp -- any any anywhere anywhere tcp dpt:ssh
29 1740 RETURN tcp -- any any anywhere anywhere tcp dpt:15090
170 10200 RETURN tcp -- any any anywhere anywhere tcp dpt:15021
0 0 RETURN tcp -- any any anywhere anywhere tcp dpt:15020
0 0 ISTIO_IN_REDIRECT tcp -- any any anywhere anywhere
Chain ISTIO_IN_REDIRECT (3 references)
pkts bytes target prot opt in out source destination
0 0 REDIRECT tcp -- any any anywhere anywhere redir ports 15006
Chain ISTIO_OUTPUT (1 references)
pkts bytes target prot opt in out source destination
0 0 RETURN all -- any lo ip-127-0-0-6.eu-west-1.compute.internal anywhere
0 0 ISTIO_IN_REDIRECT all -- any lo anywhere !localhost owner UID match 1337
0 0 RETURN all -- any lo anywhere anywhere ! owner UID match 1337
376 22560 RETURN all -- any any anywhere anywhere owner UID match 1337
0 0 ISTIO_IN_REDIRECT all -- any lo anywhere !localhost owner GID match 1337
0 0 RETURN all -- any lo anywhere anywhere ! owner GID match 1337
3 180 RETURN all -- any any anywhere anywhere owner GID match 1337
0 0 RETURN all -- any any anywhere localhost
369 22140 ISTIO_REDIRECT all -- any any anywhere anywhere
Chain ISTIO_REDIRECT (1 references)
pkts bytes target prot opt in out source destination
369 22140 REDIRECT tcp -- any any anywhere anywhere redir ports 15001
Reason for the behaviour
As part of the service mesh, when a pod is deployed, the istio-init container runs first to install iptables rules in the pod.
The Envoy proxy is deployed as a sidecar, so that traffic from the app container is routed to the proxy and encrypted before it leaves the pod. This is how mTLS is achieved.
Let's say the CI/CD runner in the test namespace wants to reach an app on TCP port 9001 in the product namespace. The traffic flow would look like the following (not showing every step).
Normally the traffic goes Runner (http) -> iptables -> Istio proxy (converted to https) -> out of pod (https),
but when npm ran the tests the traffic went Runner (http) -> iptables -> out of pod (http).
We found that even though the runner was launched as root, the "npm run integration-tests" process was running with a different user ID.
When we checked the UID of the running process, it was 1337. So somehow, even though we ran the container as root, the tests ran with a different UID (1337).
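One way to check which UID a running process really has on Linux is to read the `Uid:` line of `/proc/<pid>/status`, which lists the real, effective, saved and filesystem UIDs. The Python sketch below is only an illustration of that check, not the exact commands we ran; the helper names and the sample text are hypothetical.

```python
def parse_proc_uids(status_text: str) -> dict:
    """Extract the four UIDs from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            real, effective, saved, fs = (int(x) for x in line.split()[1:5])
            return {"real": real, "effective": effective, "saved": saved, "fs": fs}
    raise ValueError("no Uid line found")


def process_uids(pid: int) -> dict:
    """Read the UIDs of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_proc_uids(fh.read())


if __name__ == "__main__":
    # Made-up sample resembling what the test process would have reported.
    sample = "Name:\tnode\nUid:\t1337\t1337\t1337\t1337\nGid:\t1337\t1337\t1337\t1337\n"
    print(parse_proc_uids(sample)["effective"])
```

The effective UID is the one that matters here, because that is what the iptables owner match sees on the socket.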
If we look closely at the iptables rules above and the Istio documentation, the magic UID 1337 appears several times. The Envoy proxy runs as a user with UID 1337 in the proxy container. The iptables rules use this UID to exempt the proxy's own traffic from redirection, so that it does not loop back into the proxy endlessly.
Because the npm tests were running with the same UID as the proxy, their traffic was not redirected to the Envoy proxy for encryption and left the pod unencrypted, as plain http.
Since STRICT mTLS mode was enabled, the plaintext traffic was rejected by the receiving side, hence the Connection Reset errors and the failed tests.
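The effect of the ISTIO_OUTPUT chain can be sketched as a first-match rule walk. The Python below is a rough model of the rules in the listing above, written only to illustrate why UID 1337 skips the proxy. The `Packet` class, the function name and the example IP addresses are our own assumptions, and real iptables matching has more detail than this.

```python
from dataclasses import dataclass

PROXY_ID = 1337  # UID and GID matched by the owner rules in the listing


@dataclass(frozen=True)
class Packet:
    uid: int        # UID of the process that owns the socket
    gid: int        # GID of the process that owns the socket
    out_iface: str  # outgoing interface, e.g. "eth0" or "lo"
    src: str        # source IP address
    dst: str        # destination IP address


def istio_output(pkt: Packet) -> str:
    """Return the target of the first ISTIO_OUTPUT rule that matches."""
    on_lo = pkt.out_iface == "lo"
    to_localhost = pkt.dst == "127.0.0.1"
    if on_lo and pkt.src == "127.0.0.6":
        return "RETURN"
    for owner in (pkt.uid, pkt.gid):  # the UID rules come first, then the GID rules
        if on_lo and not to_localhost and owner == PROXY_ID:
            return "ISTIO_IN_REDIRECT"
        if on_lo and owner != PROXY_ID:
            return "RETURN"
        if owner == PROXY_ID:
            return "RETURN"  # skips the proxy: this is what happened to npm
    if to_localhost:
        return "RETURN"
    return "ISTIO_REDIRECT"  # redirected to Envoy on port 15001


if __name__ == "__main__":
    # Hypothetical pod and service addresses.
    as_root = Packet(uid=0, gid=0, out_iface="eth0", src="10.0.0.4", dst="10.0.1.9")
    as_1337 = Packet(uid=1337, gid=1337, out_iface="eth0", src="10.0.0.4", dst="10.0.1.9")
    print(istio_output(as_root))  # ISTIO_REDIRECT
    print(istio_output(as_1337))  # RETURN
```

A process running as root is sent to ISTIO_REDIRECT and therefore through Envoy, while the same request from UID 1337 hits an early RETURN and leaves the pod untouched.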
To make things worse, we did not know about Istio's magic UID 1337 when we hit the issue. We only found it after looking at the iptables rules in the pod and then checking the Istio documentation.
Solution
To fix the issue without any changes to the runner image, we ran 'su node -c "npm run integration-tests"'. This fixed the issue because the tests now ran as the user "node", whose UID is different from 1337, and the process stayed running as that user.
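A cheap safeguard against a repeat is a pre-flight check that fails fast when the test process shares the proxy's identity. The Python sketch below is an assumption about how such a guard could look, not something that was part of the original fix; the function names are hypothetical.

```python
import os

ISTIO_PROXY_ID = 1337  # UID/GID that the iptables rules exempt from redirection


def shares_proxy_identity(uid: int, gid: int, proxy_id: int = ISTIO_PROXY_ID) -> bool:
    """True if traffic from this UID/GID would skip the sidecar."""
    return uid == proxy_id or gid == proxy_id


def assert_traffic_is_captured(uid=None, gid=None) -> None:
    """Raise if the given (or current effective) identity matches the proxy's."""
    uid = os.geteuid() if uid is None else uid
    gid = os.getegid() if gid is None else gid
    if shares_proxy_identity(uid, gid):
        raise RuntimeError(
            f"running as uid={uid} gid={gid}: traffic would bypass the Istio sidecar"
        )


if __name__ == "__main__":
    assert_traffic_is_captured()
    print("identity is fine, traffic will be redirected to the sidecar")
```

Such a check has to run inside the process that actually makes the requests (for example at the start of the test script), since it was the npm child process, not the shell, that ended up with UID 1337.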
The Actual Reason
After the fix, out of curiosity we dug deeper into why this was happening with npm and found two references: the npm issue "run-script changing from root-user to non-root user" and the npm promise-spawn package.
This is what the promise-spawn documentation has to say: "When the current user is root, this will use [`infer-owner`](http://npm.im/infer-owner) to find the owner of the current working directory, and run with that effective uid/gid. Otherwise, it runs as the current user always."
So because of this behaviour in one of npm's dependencies, the user is switched when npm is run as root. The istio-init container had interacted with the file system on the pod as the user with UID 1337 while creating the iptables rules (https://github.com/istio/istio/pull/20380/files), so UID 1337 effectively became the owner of the directory where the tests were run.
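The behaviour described in that quote can be approximated in a few lines. The Python sketch below is a simplified model of the idea, not the code of promise-spawn or infer-owner; the function name `infer_run_ids` is hypothetical and the current identity is passed in explicitly so that the logic can be tried without being root.

```python
import os
from typing import Tuple


def infer_run_ids(cwd: str, current_uid: int, current_gid: int) -> Tuple[int, int]:
    """Pick the uid/gid a child process would run with.

    Simplified model: root drops to the owner of the working directory,
    everyone else keeps their own identity.
    """
    if current_uid != 0:
        return current_uid, current_gid
    owner = os.stat(cwd)
    return owner.st_uid, owner.st_gid


if __name__ == "__main__":
    # As a non-root user the identity is unchanged.
    print(infer_run_ids(".", 1000, 1000))
    # As root, the result is whoever owns the directory. In our case that was 1337.
    print(infer_run_ids(".", 0, 0))
```

Under this model, running as root in a directory owned by UID 1337 yields a child process with UID 1337, which is exactly the identity the iptables rules exempt from the proxy.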
Conclusion
After we solved this issue, the client's head of product told us that it was one of the most fascinating issues he had seen. We hope you agree and enjoyed the walkthrough. In upcoming insights we will look at how we solved Docker container security, upgrades, package management and more.