DevSecOps

Infra Trouble Shooting - 3

istio 도입 과정

Last updated on March 16, 2025, 6:46 p.m.

안녕하세요, 오늘은 k8s 클러스터에 istio를 적용했다가 발생한 에러 아닌 에러에 대해 설명을 드리고자 합니다.

평소에 현업에 계신분들이라면 이런 말 많이 들어보셨을거에요.
“10분 뒤에 접속 해보세요”

저는 부끄럽게도 이제야 그 의미를 깨우칠 수 있었습니다.

우선 항상 그랬던것처럼 환경부터 설명드리겠습니다.
EC2 인스턴스 여러개에 k8s 클러스터를 구축했습니다. 그 안에 리소스로 서비스 메쉬 툴인 istio를 설치하였습니다. (참고로, istio를 선택한 이유는 각 서비스별 트래픽추적과 배포전략이나 여러모로 유용하기때문에 선택했습니다.)

문제

istio 설치 이후 각 서비스별 테스트를 해봤을땐 문제가 없었습니다. 그런데 서버만 재시작하면 꼭 30분동안은 502 gateway 에러가 발생하고, 그 이후엔 모든 서비스가 정상적으로 올라왔습니다.

하지만 로그에는 이렇다 할 에러 원인은 보이지 않았습니다.

우선 현상부터 정리 해보겠습니다.

모든 인스턴스를 내리고 올렸을 때, Istio가 envoy를 전파하는 과정에서 에러가 발생.
30분까지 502 Bad Gateway 에러가 발생 (백엔드 서비스와 istio간의 연결불량)
30분 이후부터는 연결이 재설정 되면서 정상적으로 가동됨
시간이 지남에 따라 정상적으로 서비스가 가동한다는건, gateway ingress(외부로부터 트래픽을 수신) 및 virtual service(수신한 트래픽을 라우팅 규칙에 따라 내부로 라우팅)는 제대로 작동 한다는거겠죠??

그럼 몇가지 의심가는 부분을 실제로 확인하면서 문제가 있는지 확인해볼게요.

우선 에러가 발생했으니 가장먼저 로그를 봐야겠죠?

#log istiod pod

2025-03-16T03:34:24.471264Z info    ads Incremental push, service be-test-community-service.be-namespace.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
2025-03-16T03:34:24.571769Z info    ads Push debounce stable[27] 1 for config ServiceEntry/be-namespace/be-test-community-service.be-namespace.svc.cluster.local: 100.268679ms since last change, 100.268497ms since last push, full=false
2025-03-16T03:34:24.571917Z info    ads XDS: Incremental Pushing:2025-03-16T03:33:42Z/7 ConnectedEndpoints:15 Version:2025-03-16T03:33:42Z/7
2025-03-16T03:34:30.036856Z info    ads Incremental push, service test-stock-module-service.be-namespace.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
2025-03-16T03:34:30.137290Z info    ads Push debounce stable[28] 1 for config ServiceEntry/be-namespace/test-stock-module-service.be-namespace.svc.cluster.local: 100.2412ms since last change, 100.240879ms since last push, full=false
2025-03-16T03:34:30.137412Z info    ads XDS: Incremental Pushing:2025-03-16T03:33:42Z/7 ConnectedEndpoints:15 Version:2025-03-16T03:33:42Z/7
2025-03-16T03:34:30.269882Z info    ads Incremental push, service be-test-community-service.be-namespace.svc.cluster.local at shard Kubernetes/Kubernetes has no endpoints
2025-03-16T03:34:30.370299Z info    ads Push debounce stable[29] 1 for config ServiceEntry/be-namespace/be-test-community-service.be-namespace.svc.cluster.local: 100.190119ms since last change, 100.189945ms since last push, full=false
2025-03-16T03:34:30.370722Z info    ads XDS: Incremental Pushing:2025-03-16T03:33:42Z/7 ConnectedEndpoints:15 Version:2025-03-16T03:33:42Z/7
2025-03-16T04:00:30.549363Z info    ads ADS: "10.244.1.4:58322" mysql-master-0.db-namespace-7 terminated
2025-03-16T04:00:30.616809Z info    ads ADS: new connection for node:mysql-master-0.db-namespace-19
2025-03-16T04:00:30.617536Z info    ads CDS: PUSH request for node:mysql-master-0.db-namespace resources:50 size:48.7kB cached:38/45
2025-03-16T04:00:30.617996Z info    ads EDS: PUSH request for node:mysql-master-0.db-namespace resources:40 size:8.6kB empty:0 cached:40/40
2025-03-16T04:00:30.623364Z info    ads LDS: PUSH request for node:mysql-master-0.db-namespace resources:51 size:88.6kB
2025-03-16T04:00:30.624581Z info    ads RDS: PUSH request for node:mysql-master-0.db-namespace resources:24 size:18.8kB cached:14/24
2025-03-16T04:00:44.760711Z info    ads ADS: "10.244.3.236:57954" my-redis-node-1.db-namespace-9 terminated
2025-03-16T04:00:45.257313Z info    ads ADS: new connection for node:my-redis-node-1.db-namespace-20
2025-03-16T04:00:45.259955Z info    ads CDS: PUSH request for node:my-redis-node-1.db-namespace resources:50 size:48.8kB cached:38/45
2025-03-16T04:00:45.260356Z info    ads EDS: PUSH request for node:my-redis-node-1.db-namespace resources:40 size:8.6kB empty:0 cached:40/40
2025-03-16T04:00:45.271184Z info    ads LDS: PUSH request for node:my-redis-node-1.db-namespace resources:49 size:83.1kB
2025-03-16T04:00:45.273761Z info    ads RDS: PUSH request for node:my-redis-node-1.db-namespace resources:24 size:18.8kB cached:14/24

타임스탬프를 보니 03:34:30까지 정상적으로 로그가 출력되다가 갑자기 4:00로 건너 뛰었습니다.

그럼 실제로 istio가 sidecar로 설정되어있는 서비스들이 정상적으로 안올라온걸까요??

ubuntu@k8s-master:~$ k get pods -n be-namespace
NAME                            READY   STATUS    RESTARTS       AGE
be-community-6cf6f6bf77-65c9z   2/2     Running   5 (48m ago)    3h27m
be-core-757bf8dd7d-6v2qb        2/2     Running   10 (49m ago)   3d21h
be-core-757bf8dd7d-6vbnr        2/2     Running   10 (49m ago)   3d21h
be-core-757bf8dd7d-9rzng        2/2     Running   10 (49m ago)   3d21h
be-piggy-84499477c5-r95l6       2/2     Running   6 (48m ago)    46h
be-stock-76544cb4cb-w4x58       2/2     Running   5 (48m ago)    46h

아니요, 서비스들은 정상적인거같아요. 혹시 모르니 istio가 주입된 sidecar도 살펴볼게요.

ubuntu@k8s-master:~$ k -n be-namespace logs be-community-6cf6f6bf77-65c9z -c istio-init
2025-03-16T05:18:11.950880Z info    Istio iptables environment:
ENVOY_PORT=
INBOUND_CAPTURE_PORT=
ISTIO_INBOUND_INTERCEPTION_MODE=
ISTIO_INBOUND_TPROXY_ROUTE_TABLE=
ISTIO_INBOUND_PORTS=
ISTIO_OUTBOUND_PORTS=
ISTIO_LOCAL_EXCLUDE_PORTS=
ISTIO_EXCLUDE_INTERFACES=
ISTIO_SERVICE_CIDR=
ISTIO_SERVICE_EXCLUDE_CIDR=
ISTIO_META_DNS_CAPTURE=
INVALID_DROP=

설정파일들은 제대로 올라온거같아요. kiali를 통해 봐도 서비스들은 에러가 없습니다.

해결

💡제가 도입부에 말씀드렸던 “10분 이따가 접속해보세요”가 갑자기 떠오르지 않나요??

저는 이타이밍에 번뜩이면서 떠올랐는데요.

이 문구가 “30분 이따가 접속해보세요”로 변경되면 정확히 맞아 떨어집니다.

그럼 가정해볼게요. 서버가 내려갔다가 다시 올라올때, 모종의 이유로( 제 생각엔 아마 실행순서가 아닐까 싶습니다.) 에러가 발생하고, istio가 정상적으로 실행 할 준비가 안됐을때 → 502 gateway 발생

그 후 자체적으로 30분 이후 재설정 을 진행하게되면, 그 땐 모든 서비스와 Istio가 정상적으로 작동된다.

이 가정이 맞다면 앞뒤가 맞습니다. 그럼 이제 초점을 proxy에서 에러 로그와, 실제 30분이 어디에 적용되어있나 살펴보겠습니다.

우선, istio-gatewayingress 로그만 살펴봤으므로 이번엔 istiod 로그를 살펴볼게요. (istiod는 istio 시스템의 control plane 역할을 합니다.)

2025-03-16T03:33:34.628473Z warning envoy config external/envoy/source/common/config/grpc_subscription_impl.cc:120
gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret thread=15

눈에 띄는 로그를 하나 발견했습니다. gRPC가 timed out이 발생했어요.

gRPC가 뭘까요??
gRPC는 구글에서 만든 원격 프로시저 호출(Remote Procedure Call, RPC) 프레임워크입니다.

그럼 istiod는 gRPC를 사용하겠네요?? 맞습니다.
Istiod는 사이드카 프록시(Envoy)와 통신할 때 gRPC 스트림을 사용하여 설정, 정책, 보안 인증서 등을 전파합니다.

서버 시작시 여러 에러가 발생하게되고, 그에따라 istiod가 다른 서비스에 존재하는 Envoy와 통신이 불가능한 상황이였습니다.
그럼 이 시간을 10분으로 줄여줄까요? 저희도 “10분 뒤에 접속해보세요” 를 해보자구요

어디에서 30분이라는 시간을 잡고있을까요?

ubuntu@k8s-master:~$ k describe deployments.apps istiod | grep -i args -A 7
    Args:
      discovery
      --monitoringAddr=:15014
      --log_output_level=default:info
      --domain
      cluster.local
      --keepaliveMaxServerConnectionAge
      30m

찾았습니다. istio는 기본적으로 keepaliveMaxServerConnectionAge 라는 환경변수를 지정해두고있고
이 옵션은 gRPC 연결을 30분까지만 유지하고 그 이후에는 연결을 강제로 종료하여 재설정하게 만듭니다.

30분으로 설정되어있는 이 값을 10분으로 변경하겠습니다.

2025-03-16T05:18:11.848013Z info    Envoy command: [-c etc/istio/proxy/envoy-rev.json --drain-time-s 45 --drain-strategy immediate --local-address-ip-version v4 --file-flush-interval-msec 1000 --disable-hot-restart --allow-unknown-static-fields --log-format %Y-%m-%dT%T.%fZ   %l  envoy %n %g:%#  %v  thread=%t -l warning --component-log-level misc:error]
2025-03-16T05:18:26.727692Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012
2025-03-16T05:18:26.733952Z info    cache   generated new workload certificate  latency=14.896086408s ttl=23h59m59.266060245s
2025-03-16T05:18:26.733986Z info    cache   Root cert has changed, start rotating root cert
2025-03-16T05:18:26.734001Z info    ads XDS: Incremental Pushing:0 ConnectedEndpoints:0 Version:
2025-03-16T05:18:26.734054Z info    cache   returned workload trust anchor from cache   ttl=23h59m59.265949333s
2025-03-16T05:18:27.646842Z info    ads ADS: new connection for node:.-2
2025-03-16T05:18:27.646852Z info    ads ADS: new connection for node:.-1
2025-03-16T05:18:27.647108Z info    cache   returned workload trust anchor from cache   ttl=23h59m58.352899309s
2025-03-16T05:18:27.647396Z info    cache   returned workload certificate from cache    ttl=23h59m58.352609151s
2025-03-16T05:18:27.647734Z info    ads SDS: PUSH request for node:. resources:1 size:1.1kB resource:ROOTCA
2025-03-16T05:18:27.647765Z info    ads SDS: PUSH request for node:. resources:1 size:4.0kB resource:default
2025-03-16T05:18:42.676390Z warning envoy config external/envoy/source/common/config/grpc_subscription_impl.cc:120  gRPC config: initial fetch timed out for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret   thread=15
2025-03-16T05:18:43.387930Z info    Readiness succeeded in 31.825214172s
2025-03-16T05:18:43.390113Z info    Envoy proxy is ready
2025-03-16T05:29:22.252320Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012
2025-03-16T05:39:41.966829Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012
2025-03-16T05:49:28.312224Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012
2025-03-16T05:59:08.640998Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012

아까는 30분이 걸리던 재설정이 10분으로 줄었습니다.

이렇게 저희도 30분뒤에 접속해보세요 대신, 10분뒤에 접속해보세요 라고 할 수 있겠네요!

감사합니다.

ec2 k8s istio

Infra Trouble Shooting - 3

istio 도입 과정

문제

해결

Leave a Comment:

Search

Categories