What I considered when migrating Prometheus/Alertmanager from a direct EC2 install to ECS (monitoring on AWS ECS)

Motivation

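Originally, I ran Prometheus installed directly on an Amazon Linux instance. Once Prometheus was actually in operation, the following issues and requests came up:

Avoid knowledge being tied to a single person
Managing AWS credentials
Knowing which configuration file to overwrite
Being able to verify behavior locally

Three months after rolling this out in production, I am writing down the techniques we currently use, the CI/CD setup around Prometheus, and how we manage the configuration files.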

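Assumptions

The task definitions of the monitored services include an exporter.
This article focuses on services running on ECS.
The specific settings of Alertmanager and Prometheus, and the meaning of their parameters, are omitted.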

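What this achieves

The Prometheus/Alertmanager configuration files are managed safely, and developers can edit and view them.
Instances added by scaling are detected through service discovery and can be monitored (for example with Grafana).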

What I did

exporter

As preparation, add the exporter as a container in the task definition: add prom/node-exporter to containerDefinitions. Remember to open the port (9100) in the security group as well.

Here I used the official Docker image, prom/node-exporter.

{
  "containerDefinitions": [
    {
      // Container definition for the application
      ...
    },
    {
      // Container definition for node_exporter
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": null,
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 9100,
          "protocol": "tcp",
          "containerPort": 9100
        }
      ],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "prom/node-exporter",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "node-exporter"
    }
  ],
  ...
}
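
Opening port 9100 on the security group, as mentioned above, could be scripted with the AWS CLI. This is only a minimal sketch and not necessarily how it was done here: the group ID and CIDR in the example are placeholder values, and the `AWS` variable exists only so the command can be dry-run with `echo`.

```shell
#!/bin/sh
# Sketch: allow inbound TCP 9100 (node_exporter) on a security group.
# AWS can be overridden (e.g. AWS=echo) to print the command instead of running it.
AWS="${AWS:-aws}"

open_exporter_port() {
  # $1: security group ID, $2: source CIDR allowed to scrape the exporter
  "$AWS" ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port 9100 \
    --cidr "$2"
}

# Example (placeholder values, assumptions for illustration):
# open_exporter_port sg-0123456789abcdef0 10.0.0.0/16
```

Restricting the CIDR to the network where Prometheus runs keeps the metrics endpoint from being reachable by everyone.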

Start the task; if the following request succeeds, the exporter is working.

$ curl ${public_ip}:9100/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.3784e-05
go_gc_duration_seconds{quantile="0.25"} 1.3784e-05
go_gc_duration_seconds{quantile="0.5"} 1.6655e-05
go_gc_duration_seconds{quantile="0.75"} 1.6655e-05
go_gc_duration_seconds{quantile="1"} 1.6655e-05
go_gc_duration_seconds_sum 3.0439e-05
go_gc_duration_seconds_count 2
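
The manual check above can be turned into a small script. The sketch below assumes that it is enough to look for metric names starting with `node_` (the prefix node_exporter uses for host metrics); the function name `has_node_metrics` is made up for this example.

```shell
#!/bin/sh
# Sketch: succeed only if the metrics text on stdin contains node_exporter
# host metrics (names starting with "node_"), not just Go runtime metrics.
has_node_metrics() {
  grep -q '^node_'
}

# Example (public_ip is the placeholder used in the curl command above):
# curl -fsS "${public_ip}:9100/metrics" | has_node_metrics && echo "exporter OK"
```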

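prometheus

Now for the main topic.

Managing the configuration on GitHub

Roughly, I paid attention to the following:

Do not include AWS credentials
Manage only the minimum set of configuration files
Apply credentials contained in the configuration files as variables at docker build time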

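To verify behavior locally, create docker-compose.yaml from the template (the generated file is listed in .gitignore) and write the AWS credentials in its environment section. Mistakes still happen, so I think it is worth installing git-secret as well.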
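
Creating the local file from the template can be scripted so that the ignore rule is never forgotten. This is a minimal sketch using the file names from this repository; the function name `init_local_compose` is an assumption made for the example.

```shell
#!/bin/sh
# Sketch: create docker-compose.yaml from the template and make sure git ignores it.
init_local_compose() {
  # $1: repository root (defaults to the current directory)
  repo="${1:-.}"
  # Copy the template only if no local file exists, so local edits are not overwritten
  [ -f "$repo/docker-compose.yaml" ] ||
    cp "$repo/template.docker-compose.yaml" "$repo/docker-compose.yaml"
  # Append the ignore rule only if it is not already listed
  grep -qxF 'docker-compose.yaml' "$repo/.gitignore" 2>/dev/null ||
    echo 'docker-compose.yaml' >> "$repo/.gitignore"
}

# Example:
# init_local_compose /path/to/repository
```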

At docker build time, envsubst replaces the variables in each YAML file with the values of the corresponding environment variables.
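
As a minimal sketch of the envsubst step: GNU envsubst accepts an optional list of variables as its first argument and then substitutes only those, which keeps other `$name` references in a file (for example `{{ $labels.instance }}` in a template) from being replaced with an empty string. The function name `render_prometheus_config` is made up for this example, and restricting the variable list is a suggestion rather than something the original setup is known to do.

```shell
#!/bin/sh
# Sketch: render a template with envsubst, substituting only the AWS variables.
# Passing the variable list as the first argument leaves every other
# "$name" reference in the file untouched.
render_prometheus_config() {
  # $1: template path, $2: output path
  envsubst '${AWS_REGION} ${AWS_ACCESS_KEY} ${AWS_SECRET_KEY}' < "$1" > "$2"
}

# Example (paths as used inside the container):
# render_prometheus_config /etc/prometheus/template.prometheus.yml /etc/prometheus/prometheus.yml
```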

Directory layout

├── _alert.dockerfile            // Alertmanager
├── _prom.dockerfile             // Prometheus
├── alert_manager
│   └── template.config.yaml     // notification targets used by Alertmanager
├── prom
│   ├── rules.yaml               // alert conditions
│   └── template.prometheus.yml  // Prometheus scrape settings & service discovery
└── template.docker-compose.yaml

prometheus.yaml

Note: the AWS variables are filled in by envsubst.

global:
  scrape_interval:     15s 
  evaluation_interval: 15s 
  external_labels:
      monitor: 'monitor'

rule_files:
  - rules.yaml
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
      - targets:
        - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
        -  prometheus:9090
        -  node-exporter:9100
  - job_name: 'AWS-RESOUCES'
    ec2_sd_configs:
      - region: ${AWS_REGION}
        access_key: ${AWS_ACCESS_KEY}
        secret_key: ${AWS_SECRET_KEY}
        port: 9100
    # Apply the following to scrape via the public IP
    relabel_configs:
      - source_labels: [__meta_ec2_public_ip]
        regex:  '(.*)'
        target_label: __address__
        replacement: '${1}:9100'
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance

rules.yaml

groups:
  - name: targets
    rules:
    - alert: monitor_service_down
      expr: up == 0
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "Monitor service non-operational"
        description: "Service {{ $labels.instance }} is down."
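
Before building the images, the rule file and the rendered configuration can be checked with promtool, which ships with Prometheus. The sketch below is one possible way to wire that in; the function name and the `PROMTOOL` override (used only for dry runs) are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: validate the rule file and the rendered config before building images.
# PROMTOOL can be overridden (e.g. PROMTOOL=echo) for a dry run.
PROMTOOL="${PROMTOOL:-promtool}"

check_prometheus_files() {
  # $1: rules file, $2: rendered prometheus.yml
  "$PROMTOOL" check rules "$1" && "$PROMTOOL" check config "$2"
}

# Example:
# check_prometheus_files prom/rules.yaml prom/prometheus.yml
```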

alert.conf

I got lazy and tested with the ${WEBHOOK} value hard-coded, but it should be possible to apply it with envsubst in the same way.

global:
  slack_api_url: '${WEBHOOK}'

route:
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
    - channel: '#alerts'
      text: "{{ .CommonAnnotations.summary }}"
      send_resolved: true

Verifying locally

Create docker-compose.yaml from template.docker-compose.yaml and write the credentials in environment. Prometheus service discovery also works as-is locally (possibly only for instances in a public subnet).

# Edit this part
environment:
  AWS_REGION: ${AWS_REGION}
  AWS_ACCESS_KEY: ${AWS_ACCESS_KEY}
  AWS_SECRET_KEY: ${AWS_SECRET_KEY}

# Build and start
$ cd /path/to/repository/
$ docker-compose build
$ docker-compose up
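
Once the containers are up, the Prometheus HTTP API endpoint /api/v1/targets can be used to confirm that service discovery found the instances. The sketch below simply counts targets reported as healthy; it assumes the compact JSON that Prometheus returns (no spaces around the colon), and the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: count healthy scrape targets in a /api/v1/targets response read from stdin.
count_up_targets() {
  # grep -o prints each match on its own line; wc -l counts them
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Example against the locally running Prometheus:
# curl -fsS http://localhost:9090/api/v1/targets | count_up_targets
```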

deploy

Push each image generated by docker-compose build (${REPO}_prometheus, ${REPO}_alertmanager).

docker-compose build
docker tag ${image_name}:latest ${URI}/${image_name}:latest
docker push ${URI}/${image_name}:latest
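
The tag and push steps are the same for both images, so they can be wrapped in one function. This is a minimal sketch; the login command in the comments assumes the registry is Amazon ECR, which the text does not state explicitly, and the `DOCKER` override exists only for dry runs.

```shell
#!/bin/sh
# Sketch: tag and push one locally built image to a registry.
# DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"

push_image() {
  # $1: local image name, $2: registry URI
  "$DOCKER" tag "$1:latest" "$2/$1:latest" &&
    "$DOCKER" push "$2/$1:latest"
}

# If the registry is Amazon ECR (an assumption), log in first:
# aws ecr get-login-password --region "$AWS_REGION" |
#   docker login --username AWS --password-stdin "$URI"
# push_image "${REPO}_prometheus" "$URI"
# push_image "${REPO}_alertmanager" "$URI"
```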

The run commands in the task definitions look like this.

For prometheus

Entry point ["sh","-c"]
Command ["envsubst < /etc/prometheus/template.prometheus.yml > /etc/prometheus/prometheus.yml ; /bin/prometheus --config.file=/etc/prometheus/prometheus.yml --web.console.libraries=/usr/share/prometheus/console_libraries --web.console.templates=/usr/share/prometheus/consoles"]

Port mapping: 9090→9090 tcp

# The following environment variables must also be set
- AWS_ACCESS_KEY
- AWS_REGION
- AWS_SECRET_KEY
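
If one of these variables is missing, envsubst silently writes an empty value into the configuration. A guard like the following could run before the envsubst step so the container fails fast instead; it is only a sketch, and `require_env` is a name invented for this example.

```shell
#!/bin/sh
# Sketch: fail fast when a required environment variable is missing or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Example: run before rendering the config in the container command
# require_env AWS_REGION AWS_ACCESS_KEY AWS_SECRET_KEY || exit 1
```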

For alertmanager

Entry point ["sh","-c"]
Command ["--config.file=/etc/alertmanager/config.yaml"]

Port mapping: 9093→9093 tcp

Summary

The result is collected in the repository below. I plan to keep experimenting as new issues come up.

wadason/prom_on_ecs_example

References

Note: covers how to apply environment variables with envsubst, among other things.

https://cross-black777.hatenablog.com/entry/2017/10/30/221644