Restarting a systemd service on dependency failure

26

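What is the proper approach to handling the restart of a service when one of its dependencies fails on startup (but succeeds after a retry)?

Here is a contrived repro to make the problem clearer.

a.service (simulates failure on the first attempt and success on the second attempt)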

[Unit]
Description=A

[Service]
ExecStartPre=/bin/sh -x -c "[ -f /tmp/success ] || (touch /tmp/success && sleep 10)"
ExecStart=/bin/true
TimeoutStartSec=5
Restart=on-failure
RestartSec=5
RemainAfterExit=yes

b.service (trivially succeeds once A has started)

[Unit]
Description=B
After=a.service
Requires=a.service

[Service]
ExecStart=/bin/true
RemainAfterExit=yes
Restart=on-failure
RestartSec=5

Let's start b:

# systemctl start b
A dependency job for b.service failed. See 'journalctl -xe' for details.

Logs:

Jun 30 21:34:54 debug systemd[1]: Starting A...
Jun 30 21:34:54 debug sh[1308]: + '[' -f /tmp/success ']'
Jun 30 21:34:54 debug sh[1308]: + touch /tmp/success
Jun 30 21:34:54 debug sh[1308]: + sleep 10
Jun 30 21:34:59 debug systemd[1]: a.service start-pre operation timed out. Terminating.
Jun 30 21:34:59 debug systemd[1]: Failed to start A.
Jun 30 21:34:59 debug systemd[1]: Dependency failed for B.
Jun 30 21:34:59 debug systemd[1]: Job b.service/start failed with result 'dependency'.
Jun 30 21:34:59 debug systemd[1]: Unit a.service entered failed state.
Jun 30 21:34:59 debug systemd[1]: a.service failed.
Jun 30 21:35:04 debug systemd[1]: a.service holdoff time over, scheduling restart.
Jun 30 21:35:04 debug systemd[1]: Starting A...
Jun 30 21:35:04 debug systemd[1]: Started A.
Jun 30 21:35:04 debug sh[1314]: + '[' -f /tmp/success ']'

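A has started successfully, but B is left in a failed state and is not retried.

EDIT

I added the following to both services, and now B starts successfully when A starts, but I can't explain why.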

[Install]
WantedBy=multi-user.target

Why would this affect the relationship between A and B?

EDIT2

The above "fix" does not work in systemd 220.

systemd 219 debug log

systemd219 systemd[1]: Trying to enqueue job b.service/start/replace
systemd219 systemd[1]: Installed new job b.service/start as 3454
systemd219 systemd[1]: Installed new job a.service/start as 3455
systemd219 systemd[1]: Enqueued job b.service/start as 3454
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1502
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1502]: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmpoldcoreos
systemd219 sh[1502]: + '[' -f /tmp/success ']'
systemd219 sh[1502]: + touch /tmp/success
systemd219 sh[1502]: + sleep 10
systemd219 systemd[1]: a.service start-pre operation timed out. Terminating.
systemd219 systemd[1]: a.service changed start-pre -> final-sigterm
systemd219 systemd[1]: Child 1502 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=killed status=15
systemd219 systemd[1]: a.service got final SIGCHLD for state final-sigterm
systemd219 systemd[1]: a.service changed final-sigterm -> failed
systemd219 systemd[1]: Job a.service/start finished, result=failed
systemd219 systemd[1]: Failed to start A.
systemd219 systemd[1]: Job b.service/start finished, result=dependency
systemd219 systemd[1]: Dependency failed for B.
systemd219 systemd[1]: Job b.service/start failed with result 'dependency'.
systemd219 systemd[1]: Unit a.service entered failed state.
systemd219 systemd[1]: a.service failed.
systemd219 systemd[1]: a.service changed failed -> auto-restart
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service holdoff time over, scheduling restart.
systemd219 systemd[1]: Trying to enqueue job a.service/restart/fail
systemd219 systemd[1]: Installed new job a.service/restart as 3718
systemd219 systemd[1]: Installed new job b.service/restart as 3803
systemd219 systemd[1]: Enqueued job a.service/restart as 3718
systemd219 systemd[1]: a.service scheduled restart job.
systemd219 systemd[1]: Job b.service/restart finished, result=done
systemd219 systemd[1]: Converting job b.service/restart -> b.service/start
systemd219 systemd[1]: a.service changed auto-restart -> dead
systemd219 systemd[1]: Job a.service/restart finished, result=done
systemd219 systemd[1]: Converting job a.service/restart -> a.service/start
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1558
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1]: Child 1558 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=exited status=0
systemd219 systemd[1]: a.service got final SIGCHLD for state start-pre
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1561
systemd219 systemd[1]: a.service changed start-pre -> running
systemd219 systemd[1]: Job a.service/start finished, result=done
systemd219 systemd[1]: Started A.
systemd219 systemd[1]: Child 1561 belongs to a.service
systemd219 systemd[1]: a.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: a.service changed running -> exited
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1563
systemd219 systemd[1]: b.service changed dead -> running
systemd219 systemd[1]: Job b.service/start finished, result=done
systemd219 systemd[1]: Started B.
systemd219 systemd[1]: Starting B...
systemd219 systemd[1]: Child 1563 belongs to b.service
systemd219 systemd[1]: b.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: b.service changed running -> exited
systemd219 systemd[1]: b.service: cgroup is empty
systemd219 sh[1558]: + '[' -f /tmp/success ']'

systemd 220 debug log

systemd220 systemd[1]: b.service: Trying to enqueue job b.service/start/replace
systemd220 systemd[1]: a.service: Installed new job a.service/start as 4846
systemd220 systemd[1]: b.service: Installed new job b.service/start as 4761
systemd220 systemd[1]: b.service: Enqueued job b.service/start as 4761
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2032
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[2032]: a.service: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 sh[2032]: + '[' -f /tmp/success ']'
systemd220 sh[2032]: + touch /tmp/success
systemd220 sh[2032]: + sleep 10
systemd220 systemd[1]: a.service: Start-pre operation timed out. Terminating.
systemd220 systemd[1]: a.service: Changed start-pre -> final-sigterm
systemd220 systemd[1]: a.service: Child 2032 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=killed status=15
systemd220 systemd[1]: a.service: Got final SIGCHLD for state final-sigterm.
systemd220 systemd[1]: a.service: Changed final-sigterm -> failed
systemd220 systemd[1]: a.service: Job a.service/start finished, result=failed
systemd220 systemd[1]: Failed to start A.
systemd220 systemd[1]: b.service: Job b.service/start finished, result=dependency
systemd220 systemd[1]: Dependency failed for B.
systemd220 systemd[1]: b.service: Job b.service/start failed with result 'dependency'.
systemd220 systemd[1]: a.service: Unit entered failed state.
systemd220 systemd[1]: a.service: Failed with result 'timeout'.
systemd220 systemd[1]: a.service: Changed failed -> auto-restart
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: Failed to send unit change signal for a.service: Transport endpoint is not connected
systemd220 systemd[1]: a.service: Service hold-off time over, scheduling restart.
systemd220 systemd[1]: a.service: Trying to enqueue job a.service/restart/fail
systemd220 systemd[1]: a.service: Installed new job a.service/restart as 5190
systemd220 systemd[1]: a.service: Enqueued job a.service/restart as 5190
systemd220 systemd[1]: a.service: Scheduled restart job.
systemd220 systemd[1]: a.service: Changed auto-restart -> dead
systemd220 systemd[1]: a.service: Job a.service/restart finished, result=done
systemd220 systemd[1]: a.service: Converting job a.service/restart -> a.service/start
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2132
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[1]: a.service: Child 2132 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=exited status=0
systemd220 systemd[1]: a.service: Got final SIGCHLD for state start-pre.
systemd220 systemd[1]: a.service: About to execute: /bin/true
systemd220 systemd[1]: a.service: Forked /bin/true as 2136
systemd220 systemd[1]: a.service: Changed start-pre -> running
systemd220 systemd[1]: a.service: Job a.service/start finished, result=done
systemd220 systemd[1]: Started A.
systemd220 systemd[1]: a.service: Child 2136 belongs to a.service
systemd220 systemd[1]: a.service: Main process exited, code=exited, status=0/SUCCESS
systemd220 systemd[1]: a.service: Changed running -> exited
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 sh[2132]: + '[' -f /tmp/success ']'

systemd

— Vadim

1

There is an upstream systemd issue tracking this: github.com/systemd/systemd/issues/1312

— JKnight

31

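I'll try to summarize my findings on this issue in case someone comes across it, since information on this topic is scarce.

Restart=on-failure applies only to process failures (it does not apply to failures caused by a dependency failing).
The fact that units which failed on a dependency were restarted under certain conditions once the dependency restarted successfully was a bug in systemd < 220: http://lists.freedesktop.org/archives/systemd-devel/2015-July/033513.html
If there is even the slightest chance that a dependency can fail on startup and you care about resilience, don't use Before/After; instead, check for some artifact that the dependency produces.

e.g.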

ExecStartPre=/usr/bin/test -f /some/thing
Restart=on-failure
RestartSec=5s

You could also use systemctl is-active <dependency>.

Very hacky, but I haven't found a better option.

In my opinion, the lack of any way to handle dependency failures is a flaw in systemd.
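A minimal sketch of that idea, assuming the check lives in a small helper script (the function name deps_active and any install path are hypothetical, not part of systemd). It exits non-zero while any listed unit is inactive, so a unit that calls it from ExecStartPre= together with Restart=on-failure would keep retrying until the dependency is up:

```shell
#!/bin/bash
# Hypothetical helper: succeed only when every unit named on the
# command line is active.

deps_active() {
  for unit in "$@"; do
    # `systemctl is-active --quiet UNIT` exits 0 only if UNIT is active.
    if ! systemctl is-active --quiet "$unit"; then
      echo "dependency not active: $unit" >&2
      return 1
    fi
  done
  return 0
}

# Run the check only when unit names were passed on the command line.
if [ "$#" -gt 0 ]; then
  deps_active "$@"
fi
```

Installed as, say, /usr/local/bin/deps-active (an assumed path), it could be referenced as ExecStartPre=/usr/local/bin/deps-active a.service.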

— Vadim

Yes, not to mention the lack of retries for mount points, which Lennart Poettering doesn't want to implement: github.com/systemd/systemd/issues/4468

— Hvisage

0

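This seems like the kind of thing you could script and easily put in a cron job. The basic logic would go something like this:

Check whether services a and b, and their dependencies, are all in a running/valid state. You will know best how to check that everything is working properly.
If everything is working properly, do nothing, or log that everything is working. Logging has the advantage that you can search previous log entries.
If something is broken, restart the services and jump back to the top of the script, where the status check of the services and dependencies happens. The jump should happen only if you are confident that the restart will succeed and the dependencies are likely to work; otherwise there is the potential for a loop.
Have cron run the script again after a while.

Once the script is set up, cron is a good place to test it. If cron turns out to be inefficient, the script makes a good starting point for writing a low-level system service that checks the status of other services and restarts them as needed. Depending on how much effort you want to invest, you could also set the script up to email you based on the results (unless, of course, the service in question is the network service).

— Matt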
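The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tested production watchdog: "working" is taken to mean that systemctl is-active reports the unit as active, and the function names and the MAX_ATTEMPTS variable are hypothetical. The retry count is bounded to avoid the loop mentioned in the third step.

```shell
#!/bin/bash
# Hypothetical cron watchdog: check units, restart the inactive ones,
# then re-check, giving up after a bounded number of attempts.

# Return 0 only if every listed unit is active.
all_active() {
  for unit in "$@"; do
    systemctl is-active --quiet "$unit" || return 1
  done
  return 0
}

# One watchdog pass over the listed units (list dependencies first).
watchdog_pass() {
  max="${MAX_ATTEMPTS:-3}"   # assumed default; tune as needed
  attempt=0
  while ! all_active "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "ERROR: giving up after $attempt restart attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    echo "WARN: restart attempt $attempt for: $*"
    for unit in "$@"; do
      # Restart only the units that are not currently active.
      systemctl is-active --quiet "$unit" || systemctl restart "$unit"
    done
  done
  echo "INFO: all active: $*"
  return 0
}

# Run a pass only when unit names were passed on the command line.
if [ "$#" -gt 0 ]; then
  watchdog_pass "$@"
fi
```

A crontab entry would then call the script with the units in start order, for example a.service before b.service, and redirect its output to a log file.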

This cron job should be run by the process/service manager; otherwise you are back to the SVR4 methods that systemd is trying to get away from...

— Hvisage

0

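After and Before only set the order in which services are started; your service files say "if A and B are both being started, A must be started before B".

Requires means that if this service is to be started, that service must be started first. In your example: "if B is started and A is not running, start A".

By adding WantedBy=multi-user.target, you are telling the system that the services should be started when the system is initializing multi-user.target. Presumably this means that, once you added it, you were letting the system start the services rather than starting them manually.

I'm not sure why this does not work in version 220; it may be worth trying 222. I'll dig out a VM and try your services when I get a chance.

— Michael Shaw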

1

I asked on systemd-devel, and the fact that it worked in 219 was a bug. The intended behavior is that units that failed because of a dependency are not restarted.

— Vadim

0

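I spent days on this, trying to get it to work the "systemd" way, but gave up in frustration and wrote a wrapper script to manage dependencies and failures. Each child service is a normal systemd service, with no "Requires", "PartOf", or any other hooks to other services.

My top-level service file looks like this: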

[Service]
Type=simple
Environment="REQUIRES=foo.service bar.service"
ExecStartPre=/usr/bin/systemctl start $REQUIRES
ExecStart=@PREFIX@/bin/top-service.sh $REQUIRES
ExecStop=/usr/bin/systemctl      stop $REQUIRES

So far so good. The top.service file controls foo.service and bar.service. Starting top starts foo and bar, and stopping top stops foo and bar. The final ingredient is the top-service.sh script, which monitors the services for failures:

#!/bin/bash

# This monitors REQUIRES services. If any service stops, all of the services are stopped and this script ends.

REQUIRES="$@"

if [ "$REQUIRES" == "" ]
then
  echo "ERROR: no services listed"
  exit 1
fi

echo "INFO: watching services: ${REQUIRES}"

end=0
while [[ $end == 0 ]]
do
  s=$(systemctl is-active ${REQUIRES} )
  if echo $s | egrep '^(active ?)+$' > /dev/null
  then
    # $s has embedded newlines, but echo $s seems to get rid of them, while echo "$s" keeps them.
    # echo INFO: All active, $s
    end=0
  else
    echo "WARN: ${REQUIRES}"
    echo WARN: $s
  fi

  if [[ $s == *"failed"* ]] || [[ $s == *"unknown"* ]]
  then
    echo "WARN: At least one service is failed or unknown, ending service"
    end=1
  else
    sleep 1
  fi
done

echo "INFO: done watching services, stopping: ${REQUIRES}"
systemctl stop ${REQUIRES}
echo "INFO: stopped: ${REQUIRES}"
exit 1

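— Mark Lakata

REQUIRES="$@" is inherently buggy code: it collapses an array into a string, throwing away the original boundaries between items, so the arguments from set -- "argument one" "argument two" become identical to those from set -- "argument" "one" "argument" "two". requires=( "$@" ) preserves the original data, so it can safely be expanded as systemctl is-active "${requires[@]}".

— Charles Duffy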
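A small bash demonstration of the difference, as a sketch only (the helper names count_args, count_flattened, and count_preserved are made up for illustration):

```shell
#!/bin/bash
# Print the number of arguments received.
count_args() { echo "$#"; }

# Collapse the arguments into one string, as REQUIRES="$@" does,
# then expand it unquoted the way the monitoring script does.
count_flattened() {
  local flat="$*"
  # Intentionally unquoted: word splitting breaks on every space.
  count_args $flat
}

# Keep the arguments in an array and expand it quoted (bash only).
count_preserved() {
  local -a requires=( "$@" )
  count_args "${requires[@]}"
}

count_flattened "argument one" "argument two"   # prints 4
count_preserved "argument one" "argument two"   # prints 2
```

Unit names rarely contain spaces, so the flattened form happens to work for foo.service and bar.service, but the array form stays correct for any argument.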

-1

This doesn't answer the question, but someone might need it (since this page shows up in search results):

It should be

[Service]
 Restart=always
 RestartSec=3

https://jonarcher.info/2015/08/ensure-systemd-services-restart-on-failure/

— Simon Dudkin

Please read the question more carefully. It's not about restarting a single failing service; it's about systemd's behavior when a dependency fails.

— Vadim