Restarting a systemd service on dependency failure

26

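What is the correct way to handle restarting a service when one of its dependencies fails on startup but succeeds after a retry?

Here is a contrived repro to make the problem clearer.

a.service (fails on the first attempt and succeeds on the second)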

[Unit]
Description=A

[Service]
ExecStartPre=/bin/sh -x -c "[ -f /tmp/success ] || (touch /tmp/success && sleep 10)"
ExecStart=/bin/true
TimeoutStartSec=5
Restart=on-failure
RestartSec=5
RemainAfterExit=yes

b.service (trivially succeeds once A has started)

[Unit]
Description=B
After=a.service
Requires=a.service

[Service]
ExecStart=/bin/true
RemainAfterExit=yes
Restart=on-failure
RestartSec=5

Let's start b:

# systemctl start b
A dependency job for b.service failed. See 'journalctl -xe' for details.

Logs:

Jun 30 21:34:54 debug systemd[1]: Starting A...
Jun 30 21:34:54 debug sh[1308]: + '[' -f /tmp/success ']'
Jun 30 21:34:54 debug sh[1308]: + touch /tmp/success
Jun 30 21:34:54 debug sh[1308]: + sleep 10
Jun 30 21:34:59 debug systemd[1]: a.service start-pre operation timed out. Terminating.
Jun 30 21:34:59 debug systemd[1]: Failed to start A.
Jun 30 21:34:59 debug systemd[1]: Dependency failed for B.
Jun 30 21:34:59 debug systemd[1]: Job b.service/start failed with result 'dependency'.
Jun 30 21:34:59 debug systemd[1]: Unit a.service entered failed state.
Jun 30 21:34:59 debug systemd[1]: a.service failed.
Jun 30 21:35:04 debug systemd[1]: a.service holdoff time over, scheduling restart.
Jun 30 21:35:04 debug systemd[1]: Starting A...
Jun 30 21:35:04 debug systemd[1]: Started A.
Jun 30 21:35:04 debug sh[1314]: + '[' -f /tmp/success ']'

A started successfully, but B is left in a failed state and is never retried.

EDIT

I added the following to both services, and now B starts successfully once A starts, but I cannot explain why.

[Install]
WantedBy=multi-user.target

Why would this affect the relationship between A and B?

EDIT 2

The "fix" above does not work on systemd 220.

systemd 219 debug log

systemd219 systemd[1]: Trying to enqueue job b.service/start/replace
systemd219 systemd[1]: Installed new job b.service/start as 3454
systemd219 systemd[1]: Installed new job a.service/start as 3455
systemd219 systemd[1]: Enqueued job b.service/start as 3454
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1502
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1502]: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmpoldcoreos
systemd219 sh[1502]: + '[' -f /tmp/success ']'
systemd219 sh[1502]: + touch /tmp/success
systemd219 sh[1502]: + sleep 10
systemd219 systemd[1]: a.service start-pre operation timed out. Terminating.
systemd219 systemd[1]: a.service changed start-pre -> final-sigterm
systemd219 systemd[1]: Child 1502 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=killed status=15
systemd219 systemd[1]: a.service got final SIGCHLD for state final-sigterm
systemd219 systemd[1]: a.service changed final-sigterm -> failed
systemd219 systemd[1]: Job a.service/start finished, result=failed
systemd219 systemd[1]: Failed to start A.
systemd219 systemd[1]: Job b.service/start finished, result=dependency
systemd219 systemd[1]: Dependency failed for B.
systemd219 systemd[1]: Job b.service/start failed with result 'dependency'.
systemd219 systemd[1]: Unit a.service entered failed state.
systemd219 systemd[1]: a.service failed.
systemd219 systemd[1]: a.service changed failed -> auto-restart
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service holdoff time over, scheduling restart.
systemd219 systemd[1]: Trying to enqueue job a.service/restart/fail
systemd219 systemd[1]: Installed new job a.service/restart as 3718
systemd219 systemd[1]: Installed new job b.service/restart as 3803
systemd219 systemd[1]: Enqueued job a.service/restart as 3718
systemd219 systemd[1]: a.service scheduled restart job.
systemd219 systemd[1]: Job b.service/restart finished, result=done
systemd219 systemd[1]: Converting job b.service/restart -> b.service/start
systemd219 systemd[1]: a.service changed auto-restart -> dead
systemd219 systemd[1]: Job a.service/restart finished, result=done
systemd219 systemd[1]: Converting job a.service/restart -> a.service/start
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1558
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1]: Child 1558 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=exited status=0
systemd219 systemd[1]: a.service got final SIGCHLD for state start-pre
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1561
systemd219 systemd[1]: a.service changed start-pre -> running
systemd219 systemd[1]: Job a.service/start finished, result=done
systemd219 systemd[1]: Started A.
systemd219 systemd[1]: Child 1561 belongs to a.service
systemd219 systemd[1]: a.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: a.service changed running -> exited
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1563
systemd219 systemd[1]: b.service changed dead -> running
systemd219 systemd[1]: Job b.service/start finished, result=done
systemd219 systemd[1]: Started B.
systemd219 systemd[1]: Starting B...
systemd219 systemd[1]: Child 1563 belongs to b.service
systemd219 systemd[1]: b.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: b.service changed running -> exited
systemd219 systemd[1]: b.service: cgroup is empty
systemd219 sh[1558]: + '[' -f /tmp/success ']'

systemd 220 debug log

systemd220 systemd[1]: b.service: Trying to enqueue job b.service/start/replace
systemd220 systemd[1]: a.service: Installed new job a.service/start as 4846
systemd220 systemd[1]: b.service: Installed new job b.service/start as 4761
systemd220 systemd[1]: b.service: Enqueued job b.service/start as 4761
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2032
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[2032]: a.service: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 sh[2032]: + '[' -f /tmp/success ']'
systemd220 sh[2032]: + touch /tmp/success
systemd220 sh[2032]: + sleep 10
systemd220 systemd[1]: a.service: Start-pre operation timed out. Terminating.
systemd220 systemd[1]: a.service: Changed start-pre -> final-sigterm
systemd220 systemd[1]: a.service: Child 2032 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=killed status=15
systemd220 systemd[1]: a.service: Got final SIGCHLD for state final-sigterm.
systemd220 systemd[1]: a.service: Changed final-sigterm -> failed
systemd220 systemd[1]: a.service: Job a.service/start finished, result=failed
systemd220 systemd[1]: Failed to start A.
systemd220 systemd[1]: b.service: Job b.service/start finished, result=dependency
systemd220 systemd[1]: Dependency failed for B.
systemd220 systemd[1]: b.service: Job b.service/start failed with result 'dependency'.
systemd220 systemd[1]: a.service: Unit entered failed state.
systemd220 systemd[1]: a.service: Failed with result 'timeout'.
systemd220 systemd[1]: a.service: Changed failed -> auto-restart
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: Failed to send unit change signal for a.service: Transport endpoint is not connected
systemd220 systemd[1]: a.service: Service hold-off time over, scheduling restart.
systemd220 systemd[1]: a.service: Trying to enqueue job a.service/restart/fail
systemd220 systemd[1]: a.service: Installed new job a.service/restart as 5190
systemd220 systemd[1]: a.service: Enqueued job a.service/restart as 5190
systemd220 systemd[1]: a.service: Scheduled restart job.
systemd220 systemd[1]: a.service: Changed auto-restart -> dead
systemd220 systemd[1]: a.service: Job a.service/restart finished, result=done
systemd220 systemd[1]: a.service: Converting job a.service/restart -> a.service/start
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2132
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[1]: a.service: Child 2132 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=exited status=0
systemd220 systemd[1]: a.service: Got final SIGCHLD for state start-pre.
systemd220 systemd[1]: a.service: About to execute: /bin/true
systemd220 systemd[1]: a.service: Forked /bin/true as 2136
systemd220 systemd[1]: a.service: Changed start-pre -> running
systemd220 systemd[1]: a.service: Job a.service/start finished, result=done
systemd220 systemd[1]: Started A.
systemd220 systemd[1]: a.service: Child 2136 belongs to a.service
systemd220 systemd[1]: a.service: Main process exited, code=exited, status=0/SUCCESS
systemd220 systemd[1]: a.service: Changed running -> exited
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 sh[2132]: + '[' -f /tmp/success ']'

systemd

— Vadim

1

There is an upstream systemd issue tracking this: github.com/systemd/systemd/issues/1312

— JKnight

31

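I'll try to summarize my findings on this issue in case someone comes across it while searching for information on this topic.

Restart=on-failure applies only to process failures (it does not apply to failures caused by a dependency failing).
The fact that failed dependent units were restarted under certain conditions when a dependency restarted successfully was a bug in systemd < 220: http://lists.freedesktop.org/archives/systemd-devel/2015-July/033513.html
If there is even a small chance that a dependency can fail on start and you care about resilience, don't use Before/After; instead, check for some artifact that the dependency produces.

For example: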

ExecStartPre=/usr/bin/test -f /some/thing
Restart=on-failure
RestartSec=5s

You could even use systemctl is-active <dependency>.
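As a rough sketch of that check, assuming a helper script is acceptable, something like the following could back an ExecStartPre= line. The function name check_dependency and its optional artifact argument are made up for illustration; only systemctl is-active --quiet comes from systemd itself. It returns non-zero until the dependency is active, which leaves the retrying to Restart=on-failure on the dependent unit.

```shell
#!/bin/sh
# Hypothetical helper (not part of systemd): succeed only once the dependency is usable.
# Usage: check_dependency UNIT [ARTIFACT]
check_dependency() {
    unit=$1
    artifact=${2:-}
    # `systemctl is-active --quiet UNIT` exits 0 only when the unit is active.
    systemctl is-active --quiet "$unit" || return 1
    # Optionally also require a file that the dependency is expected to create.
    [ -z "$artifact" ] || [ -e "$artifact" ]
}
```

Saved as a script that ends with check_dependency "$@", it could be wired in as ExecStartPre=/usr/local/bin/check-dependency.sh a.service /some/thing, where the script path is an assumption.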

This approach is very hacky, but I haven't found a better option.

In my opinion, the lack of any way to handle dependency failures is a flaw in systemd.

— Vadim

Yes, not to mention the retry for mount points that Lennart Poettering does not want to implement either: github.com/systemd/systemd/issues/4468

— Hvisage

0

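This looks like something that could be scripted and easily put into a cronjob. The basic logic would go something like this:

Check that services a and b and their dependencies are all in a valid state. You will know best how to verify that everything is working properly.
If everything is working properly, either do nothing or log that everything is working. Logging has the advantage that you can look up earlier log entries.
If something is broken, restart the services and jump back to the beginning of the script, where the service and dependency status checks happen. The jump should happen only if you are confident that the restart and the dependencies are likely to work; otherwise there is potential for a loop.
Have cron run the script again after a short while.

Once the script is set up, cron is a fine place to test it. If cron proves insufficient, the script is a good starting point for writing a low-level system service that checks the status of other services and restarts them as needed. Depending on how much effort you want to put in, you could also have the script email you depending on the outcome (unless, of course, the service in question is the network service).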
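The steps above could be sketched roughly as follows. This is a minimal sketch rather than a finished watchdog: the function name check_and_restart is invented here, it treats "active" as the only valid state, and it leaves the re-check to the next cron run instead of jumping back, which sidesteps the loop risk mentioned above.

```shell
#!/bin/sh
# Hypothetical cron-driven check: restart every listed unit that is not active.
# List dependencies first (for example a.service before b.service).
check_and_restart() {
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            # Everything is fine for this unit; just log it.
            echo "INFO: $unit is active"
        else
            echo "WARN: $unit is not active, restarting"
            systemctl restart "$unit" || echo "ERROR: restart of $unit failed"
        fi
    done
}
```

With check_and_restart "$@" appended as the last line, a crontab entry such as * * * * * /usr/local/bin/check-services.sh a.service b.service would run it every minute; the script path is an assumption.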

— Matt

This cronjob does what the process/service manager should be doing itself. Otherwise systemd just reverts to the SVR4 methods it tries not to be.

— Hvisage

0

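After and Before only set the order in which services are started. The service files say "if A and B are both going to be started, then A must be started before B".

Requires means that for this service to be started, that service must be started first, i.e. "if B is started and A isn't running, then start A".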
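To see which of the two kinds of relationship a loaded unit actually has, the properties can be read back with systemctl show. The wrapper below is only a small sketch; the function name unit_property is invented for illustration, and the parsing assumes the single NAME=value line that systemctl show -p prints for one property.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one property of a unit.
# Usage: unit_property UNIT PROPERTY
unit_property() {
    # `systemctl show -p NAME UNIT` prints a single NAME=value line.
    line=$(systemctl show -p "$2" "$1") || return 1
    # Strip everything up to and including the first "=".
    printf '%s\n' "${line#*=}"
}
```

For the units in the question, unit_property b.service Requires and unit_property b.service After should both list a.service, alongside whatever default dependencies systemd adds.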

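When you added WantedBy=multi-user.target, you told systemd that the services should be started when the system brings up multi-user.target. After adding it, you were presumably letting the system start the services rather than starting them manually.

I'm not sure why this wouldn't work on version 220; it might be worth trying 222. I'll dig up a VM and try your services when I get a chance.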

— Michael Shaw

1

I asked on systemd-devel: the fact that it worked in 219 was a bug. The intended behavior is that units which failed because of a dependency are not restarted.

— Vadim

0

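I spent days trying to get this to work the "systemd way", but gave up in frustration and wrote a wrapper script to manage the dependencies and failures. Each child service is a normal systemd service, with no "Requires" or "PartOf" or any other hooks to other services.

My top-level service file looks like this: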

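[Service]
Type=simple
# The assignment is quoted because the value contains a space.
Environment="REQUIRES=foo.service bar.service"
# $REQUIRES (without braces) is split on whitespace into one argument per unit.
ExecStartPre=/usr/bin/systemctl start $REQUIRES
ExecStart=@PREFIX@/bin/top-service.sh $REQUIRES
ExecStop=/usr/bin/systemctl stop $REQUIRES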

So far so good. The top.service file controls foo.service and bar.service: starting top starts foo and bar, and stopping top stops foo and bar. The final ingredient is top-service.sh, the script that monitors the services for failures:

#!/bin/bash

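# This monitors the REQUIRES services. If any service is reported as failed or unknown, all of the services are stopped and this script ends.

# Assumes unit names contain no whitespace: the unquoted ${REQUIRES} expansions below rely on word splitting.
REQUIRES="$@"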

if [ "$REQUIRES" == "" ]
then
  echo "ERROR: no services listed"
  exit 1
fi

echo "INFO: watching services: ${REQUIRES}"

end=0
while [[ $end == 0 ]]
do
  # is-active prints one state per line.
  s=$(systemctl is-active ${REQUIRES})
  # $s has embedded newlines; the unquoted echo $s joins them with spaces, while echo "$s" would keep them.
  if ! echo $s | grep -Eq '^(active ?)+$'
  then
    echo "WARN: ${REQUIRES}"
    echo WARN: $s
  fi

  if [[ $s == *"failed"* ]] || [[ $s == *"unknown"* ]]
  then
    echo "WARN: At least one service is failed or unknown, ending service"
    end=1
  else
    sleep 1
  fi
done

echo "INFO: done watching services, stopping: ${REQUIRES}"
systemctl stop ${REQUIRES}
echo "INFO: stopped: ${REQUIRES}"
exit 1
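The all-active test in the loop above depends on an unquoted echo to flatten the newlines in the is-active output. As a hedged alternative sketch, the same check can be done line by line; the function name all_active is invented for illustration and is not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every line of the given text is exactly "active".
# $1 holds the newline-separated states printed by `systemctl is-active`.
all_active() {
    # An empty result means nothing was checked, so treat it as "not all active".
    [ -n "$1" ] || return 1
    # grep -v selects lines that do not match, -x matches whole lines, -q suppresses output.
    # The leading "!" inverts the result: succeed when no such line exists.
    ! printf '%s\n' "$1" | grep -qvx 'active'
}
```

Inside the loop it would be called as all_active "$s", with the quotes keeping the newlines intact.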

— Mark Lakata

REQUIRES="$@" is innately buggy code. It collapses your array into a string, throwing away the original boundaries between items, so arguments created with set -- "argument one" "argument two" become identical to those created with set -- "argument" "one" "argument" "two". requires=( "$@" ) retains the original data, so it can safely be expanded as systemctl is-active "${requires[@]}".
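The loss of boundaries can be shown with a small sketch. To stay within POSIX sh it uses the positional parameters in place of the bash array, which is an assumption of this example; the function names count_args and demo are invented for illustration.

```shell
#!/bin/sh
# Print how many separate arguments were received.
count_args() { echo "$#"; }

demo() {
    set -- "argument one" "argument two"
    # Like REQUIRES="$@": everything collapses into one string.
    joined="$*"
    # Unquoted on purpose: the string is re-split on whitespace.
    echo "joined: $(count_args $joined)"      # prints "joined: 4"
    # Quoted "$@" keeps the original boundaries.
    echo "preserved: $(count_args "$@")"      # prints "preserved: 2"
}
```

Calling demo prints the two counts, 4 for the collapsed string and 2 for the preserved argument list.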

— Charles Duffy

-1

This is not an answer to the question, but someone might need it (since this page shows up in searches).

It should be:

[Service]
 Restart=always
 RestartSec=3

https://jonarcher.info/2015/08/ensure-systemd-services-restart-on-failure/

— Simon Dukin

Please read the question more carefully. It is not about restarting a single misbehaving service; it is about how systemd behaves when a service's dependency fails.

— Vadim