From e8f55568de602630e0ce60f7863bdaeeea26c1c0 Mon Sep 17 00:00:00 2001
From: Jeff King <peff@peff.net>
Date: Thu, 9 Sep 2021 18:57:57 -0400
Subject: t5562: use alarm() to interrupt timed child-wait
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The t5562 script occasionally takes 60 extra seconds to complete due to
a race condition in the invoke-with-content-length.pl helper.

The way it's supposed to work is this:

  - we set up a SIGCLD handler

  - we kick off http-backend and write to it with a set content-length,
    but _don't_ close the pipe

  - we sleep for 60 seconds, assuming that SIGCLD from http-backend
    finishing will interrupt us

  - after the sleep finishes (whetherby 60 seconds or because it was
    interrupted by the signal), we check a flag to see if our SIGCLD
    handler was called. If not, then we complain.

This usually completes immediately, because the signal interrupts our
sleep. But very occasionally the child process dies _before_ we hit the
sleep, so we don't realize it. The test still completes successfully
(because our $exited flag is set), but it takes an extra 60 seconds.

There's no way to check the flag and sleep atomically. So the best we
can do with this approach is to sleep in smaller chunks (say, 1 second)
and check the flag incrementally. Then we waste a maximum of 1 second if
we lose the race. This was proposed in:

  https://lore.kernel.org/git/20190218205028.32486-1-max@max630.net/

and it does work. But we can do better.

Instead of blocking on sleep and waiting for the child signal to
interrupt us, we can block on the child exiting and set an alarm signal
to trigger the timeout.

This lets us exit the script immediately when the child behaves (with no
race possible), and wait a maximum of 60 seconds when it doesn't.

Note one small subtlety: perl is very willing to restart the waitpid()
call after the alarm is delivered, even if we've thrown an exception via
die. "perldoc -f alarm" claims you can get around this with an eval/die
combo (and even has some example code), but it doesn't seem to work for
me with waitpid(); instead, we continue waiting until the child exits.

So instead, we'll instruct the child process to exit in the alarm
handler itself. In the original code this was done by calling
close($out). That would continue to work, since our child is always
http-backend, which should exit when its stdin closes. But we can be
even more robust against a hung or confused child by sending a KILL
signal, which should terminate it immediately.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

diff --git a/t/t5562/invoke-with-content-length.pl b/t/t5562/invoke-with-content-length.pl
index 0943474..718dd9b 100644
--- a/t/t5562/invoke-with-content-length.pl
+++ b/t/t5562/invoke-with-content-length.pl
@@ -13,11 +13,6 @@ my $body_data;
 defined read($body_fh, $body_data, $body_size) or die "Cannot read $body_filename: $!";
 close($body_fh);
 
-my $exited = 0;
-$SIG{"CHLD"} = sub {
-        $exited = 1;
-};
-
 # write data
 my $pid = open(my $out, "|-", @command);
 {
@@ -29,8 +24,13 @@ my $pid = open(my $out, "|-", @command);
 }
 print $out $body_data or die "Cannot write data: $!";
 
-sleep 60; # is interrupted by SIGCHLD
-if (!$exited) {
-        close($out);
+$SIG{ALRM} = sub {
+        kill 'KILL', $pid;
         die "Command did not exit after reading whole body";
+};
+alarm 60;
+
+my $ret = waitpid($pid, 0);
+if ($ret != $pid) {
+        die "confusing return from waitpid: $ret";
 }
-- 
cgit v0.10.2-6-g49f6