.ci/aws: Add 5 min sleep before launching p3dn
When p3dn instances fail to launch due to ICE (insufficient capacity errors), the
retry logic takes ~12 min (killing the existing p3dn instances, terminating security
groups, removing Elastic IPs, and then waiting 6 minutes). PR CI will run faster
overall if we can avoid the retry logic altogether by sleeping for 5 min beforehand.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
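
The trade-off described in the message can be checked with a rough back-of-the-envelope sketch. It uses the two figures stated above (a ~12 min retry penalty and a 5 min sleep) and assumes a simplified model in which an ICE costs exactly one retry cycle. The function names and the ICE probabilities are hypothetical and are not taken from the CI scripts or from measurements.

```python
RETRY_PENALTY_MIN = 12.0  # approximate retry cost quoted in the commit message
SLEEP_MIN = 5.0           # fixed sleep added by this commit


def expected_extra_minutes(p_ice, sleep_min=0.0, penalty_min=RETRY_PENALTY_MIN):
    """Expected added minutes per launch: a fixed sleep plus the retry
    penalty weighted by the probability of hitting an ICE.

    Simplification (an assumption): at most one retry cycle per launch.
    """
    if not 0.0 <= p_ice <= 1.0:
        raise ValueError("p_ice must be a probability between 0 and 1")
    return sleep_min + p_ice * penalty_min


def break_even_reduction(sleep_min=SLEEP_MIN, penalty_min=RETRY_PENALTY_MIN):
    """Drop in ICE probability the sleep must achieve to pay for itself."""
    return sleep_min / penalty_min


# Hypothetical ICE probabilities, chosen only to illustrate the arithmetic.
p_without_sleep = 0.8
p_with_sleep = 0.2

no_sleep = expected_extra_minutes(p_without_sleep)
with_sleep = expected_extra_minutes(p_with_sleep, sleep_min=SLEEP_MIN)

print(f"break-even reduction in ICE probability: {break_even_reduction():.2f}")
print(f"expected extra minutes without sleep: {no_sleep:.1f}")
print(f"expected extra minutes with sleep: {with_sleep:.1f}")
```

Under this model the sleep only pays off if it lowers the chance of an ICE by more than 5/12, or about 42 percentage points. Whether it does depends on how ODCR capacity actually behaves, which the commit itself describes as non-deterministic.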
a-szegel committed May 31, 2024
1 parent 50295f1 commit c102ecf
Showing 1 changed file with 9 additions and 1 deletion.
10 changes: 9 additions & 1 deletion .ci/aws/common.groovy
@@ -54,7 +54,6 @@ def wait_for_odcr_capacity(region, instance_count, odcr) {
sh ". venv/bin/activate; ./PortaFiducia/scripts/wait_for_odcr_capacity.py --region ${region} --odcr-id ${odcr} --required-capacity ${instance_count}"
}


def run_test_orchestrator_once(run_name, build_tag, os, instance_type, instance_count, region, config, odcr, addl_args) {
/*
* Run PortaFiducia/tests/test_orchestrator.py with given command line arguments
@@ -69,6 +68,15 @@ def run_test_orchestrator_once(run_name, build_tag, os, instance_type, instance_
kill_all_clusters(instance_type, region)
wait_for_odcr_capacity(region, instance_count, odcr)

/*
* p3dn clusters are getting ICE'ed within an ODCR when we try to launch them back to back.
* This is a non-deterministic workaround to help us increase our chances of not getting ICE'ed.
* Worst case, this increases our time to publish results on PRs by 15 minutes.
*/
if (instance_type == "p3dn.24xlarge") {
sh "sleep 300"
}

def cluster_name = get_cluster_name(build_tag, os, instance_type)
def args = "--config ${config} --os ${os} --odcr ${odcr} --instance-type ${instance_type} --instance-count ${instance_count} --region ${region} --cluster-name ${cluster_name} ${addl_args} --junit-xml outputs/${cluster_name}.xml"
def ret = sh (