Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to end process or retry upon connection reset error #135

Open
abaumann opened this issue Feb 9, 2024 · 4 comments
Open

Failure to end process or retry upon connection reset error #135

abaumann opened this issue Feb 9, 2024 · 4 comments

Comments

@abaumann
Copy link

abaumann commented Feb 9, 2024

We have a reproducible test case from a collaborator trying to use getm within Terra for TCGA data - this happens transiently, but can be reproduced when trying to download at scale from TCGA buckets. I am not sure why it happens as there are simply Google signed urls, but getm doesn't handle the error gracefully, leaving a process open and causing the WDL task to hang indefinitely (costing $ for the user).

In order to help resolve this I wanted to check if the following could be updated in getm:

  • when an error like this occurs, have getm exit properly rather than leave the process open
  • see if we can add some sort of retries for errors like this - not sure if that needs to be a user provided flag (like up to 3 retries) or if this can just be made to retry within reasonable default limits, but retry would be preferred over fully restarting a WDL
  • ideally: add timestamps to the beginning of all of the verbose logging, which helps a lot in the debugging of issues like this across systems

Thank you!

Here is output from getm running in very verbose logging mode which shows the error:

DEBUG:getm.cli:{"progress_class": "ProgressLogger"}
DEBUG:getm.cli:{"message": "validating manifest", "manifest": [{"url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/a03305a8-902d-41a2-8791-08c0dcaed0f7/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240129T200441Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=4c006e811411f156a76...682b", "filepath": "/cromwell_root", "checksum": "92cf29e2761443eb247354da91bdc9f7", "checksum-algorithm": "md5"}, {"url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/c05728b2-70bd-4a9b-a6d6-abc6127f9b5a/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240129T200443Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=4eb6a3cc2bbbb14a428ff80a8c74e3f...bd8", "filepath": "/cromwell_root", "checksum": "1b0e4d4571f949df5e266218a1af2ad4", "checksum-algorithm": "md5"}], "schema": {"type": "array", "items": {"type": "object", "properties": {"url": {"type": "string"}, "filepath": {"type": "string"}, "checksum": {"type": "string"}, "checksum-algorithm": {"type": "string", "enum": ["md5", "gs_crc32c", "s3_etag", "null"]}}, "required": ["url"], "dependencies": {"checksum": ["checksum-algorithm"], "checksum-algorithm": ["checksum"]}}}}
DEBUG:getm.cli:{"multipart_buffer_size": 104857600}
INFO:getm.cli:{"message": "initiating multipart download", "url": "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/a03305a8-902d-41a2-8791-08c0dcaed0f7/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=alexbaumann-1911%40dcf-prod.iam.gserviceaccount.com%2F20240129%2Fauto%2Fs...357dd225636cde32eef08510e01144c226b2ad682b", "expected_checksum": "92cf29e2761443eb247354da91bdc9f7", "checksum_algorithm": "md5"}
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 104.4MiB/s 2.453076s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 102.0MiB/s 5.017502s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 109.7MiB/s 7.001775s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 115.2MiB/s 8.891281s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 115.2MiB/s 11.107452s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 110.4MiB/s 13.914369s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 106.4MiB/s 16.840150s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 103.6MiB/s 19.760566s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 101.5MiB/s 22.693003s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 100.0MiB/s 25.596672s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 98.7MiB/s 28.521763s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 0% 310.5GiB 97.7MiB/s 31.447519s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 96.9MiB/s 34.359779s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 95.2MiB/s 37.633454s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 90.8MiB/s 42.295202s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 88.2MiB/s 46.439717s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 89.7MiB/s 48.525774s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 91.2MiB/s 50.532638s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.5MiB/s 52.563134s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 93.3MiB/s 54.882982s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 93.1MiB/s 57.742125s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.9MiB/s 60.629910s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.6MiB/s 63.552936s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 1% 310.5GiB 92.3MiB/s 66.548761s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 92.0MiB/s 69.555787s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.7MiB/s 72.567698s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.5MiB/s 75.553901s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.3MiB/s 78.505067s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.1MiB/s 81.459932s
INFO:getm.progress:/cromwell_root/e5813338-55aa-4e30-8ef3-80ff161cbe8d_wgs_gdc_realn.bam 2% 310.5GiB 91.1MiB/s 84.325711s
Process URLReaderKeepAlive-1:1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 444, in _error_catcher
yield
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 567, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 525, in _fp_read
data = self._fp.read(chunk_amt)
File "/usr/lib/python3.9/http/client.py", line 458, in read
n = self.readinto(b)
File "/usr/lib/python3.9/http/client.py", line 502, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.9/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.9/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/getm/reader.py", line 190, in run
bytes_read = handle.readinto(buf[stop: stop + self.chunk_size])
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 737, in readinto
temp = self.read(len(b))
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 593, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.9/dist-packages/urllib3/response.py", line 461, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/dist-packages/getm/reader.py", line 194, in run
buf.stop = stop
File "/usr/local/lib/python3.9/dist-packages/getm/concurrent/buffers.py", line 90, in __exit__
self.close()
File "/usr/local/lib/python3.9/dist-packages/getm/concurrent/buffers.py", line 82, in close
sm.close()
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
self._mmap.close()
BufferError: cannot close exported pointers exist
Exception ignored in: <function SharedMemory.__del__ at 0x7e5ca92ba550>
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 184, in __del__
self.close()
File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
self._mmap.close()
BufferError: cannot close exported pointers exist

@aednichols
Copy link

@xbrianh if you're still active around these parts, would love to get your thoughts. We're seeing this issue in production Terra fairly often. Thanks!

@aednichols
Copy link

cc @benedictpaten

@xbrianh
Copy link
Member

xbrianh commented May 1, 2024

Hi @aednichols It’s been a while, but I’ll try to take a look this weekend.

@xbrianh
Copy link
Member

xbrianh commented May 6, 2024

@abaumann @aednichols @benedictpaten

I've created a draft PR that addressed some of the ungraceful exit issues raised by @abaumann. This is by no means ready to merge but it should point you in the right direction, especially this snippet.

I'd be open to a brief discussion with whomever maintains this repo, just for old times sake :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants