Failing tests with Intel iGPU and its OpenCL driver #164

Closed
pjaaskel opened this issue Sep 27, 2022 · 24 comments
Labels
bug Something isn't working opencl Issues affecting only the OpenCL backend

Comments

@pjaaskel
Collaborator

pjaaskel commented Sep 27, 2022

With `export OverrideDefaultFP64Settings=1` and `export IGC_EnableDPEmulation=1` set to emulate doubles (#137).

Check the end of the thread for the current status.

@pjaaskel pjaaskel added this to the 0.9 - the first release milestone Sep 27, 2022
@pjaaskel
Collaborator Author

With #162:

The following tests FAILED:
	173 - Unit_hipMemcpyWithStream_MultiThread (Subprocess aborted)
	500 - Unit_hipStreamGetFlags_BasicFunctionalities (Failed)
	571 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
	572 - Unit_hipStreamPerThread_DeviceReset_1 (Subprocess aborted)
	579 - Stress_hipMalloc (Failed)
Errors while running CTest

@pjaaskel
Collaborator Author

With the brutal sync issue workaround for #152:

The following tests FAILED:
	500 - Unit_hipStreamGetFlags_BasicFunctionalities (Failed)
	571 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
	572 - Unit_hipStreamPerThread_DeviceReset_1 (Subprocess aborted)
	579 - Stress_hipMalloc (Failed)

@pjaaskel
Collaborator Author

With #162, now down to

The following tests FAILED:
	500 - Unit_hipStreamGetFlags_BasicFunctionalities (Failed)
	571 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
	572 - Unit_hipStreamPerThread_DeviceReset_1 (Subprocess aborted)

@pjaaskel
Collaborator Author

Iris Xe iGPU OpenCL (with #162 and @pvelesko's OpenCL event fix on top):

The following tests FAILED:
571 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
572 - Unit_hipStreamPerThread_DeviceReset_1 (Subprocess aborted)

Filters: Unit_hipStreamPerThread_MultiThread
===============================================================================
test cases: 1 | 1 passed
assertions: - none -

malloc(): unsorted double linked list corrupted
Aborted (core dumped)
catch/unit/streamperthread/hipStreamPerThread_DeviceReset "Unit_hipStreamPerThread_DeviceReset_1"
Filters: Unit_hipStreamPerThread_DeviceReset_1
===============================================================================
test cases: 1 | 1 passed
assertions: - none -

malloc(): unsorted double linked list corrupted
Aborted (core dumped)

Might be a single issue. @pvelesko @franz any bells ringing?

@pvelesko
Collaborator

Interesting, these are the same ones as in #146.

@pjaaskel
Collaborator Author

Yep, seems like heap corruption: something is overwriting the malloc() bookkeeping structures or similar. Could be serious. Have you run Valgrind recently?

@pjaaskel
Collaborator Author

This crashes both with the GPU and the CPU drivers. When I run it in valgrind with the CPU driver, it prints out:

Filters: Unit_hipStreamPerThread_DeviceReset_1
===============================================================================
test cases: 1 | 1 passed
assertions: - none -

CHIP error [TID 28045] [1664522125.458253727] : hipErrorOutOfMemory (clSVMAlloc failed) in /home/pjaaskel/src/chip-spv/src/backend/OpenCL/SVMemoryRegion.cc:40:allocate

CHIP error [TID 28045] [1664522125.490626259] : Caught Error: hipErrorOutOfMemory

Also:

==28143== Conditional jump or move depends on uninitialised value(s)
==28143==    at 0x7430723: __intel_sse2_strrchr (in /opt/intel/oneapi/lib/intel64/libtbb.so.12.6)
==28143==    by 0x7420F13: strrchr (string.h:254)
==28143==    by 0x7420F13: init_ap_data (dynamic_link.cpp:260)
==28143==    by 0x7420F13: _INTERNALebd713af::tbb::detail::r1::init_dl_data() (dynamic_link.cpp:297)
==28143==    by 0x5506F67: __pthread_once_slow (pthread_once.c:116)
==28143==    by 0x7420E79: __gthread_once (gthr-default.h:699)
==28143==    by 0x7420E79: call_once (mutex:786)
==28143==    by 0x7420E79: tbb::detail::r1::init_dynamic_link_data() (dynamic_link.cpp:331)
==28143==    by 0x400647D: call_init.part.0 (dl-init.c:70)
==28143==    by 0x4006567: call_init (dl-init.c:33)
==28143==    by 0x4006567: _dl_init (dl-init.c:117)
==28143==    by 0x55E1C84: _dl_catch_exception (dl-error-skeleton.c:182)
==28143==    by 0x400DFF5: dl_open_worker (dl-open.c:808)
==28143==    by 0x400DFF5: dl_open_worker (dl-open.c:771)
==28143==    by 0x55E1C27: _dl_catch_exception (dl-error-skeleton.c:208)
==28143==    by 0x400E34D: _dl_open (dl-open.c:883)
==28143==    by 0x54FD6BB: dlopen_doit (dlopen.c:56)
==28143==    by 0x55E1C27: _dl_catch_exception (dl-error-skeleton.c:208)
==28143== 
...

When I run with the GPU driver in Valgrind, it heisenbugs away. There might still be a race in the runtime related to memory copies/allocations. Any quick guesses @pvelesko, or should I dig deeper (to ensure this is not more serious)?

@pvelesko
Collaborator

I can't reproduce this on the OpenCL iGPU. On the OpenCL CPU side, the error seems to be coming from the OpenCL runtime since the test passes but it segfaults upon thread exit. @pjaaskel

@pjaaskel
Collaborator Author

It might come from the OpenCL runtime, or it might just show up there (memory corruption can surface that way). It fails for both the iGPU and the CPU here, but we can postpone it to 0.9.1.

@pjaaskel pjaaskel modified the milestones: 0.9 - the first release, 0.9.1 Sep 30, 2022
@pjaaskel
Collaborator Author

With the latest OpenCL fixes @pvelesko, this target regressed by one test:

The following tests FAILED:
	150 - Unit_hipMemset3DAsync_ConcurrencyMthread (Subprocess aborted)
	571 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
	572 - Unit_hipStreamPerThread_DeviceReset_1 (Subprocess aborted)
Errors while running CTest

@pvelesko
Collaborator

ok let me check this.

@pvelesko pvelesko added bug Something isn't working opencl Issues affecting only the OpenCL backend potential runtime bug labels Sep 30, 2022
@pjaaskel
Collaborator Author

pjaaskel commented Oct 3, 2022

With the latest 0.9 branch:

	173 - Unit_hipMemcpyWithStream_MultiThread (SEGFAULT)
	571 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
	572 - Unit_hipStreamPerThread_DeviceReset_1 (SEGFAULT)

@pvelesko
Collaborator

pvelesko commented Oct 3, 2022

```
	571 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
	572 - Unit_hipStreamPerThread_DeviceReset_1 (SEGFAULT)
```

are already root-caused and awaiting a fix from Intel.

```
	173 - Unit_hipMemcpyWithStream_MultiThread (SEGFAULT)
```

This one also seems like a bug in the OpenCL runtime; I need to make a C reproducer.

@pjaaskel
Collaborator Author

pjaaskel commented Oct 3, 2022

OK. Did you report the issues to https://github.com/intel/compute-runtime/issues?

@pvelesko
Collaborator

pvelesko commented Oct 3, 2022

@pjaaskel
Collaborator Author

pjaaskel commented Oct 3, 2022

That's for Level Zero, this is OpenCL, but maybe the same underlying issue.

@pvelesko
Collaborator

pvelesko commented Oct 3, 2022

As a joke and while drunk.

@pjaaskel
Collaborator Author

pjaaskel commented Oct 5, 2022

At 3 failing tests with 7f20613.

	 21 - stream (Failed)
	658 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
	659 - Unit_hipStreamPerThread_DeviceReset_1 (Subprocess aborted)

@pvelesko
Collaborator

pvelesko commented Oct 6, 2022

@pjaaskel I can only reproduce the stream failure on one of the systems at JLSE. It's not reproducible on the system that has the most stable runtime, which I was told to use as the reference.

@pjaaskel
Collaborator Author

pjaaskel commented Oct 6, 2022

OK, let's see if it goes away with future driver upgrades.

@CHIP-SPV CHIP-SPV deleted a comment from pjaaskel Oct 7, 2022
@pjaaskel
Collaborator Author

None of the 688 or so tests fail here anymore with OpenCL/iGPU. Well done!

@pjaaskel
Collaborator Author

Looking good still, but one of the tests is timing out:
...
99% tests passed, 1 tests failed out of 692

Label Time Summary:
cuda = 63.15 sec*proc (27 tests)
internal = 116.91 sec*proc (85 tests)

Total Test time (real) = 960.48 sec

The following tests did not run:
65 - sycl_chip_interop (Skipped)
66 - sycl_chip_interop_usm (Skipped)

The following tests FAILED:
620 - Unit_hipStreamCreate_MultistreamBasicFunctionalities (Timeout)
Errors while running CTest

@pjaaskel
Collaborator Author

Current (6407148) status:

99% tests passed, 7 tests failed out of 757
...
The following tests FAILED:
	376 - Unit_hipGraphNodeGetType_Functional (Failed)
	377 - Unit_hipGraphNodeGetType_NodeType (Failed)
	753 - ABM_AddKernel_MultiTypeMultiSize - int (Failed)
	754 - ABM_AddKernel_MultiTypeMultiSize - long (Failed)
	755 - ABM_AddKernel_MultiTypeMultiSize - float (Failed)
	756 - ABM_AddKernel_MultiTypeMultiSize - long long (Failed)
	757 - ABM_AddKernel_MultiTypeMultiSize - double (Failed)
Errors while running CTest

"Unit_hipGraphExecHostNodeSetParams_Negative" start time: Jan 16 15:08 EET
Output:
----------------------------------------------------------
CHIP error [TID 54506] [1673874492.764718369] : hipErrorInvalidValue (Failed to find the node in hipGraphExec_t) in /home/pjaaskel/src/chip$

CHIP error [TID 54506] [1673874492.764815782] : Caught Error: hipErrorInvalidValue
CHIP error [TID 54506] [1673874492.764901796] : hipErrorInvalidValue (Failed to find the node in hipGraphExec_t) in /home/pjaaskel/src/chip$

CHIP error [TID 54506] [1673874492.764910125] : Caught Error: hipErrorInvalidValue
CHIP error [TID 54506] [1673874492.764956995] : hipErrorInvalidValue (Failed to find the node in hipGraphExec_t) in /home/pjaaskel/src/chip$

CHIP error [TID 54506] [1673874492.764964494] : Caught Error: hipErrorInvalidValue
Filters: Unit_hipGraphExecHostNodeSetParams_Negative

The ABM cases fail because they request too large a work-group size (1000), while the iGPU supports at most 512.
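One way a test could avoid exceeding the work-group size limit is to clamp the requested size to the device maximum before launching. The following is only a sketch, not what the ABM tests or the runtime actually do: `LaunchConfig` and `makeLaunchConfig` are hypothetical names, and the limit is passed in as a plain number (in a HIP program it would presumably come from `hipDeviceProp_t::maxThreadsPerBlock`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper type: 1-D launch geometry.
struct LaunchConfig {
  std::size_t blockSize; // work-group size (threads per block)
  std::size_t gridSize;  // number of work-groups (blocks)
};

// Clamp the requested work-group size to the device limit, then pick
// enough work-groups to cover all elements (ceiling division).
// deviceMaxBlockSize is an assumption of this sketch: the caller is
// expected to have queried it from the device beforehand.
inline LaunchConfig makeLaunchConfig(std::size_t numElements,
                                     std::size_t requestedBlockSize,
                                     std::size_t deviceMaxBlockSize) {
  assert(deviceMaxBlockSize > 0);
  std::size_t block = std::min(requestedBlockSize, deviceMaxBlockSize);
  block = std::max<std::size_t>(block, 1); // guard against a zero request
  std::size_t grid = (numElements + block - 1) / block;
  return LaunchConfig{block, grid};
}
```

With the numbers from this thread, `makeLaunchConfig(1000000, 1000, 512)` would yield a work-group size of 512 instead of 1000. The kernel then needs a bounds check on the global index, because the last work-group may extend past `numElements`.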

@pjaaskel pjaaskel modified the milestones: 0.9.1, 1.0 Mar 22, 2023
@pjaaskel
Collaborator Author

Let's open a separate issue for each test we start to fix and assign it to whoever is working on it, to keep better track.
