Skip to content

Bug: There is a data race in the test_old test #1540

@YexuanXiao

Description

@YexuanXiao

While running the CI for my own fork of C++/WinRT, I observed that test_old fails randomly (typically requiring 100-500 runs to reproduce). The failure line is

REQUIRE(completed);
, and I have identified the root cause.

The following is a simplified version of the code, stripped of all unrelated distractions. It is equivalent to the content in async.cpp, with some comments added to aid understanding.

struct signal_done
{
    HANDLE signal;
    ~signal_done()
    {
        SetEvent(signal);
    }
};
IAsyncOperationWithProgress<std::uint64_t, std::uint64_t> AutoCancel_IAsyncOperationWithProgress(HANDLE go)
{
    signal_done d{ go };
    co_await resume_on_signal(go); // switches to the thread pool due to suspension.
    co_await std::suspend_never{}; // at this point, an exception is thrown due to cancellation being detected
    REQUIRE(false);
    co_return 0;
}
TEST_CASE("async, AutoCancel_IAsyncOperationWithProgress, 2")
{
    handle event { CreateEvent(nullptr, false, false, nullptr)}; // # 1 auto-reset event and initialized as unset
    IAsyncOperationWithProgress<std::uint64_t, std::uint64_t> async = AutoCancel_IAsyncOperationWithProgress(event.get());
    REQUIRE(async.Status() == AsyncStatus::Started);

    bool completed = false;     // # 2 not atomic and not protected by a mutex
    bool objectMatches = false;
    bool statusMatches = false;

    async.Completed([&](const IAsyncOperationWithProgress<std::uint64_t, std::uint64_t> & sender, AsyncStatus status)
    {
        completed = true; # 3
        objectMatches = (async == sender);
        statusMatches = (status == AsyncStatus::Canceled);
    });

    async.Cancel();
    SetEvent(event.get()); // #4 signal async to run
    REQUIRE(WaitForSingleObject(event.get(), INFINITE) == WAIT_OBJECT_0); // # 5 wait for async to be canceled
    REQUIRE(async.Status() == AsyncStatus::Canceled);
    REQUIRE_THROWS_AS(async.GetResults(), hresult_canceled); # 6

    REQUIRE(completed);     # 7
    REQUIRE(objectMatches);
    REQUIRE(statusMatches);
}

A simplified execution flow of this test is as follows:

  1. An auto-reset event is create. WaitForSingleObject is responsible for resetting it.
  2. Execute the coroutine and its body (C++/WinRT coroutine's initial_awaiter::suspend does not suspend the coroutine).
  3. signal_done is initialized; it will signal the event upon destruction.
  4. Execute co_await resume_on_signal(go);. This causes the coroutine to switch to the thread pool and suspend.
    bool await_ready() const noexcept
    {
    return WINRT_IMPL_WaitForSingleObject(m_handle, 0) == 0;
    }
    template <typename T>
    void await_suspend(impl::coroutine_handle<T> resume)
    {
    set_cancellable_promise_from_handle(resume);
    m_resume = resume;
    create_threadpool_wait();
    }
  5. The coroutine returns to the test function.
  6. Set the completion callback.
  7. Set the coroutine state to canceled.
  8. Signal the event.
  9. The coroutine resumes.
  10. A. co_await std::suspend_never{}; detects that the coroutine has been canceled and throws a canceled exception.
    B. The test function is suspended at step # 5, waiting for the signal.
  11. A. signal_done destructs and signals the event.
    B. The test function resumes execution.
  12. A. The coroutine executes final_suspend, which invokes the completion callback.
    struct final_suspend_awaiter
    {
    promise_base* promise;
    bool await_ready() const noexcept
    {
    return false;
    }
    void await_resume() const noexcept
    {
    }
    bool await_suspend(coroutine_handle<>) const noexcept
    {
    promise->set_completed();
    uint32_t const remaining = promise->subtract_reference();
    if (remaining == 0)
    {
    std::atomic_thread_fence(std::memory_order_acquire);
    }
    return remaining > 0;
    }
    };

    B. The test function checks the completed variable at step # 7.

The problem lies in step 4. The coroutine is resumed on the thread pool. Consequently, the completed callback is set to true on a thread pool thread (# 3). Simultaneously, the test thread checks the variable at step # 7. The setting of the completed variable and the checking of it are unordered. This causes the test to fail randomly. The failure is not always observed because step # 7 throws and catches an exception, which often slows down the test function, creating a timing window that allows the test to pass more often than it fails.

I believe the key to this issue lies in two points:

  1. Is this test correct?
  2. Does the completion callback have to be executed in final_suspend? Can it be executed by the Cancel function? By modifying basic_coroutine_foundation.h as follows, the test can also succeed.

void Cancel() noexcept
{
winrt::delegate<> cancel;
{
slim_lock_guard const guard(m_lock);
if (m_status.load(std::memory_order_relaxed) == AsyncStatus::Started)
{
m_status.store(AsyncStatus::Canceled, std::memory_order_relaxed);
if (cancellable_promise::originate_on_cancel())
{
m_exception = std::make_exception_ptr(hresult_canceled());
}
else
{
m_exception = std::make_exception_ptr(hresult_canceled(hresult_error::no_originate));
}
cancel = std::move(m_cancel);
}
}
if (cancel)
{
cancel();
}
cancellable_promise::cancel();
}

void Cancel() noexcept
{
    winrt::delegate<> cancel;
    async_completed_handler_t<AsyncInterface> completed; // NB
    {
        slim_lock_guard const guard(m_lock);
        if (m_status.load(std::memory_order_relaxed) == AsyncStatus::Started)
        {
            m_status.store(AsyncStatus::Canceled, std::memory_order_relaxed);
            if (cancellable_promise::originate_on_cancel())
            {
                m_exception = std::make_exception_ptr(hresult_canceled());
            }
            else
            {
                m_exception = std::make_exception_ptr(hresult_canceled(hresult_error::no_originate));
            }
            cancel = std::move(m_cancel);
            completed = std::move(this->m_completed); // NB
        }
    }
    if (cancel)
    {
        cancel();
    }
    cancellable_promise::cancel();
    if (completed) // NB
    {
        winrt::impl::invoke(completed, *this, AsyncStatus::Canceled); //NB
    }
}

This change gives the completion callback a chance to execute on the thread that calls Cancel. I'm not sure if this is a good idea or if it will break existing code. The documentation currently doesn't specify this in detail.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions