Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size

Question 1

C++17 added std::hardware_destructive_interference_size and std::hardware_constructive_interference_size. At first I thought they were simply a portable way to get the size of an L1 cache line, but that is an oversimplification.

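Questions:

How are these constants related to the L1 cache line size?
Is there a good example that demonstrates their use cases?
Both are defined static constexpr. Is that not a problem if you build a binary and run it on other machines with different cache line sizes? How can you protect against false sharing in that scenario, when you are not certain which machine your code will run on?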

Question 2

The intent of these constants is indeed to get the cache line size. The best place to read about the rationale for them is the proposal itself:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html

I'll quote a snippet of the rationale here for ease of reading:

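[...] the granularity of memory that does not interfere (to the first order) [is] commonly referred to as the cache-line size.

Uses of cache-line size fall into two broad categories:

Avoiding destructive interference (false sharing) between objects with temporally disjoint runtime access patterns from different threads.

Promoting constructive interference (true sharing) between objects which have temporally local runtime access patterns.

The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]

We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:

Destructive interference size: a number that is suitable as an offset between two objects to likely avoid false sharing due to different runtime access patterns from different threads.

Constructive interference size: a number that is suitable as a limit on two objects' combined memory footprint size and base alignment to likely promote true sharing between them.

In both cases these values are provided on a quality-of-implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exist nearly no standard-supported portable uses.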
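As a rough illustration of the two definitions quoted above, here is a minimal sketch of how each constant is meant to be used with alignas() and static_assert. The type names (thread_counters, hot_pair, hot_pair_slot) are hypothetical, and 64 is an assumed stand-in for the std:: constants, just as in the benchmark further down; it is not a value the proposal guarantees.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64 stands in for std::hardware_destructive_interference_size and
// std::hardware_constructive_interference_size; adjust it for your target.
constexpr std::size_t kDestructiveSize = 64;
constexpr std::size_t kConstructiveSize = 64;

// Destructive interference: each counter is written by a different thread, so
// each one is aligned to its own kDestructiveSize-byte region.
struct thread_counters {
    alignas(kDestructiveSize) std::atomic<long> produced{0};
    alignas(kDestructiveSize) std::atomic<long> consumed{0};
};

// Constructive interference: both fields are used together, so their combined
// footprint must not exceed kConstructiveSize.
struct hot_pair {
    int key;
    int value;
};
static_assert(sizeof(hot_pair) <= kConstructiveSize,
              "hot_pair would not fit in one cache line");

// Aligning the pair to its own size keeps it from straddling a line boundary
// (this works because sizeof(hot_pair) is a power of two here).
struct hot_pair_slot {
    alignas(sizeof(hot_pair)) hot_pair pair;
};
```

With these assumed values, thread_counters occupies two full 64-byte regions, while hot_pair_slot stays as small as the pair itself.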

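"How are these constants related to the L1 cache line size?"

In theory, pretty directly.

Assuming the compiler knows exactly which architecture you will be running on, these constants would almost certainly give you the L1 cache line size precisely. (As noted later, this is a big assumption.)

For what it's worth, I would almost always expect these two values to be the same. I believe the only reason they are declared separately is completeness. (That said, a compiler might want to estimate the L2 cache line size instead of the L1 cache line size for constructive interference; I don't know whether this would actually be useful, though.)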
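To see what a particular toolchain actually reports, something like the following sketch could be used. It relies on the standard feature-test macro __cpp_lib_hardware_interference_size; the names destructive_size, constructive_size and sizes_differ are hypothetical, and the 64-byte fallback is only an assumption for toolchains that do not provide the constants.

```cpp
#include <cstddef>
#include <new>  // declares the interference-size constants when available

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t destructive_size = std::hardware_destructive_interference_size;
constexpr std::size_t constructive_size = std::hardware_constructive_interference_size;
#else
// Assumption: 64 bytes is only a guess for toolchains without the constants.
constexpr std::size_t destructive_size = 64;
constexpr std::size_t constructive_size = 64;
#endif

// The standard requires both values to be at least alignof(std::max_align_t).
static_assert(destructive_size >= alignof(std::max_align_t), "unexpectedly small");
static_assert(constructive_size >= alignof(std::max_align_t), "unexpectedly small");

// True when the implementation distinguishes the two cases.
constexpr bool sizes_differ = destructive_size != constructive_size;
```

Printing destructive_size, constructive_size and sizes_differ from a small test program shows whether your implementation treats the two cases differently.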

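"Is there a good example that demonstrates their use cases?"

At the bottom of this answer I have attached a long benchmark program that demonstrates both false sharing and true sharing.

It demonstrates false sharing by allocating an array of int wrappers: in one case several elements fit in a single L1 cache line, and in the other a single element takes up a whole L1 cache line. In a tight loop, each thread picks a single, fixed element from the array and updates it repeatedly.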
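The arithmetic behind the array test can be sketched as below. lines_spanned is a hypothetical helper, and the numbers (16 threads, 4-byte and 64-byte elements, 64-byte lines) are taken from the sample output of this answer; the sketch also assumes the first element happens to start on a line boundary, which the benchmark does not guarantee.

```cpp
#include <cstddef>

// Number of cache lines covered by `count` adjacent elements of `elem_size`
// bytes, assuming the first element starts exactly on a line boundary.
constexpr std::size_t lines_spanned(std::size_t elem_size, std::size_t count,
                                    std::size_t line_size) {
    return (elem_size * count + line_size - 1) / line_size;
}

// 16 threads writing adjacent 4-byte elements all land on one 64-byte line,
static_assert(lines_spanned(4, 16, 64) == 1, "naive_int elements share a line");
// whereas 64-byte, 64-byte-aligned elements give every thread its own line.
static_assert(lines_spanned(64, 16, 64) == 16, "cache_int elements do not share");
```

In the first case every write invalidates the line for all other cores; in the second no two threads ever touch the same line.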

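It demonstrates true sharing by allocating a single pair of ints in a wrapper: in one case the two ints within the pair do not fit together in one L1 cache line, and in the other they do. In a tight loop, each element of the pair is updated repeatedly.

Note that the code that accesses the objects under test does not change; the only difference is the layout and alignment of the objects themselves.

I don't have a C++17 compiler (and I assume most people currently don't either), so I have replaced the constants in question with my own. You need to update these values to be accurate for your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).

Warning: the test uses all cores on your machine and allocates about 256 MB of memory. Don't forget to compile with optimizations!

On my machine, the output is: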

Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457

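I get a ~3.5x speedup by avoiding false sharing, and a ~1.7x speedup by ensuring true sharing.

"Both are defined static constexpr. Is that not a problem if you build a binary and run it on other machines with different cache line sizes? How can you protect against false sharing in that scenario, when you are not certain which machine your code will run on?"

This will indeed be a problem. These constants are not guaranteed to map to the cache line size of any particular target machine; they are intended to be the best approximation the compiler can muster.

This is noted in the proposal, and its appendix gives an example of how some libraries try to detect the cache line size at compile time, based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.

In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.: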

#include <cstddef>
#include <new>  // std::hardware_destructive_interference_size (C++17)

constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
  return KNOWN_L1_CACHE_LINE_SIZE;
#else
  return std::hardware_destructive_interference_size;
#endif
}

If you want to assume a particular cache line size at compile time, just define KNOWN_L1_CACHE_LINE_SIZE.
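When the target machine is not known at build time, one option is to check the compile-time assumption against what the operating system reports at startup. The following is only a sketch under stated assumptions: runtime_l1_line_size and padding_still_valid are hypothetical helpers, and _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension to sysconf that may be absent, or may report 0, on other platforms.

```cpp
#include <cstddef>
#if defined(__unix__) || defined(__linux__) || defined(__APPLE__)
#include <unistd.h>  // sysconf
#endif

// Returns the L1 data cache line size reported by the OS, or 0 when unknown.
// Assumption: _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may be
// missing, or may report 0, on other platforms.
inline std::size_t runtime_l1_line_size() {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    const long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    return reported > 0 ? static_cast<std::size_t>(reported) : 0;
#else
    return 0;
#endif
}

// True when padding chosen at compile time still separates lines of size
// `actual`: the assumed size must be a non-zero multiple of the real one.
// An unknown size (0) is treated as "cannot confirm".
constexpr bool padding_still_valid(std::size_t assumed, std::size_t actual) {
    return actual != 0 && assumed != 0 && assumed % actual == 0;
}
```

A program could call padding_still_valid(cache_line_size(), runtime_l1_line_size()) once at startup and log a warning when it returns false. This cannot repair a layout that is already compiled in; it only makes a wrong assumption visible.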

Hope this helps!

Benchmark program:

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <tuple>
#include <vector>

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_destructive_interference_size = 64;

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_constructive_interference_size = 64;

constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;

typedef unsigned useless_result_t;
typedef double elapsed_secs_t;

//////// CODE TO BE SAMPLED:

// wraps an int, default alignment allows false-sharing
struct naive_int {
    int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");

// wraps an int, cache alignment prevents false-sharing
struct cache_int {
    alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");

// wraps a pair of int, purposefully pushes them too far apart for true-sharing
struct bad_pair {
    int first;
    char padding[hardware_constructive_interference_size];
    int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");

// wraps a pair of int, ensures they fit nicely together for true-sharing
struct good_pair {
    int first;
    int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");

// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& vec) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    auto& element = vec[vec.size() / 2 + thread_index];

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        element.value = dist(mt);
    }

    return static_cast<useless_result_t>(element.value);
}

// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& pair) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        pair.first = dist(mt);
        pair.second = dist(mt);
    }

    return static_cast<useless_result_t>(pair.first) +
        static_cast<useless_result_t>(pair.second);
}

//////// UTILITIES:

// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
    explicit threadlatch(const std::size_t count) :
        count_{ count }
    {}

    void count_down_and_wait() {
        std::unique_lock<std::mutex> lock{ mutex_ };
        if (--count_ == 0) {
            cv_.notify_all();
        }
        else {
            cv_.wait(lock, [&] { return count_ == 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    threadlatch latch{ num_threads + 1 };

    std::vector<std::future<useless_result_t>> futures;
    std::vector<std::thread> threads;
    for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
        std::packaged_task<useless_result_t()> task{
            std::bind(func, std::ref(latch), thread_index)
        };

        futures.push_back(task.get_future());
        threads.push_back(std::thread(std::move(task)));
    }

    const auto starttime = std::chrono::high_resolution_clock::now();

    latch.count_down_and_wait();
    for (auto& thread : threads) {
        thread.join();
    }

    const auto endtime = std::chrono::high_resolution_clock::now();
    const auto elapsed = std::chrono::duration_cast<
        std::chrono::duration<double>>(
            endtime - starttime
            ).count();

    useless_result_t result = 0;
    for (auto& future : futures) {
        result += future.get();
    }

    return std::make_tuple(result, elapsed);
}

// utility: sample the time it takes to run func on N threads
void run_tests(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    useless_result_t final_result = 0;
    double avgtime = 0.0;
    for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
        const auto result_and_elapsed = run_threads(func, num_threads);
        const auto result = std::get<useless_result_t>(result_and_elapsed);
        const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);

        final_result += result;
        avgtime = (avgtime * trial + elapsed) / (trial + 1);
    }

    std::cout
        << "Average time: " << avgtime
        << " seconds, useless result: " << final_result
        << std::endl;
}

int main() {
    const auto cores = std::thread::hardware_concurrency();
    std::cout << "Hardware concurrency: " << cores << std::endl;

    std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
    std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
    std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
    std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
    std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
    std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
    std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
    std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;

    {
        std::cout << "Running naive_int test." << std::endl;

        std::vector<naive_int> vec;
        vec.resize((1u << 28) / sizeof(naive_int));  // allocate 256 mebibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running cache_int test." << std::endl;

        std::vector<cache_int> vec;
        vec.resize((1u << 28) / sizeof(cache_int));  // allocate 256 mebibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running bad_pair test." << std::endl;

        bad_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
    {
        std::cout << "Running good_pair test." << std::endl;

        good_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
}

Question 3

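"I would almost always expect these values to be the same."

Regarding the above, I'd like to make a minor contribution to the accepted answer. A while ago I saw a very good use case in the folly library where these two values need to be defined separately. See the caveat about Intel Sandy Bridge processors: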

https://github.com/facebook/folly/blob/3af92dbe6849c4892a1fe1f9366306a2f5cbe6a0/folly/lang/Align.h

//  Memory locations within the same cache line are subject to destructive
//  interference, also known as false sharing, which is when concurrent
//  accesses to these different memory locations from different cores, where at
//  least one of the concurrent accesses is or involves a store operation,
//  induce contention and harm performance.
//
//  Microbenchmarks indicate that pairs of cache lines also see destructive
//  interference under heavy use of atomic operations, as observed for atomic
//  increment on Sandy Bridge.
//
//  We assume a cache line size of 64, so we use a cache line pair size of 128
//  to avoid destructive interference.
//
//  mimic: std::hardware_destructive_interference_size, C++17
constexpr std::size_t hardware_destructive_interference_size =
    kIsArchArm ? 64 : 128;
static_assert(hardware_destructive_interference_size >= max_align_v, "math?");

//  Memory locations within the same cache line are subject to constructive
//  interference, also known as true sharing, which is when accesses to some
//  memory locations induce all memory locations within the same cache line to
//  be cached, benefiting subsequent accesses to different memory locations
//  within the same cache line and heping performance.
//
//  mimic: std::hardware_constructive_interference_size, C++17
constexpr std::size_t hardware_constructive_interference_size = 64;
static_assert(hardware_constructive_interference_size >= max_align_v, "math?");
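To make the consequence of the folly excerpt concrete, here is a small sketch of what happens when the two values differ. The constants are the non-ARM values from the excerpt above; the wrapper isolated and the types queue_indices and node_header are hypothetical and are not part of folly.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: the non-ARM values from the folly excerpt above.
constexpr std::size_t kDestructiveSize = 128;
constexpr std::size_t kConstructiveSize = 64;

// Hypothetical wrapper that gives each T its own 128-byte region,
// i.e. a full pair of 64-byte cache lines.
template <typename T>
struct alignas(kDestructiveSize) isolated {
    T value{};
};

// Indices written by different threads: each gets its own line pair.
struct queue_indices {
    isolated<std::atomic<std::size_t>> head;
    isolated<std::atomic<std::size_t>> tail;
};

// Fields read together only need to fit within a single 64-byte line.
struct node_header {
    void* next;
    std::size_t hash;
};
static_assert(sizeof(node_header) <= kConstructiveSize,
              "node_header would not fit in one cache line");
```

Using 128 for the constructive check would wrongly accept a 100-byte header that spans two lines, while using 64 for the isolation would leave head and tail in adjacent lines, which is the case the Sandy Bridge note warns about.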

Question 4

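I have tested the code above, but I think it contains a minor error that gets in the way of understanding the underlying behavior: to prevent false sharing, a single cache line should not be shared between two distinct atomics. I changed the definitions of those structs as follows.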

struct naive_int
{
    alignas(sizeof(int)) std::atomic<int> value;
};

struct cache_int
{
    alignas(hardware_constructive_interference_size) std::atomic<int> value;
};

struct bad_pair
{
    // two atomics sharing a single 64-byte cache line
    alignas(hardware_constructive_interference_size) std::atomic<int> first;
    std::atomic<int> second;
};

struct good_pair
{
    // first cache line begins here
    alignas(hardware_constructive_interference_size) std::atomic<int> first;
    // this one is still in the first cache line
    std::atomic<int> first_s;
    // second cache line starts here
    alignas(hardware_constructive_interference_size) std::atomic<int> second;
    // this one is still in the second cache line
    std::atomic<int> second_s;
};

And the resulting run:

Hardware concurrency := 40
sizeof(naive_int)    := 4
alignof(naive_int)   := 4
sizeof(cache_int)    := 64
alignof(cache_int)   := 64
sizeof(bad_pair)     := 64
alignof(bad_pair)    := 64
sizeof(good_pair)    := 128
alignof(good_pair)   := 64
Running naive_int test.
Average time: 0.060303 seconds, useless result: 8212147
Running cache_int test.
Average time: 0.0109432 seconds, useless result: 8113799
Running bad_pair test.
Average time: 0.162636 seconds, useless result: 16289887
Running good_pair test.
Average time: 0.129472 seconds, useless result: 16420417

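I saw a lot of variance in the last result, but I never dedicated any core exclusively to this test. In any case, this ran on two Xeon 2690V2 processors, and from various runs using 64 or 128 for hardware_constructive_interference_size, I found that 64 was more than enough and that 128 made very poor use of the available cache.

I suddenly realized that your question helps me understand what Jeff Preshing was talking about.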