26

課題は、行列のパーマネントを計算するために可能な限り速いコードを書くことです。

永久n行列のn行列A=（ a_i,j）は以下のように定義されます

ここでS_nのすべての順列の集合を表します[1, n]。

例として（wikiから）：

この質問では、行列はすべて正方形で、値-1と1その中のみを持ちます。

例

入力：

[[ 1 -1 -1  1]
 [-1 -1 -1  1]
 [-1  1 -1  1]
 [ 1 -1 -1  1]]

永久：

-4

入力：

[[-1 -1 -1 -1]
 [-1  1 -1 -1]
 [ 1 -1 -1 -1]
 [ 1 -1  1 -1]]

永久：

入力：

[[ 1 -1  1 -1 -1 -1 -1 -1]
 [-1 -1  1  1 -1  1  1 -1]
 [ 1 -1 -1 -1 -1  1  1  1]
 [-1 -1 -1  1 -1  1  1  1]
 [ 1 -1 -1  1  1  1  1 -1]
 [-1  1 -1  1 -1  1  1 -1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [-1 -1  1 -1  1  1  1  1]]

永久：

入力：

[[1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1],
 [1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, 1, -1],
 [-1, -1, 1, 1, 1, -1, -1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1, -1],
 [-1, -1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 1, -1],
 [-1, 1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, 1],
 [1, -1, 1, 1, -1, -1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, 1, -1, -1, -1],
 [1, -1, -1, 1, -1, -1, -1, 1, -1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1],
 [1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 1],
 [1, -1, -1, -1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1],
 [-1, -1, 1, -1, 1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1],
 [-1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1],
 [1, 1, -1, -1, -1, 1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1],
 [-1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, -1, 1],
 [1, 1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, 1, 1, -1, 1, -1, 1],
 [1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 1],
 [1, -1, -1, 1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, -1, -1],
 [-1, 1, 1, 1, -1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1],
 [1, 1, -1, -1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, 1, 1, -1, 1, -1, 1],
 [1, 1, 1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, 1, -1, -1, -1, -1, 1],
 [-1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1, 1, -1, -1]]

永久：

1021509632

タスク

nby nマトリックスを指定して、そのパーマネントを出力するコードを作成する必要があります。

コードをテストする必要があるので、たとえば標準入力から読み込むなど、コードへの入力としてマトリックスを提供する簡単な方法を提供していただければ助かります。

パーマネントは大きくなる可能性があることに注意してください（すべて1の行列は極端な場合です）。

スコアとタイ

増加するサイズのランダム+ -1行列でコードをテストし、コンピューターでコードが1分以上かかったときに停止します。スコアリングマトリックスは、公平性を確保するために、すべての提出物で一貫しています。

2人が同じスコアを獲得した場合、勝者はの値に対して最速のものですn。それらが互いに1秒以内にある場合は、最初に投稿されたものです。

言語とライブラリ

利用可能な任意の言語とライブラリを使用できますが、パーマネントを計算するための既存の関数は使用できません。可能であれば、コードを実行できるとよいので、可能な限りLinuxでコードを実行/コンパイルする方法の完全な説明を含めてください。

参照実装

小さな行列のパーマネントを計算するためのさまざまな言語のコードがたくさんあるcodegolfの質問が既にあります。MathematicaとMapleには、それらにアクセスできれば永続的な実装もあります。

マイマシンタイミングは64ビットマシンで実行されます。これは、8GB RAM、AMD FX-8350 8コアプロセッサ、およびRadeon HD 4250を備えた標準のUbuntuインストールです。これは、コードを実行できる必要があることも意味します。

私のマシンに関する低レベルの情報

cat /proc/cpuinfo/|grep flags 与える

フラグ：fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc ap_c_sq sq rep_good nopl nonstop_tsc aq f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_ncs nfs cpb hw_pstate vmmcall npf ns

Scala、Nim、Julia、Rust、Bashの愛好家が自分の言語を誇示することができるように、大きなInt問題に悩まされない、密接に関連する多言語の質問をします。

リーダーボード

n = 33（45秒。n= 34の場合は64秒）。g ++ 5.4.0を使用したC ++のTon Hospel。
n = 32（32秒）。デニスでCトンHospelのgccのフラグを使用してgccの5.4.0で。
n = 31（54秒）。クリスチャンのSieversでハスケル
n = 31（60秒）。プリモでrpython
n = 30（26秒）。ezrastで錆
n = 28（49秒）。XNORとのPython + pypy 5.4.1
n = 22（25秒）。シェバングとのPython + pypy 5.4.1

注。実際には、デニスとトンホスペルのタイミングは神秘的な理由で大きく異なります。例えば、ウェブブラウザをロードした後、それらはより速くなるようです！引用されたタイミングは、私が行ったすべてのテストで最速です。

math fastest-code matrix

— セルギオール
ソース

5

私は最初の文を読み、「Lembik」と考え、下にスクロールしました、うん-Lembik。

— orlp

@orlp :)久しぶりです。

1

@Lembik大規模なテストケースを追加しました。誰かが確認するのを待っています。

— XNOR

2

答えの1つは、パーマネントを格納するために倍精度浮動小数点数を使用するため、おおよその結果を出力します。それは許されますか？

— デニス

1

...私はサインして、いくつかの魔法を行うことができるかもしれないと思ったが、それが出てパンなかった@ChristianSievers

— ソクラテスのフェニックス

13

gcc C ++ n≈36（システムでは57秒）

すべての列の合計が偶数の場合、更新にグレイコードを使用したGlynn式を使用します。それ以外の場合は、Ryserの方法を使用します。スレッド化およびベクトル化。AVX向けに最適化されているため、古いプロセッサではあまり期待しないでください。n>=35符号付き128ビットアキュムレータがオーバーフローするため、システムが十分に高速であっても、+ 1のみの行列を気にしないでください。ランダム行列の場合、おそらくオーバーフローは発生しません。以下のためにn>=37内部乗数すべてのためのオーバーフローを開始します1/-1行列。そのため、このプログラムのみを使用してくださいn<=36。

任意の種類の空白で区切られたSTDIN上の行列要素を与えるだけです

permanent
1 2
3 4
^D

permanent.cpp：

/*
  Compile using something like:
    g++ -Wall -O3 -march=native -fstrict-aliasing -std=c++11 -pthread -s permanent.cpp -o permanent
*/

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cstdint>
#include <climits>
#include <array>
#include <vector>
#include <thread>
#include <future>
#include <ctgmath>
#include <immintrin.h>

using namespace std;

bool const DEBUG = false;
int const CACHE = 64;

using Index  = int_fast32_t;
Index glynn;
// Number of elements in our vectors
Index const POW   = 3;
Index const ELEMS = 1 << POW;
// Over how many floats we distribute each row
Index const WIDTH = 9;
// Number of bits in the fraction part of a floating point number
int const FLOAT_MANTISSA = 23;
// Type to use for the first add/multiply phase
using Sum  = float;
using SumN = __restrict__ Sum __attribute__((vector_size(ELEMS*sizeof(Sum))));
// Type to convert to between the first and second phase
using ProdN = __restrict__ int32_t __attribute__((vector_size(ELEMS*sizeof(int32_t))));
// Type to use for the third and last multiply phase.
// Also used for the final accumulator
using Value = __int128;
using UValue = unsigned __int128;

// Wrap Value so C++ doesn't really see it and we can put it in vectors etc.
// Needed since C++ doesn't fully support __int128
struct Number {
    Number& operator+=(Number const& right) {
        value += right.value;
        return *this;
    }
    // Output the value
    void print(ostream& os, bool dbl = false) const;
    friend ostream& operator<<(ostream& os, Number const& number) {
        number.print(os);
        return os;
    }

    Value value;
};

using ms = chrono::milliseconds;

auto nr_threads = thread::hardware_concurrency();
vector<Sum> input;

// Allocate cache aligned datastructures
template<typename T>
T* alloc(size_t n) {
    T* mem = static_cast<T*>(aligned_alloc(CACHE, sizeof(T) * n));
    if (mem == nullptr) throw(bad_alloc());
    return mem;
}

// Work assigned to thread k of nr_threads threads
Number permanent_part(Index n, Index k, SumN** more) {
    uint64_t loops = (UINT64_C(1) << n) / nr_threads;
    if (glynn) loops /= 2;
    Index l = loops < ELEMS ? loops : ELEMS;
    loops /= l;
    auto from = loops * k;
    auto to   = loops * (k+1);

    if (DEBUG) cout << "From=" << from << "\n";
    uint64_t old_gray = from ^ from/2;
    uint64_t bit = 1;
    bool bits = (to-from) & 1;

    Index nn = (n+WIDTH-1)/WIDTH;
    Index ww = nn * WIDTH;
    auto column = alloc<SumN>(ww);
    for (Index i=0; i<n; ++i)
        for (Index j=0; j<ELEMS; ++j) column[i][j] = 0;
    for (Index i=n; i<ww; ++i)
        for (Index j=0; j<ELEMS; ++j) column[i][j] = 1;
    Index b;
    if (glynn) {
        b = n > POW+1 ? n - POW - 1: 0;
        auto c = n-1-b;
        for (Index k=0; k<l; k++) {
            Index gray = k ^ k/2;
            for (Index j=0; j< c; ++j)
                if (gray & 1 << j)
                    for (Index i=0; i<n; ++i)
                        column[i][k] -= input[(b+j)*n+i];
                else
                    for (Index i=0; i<n; ++i)
                        column[i][k] += input[(b+j)*n+i];
        }
        for (Index i=0; i<n; ++i)
            for (Index k=0; k<l; k++)
                column[i][k] += input[n*(n-1)+i];

        for (Index k=1; k<l; k+=2)
            column[0][k] = -column[0][k];

        for (Index i=0; i<b; ++i, bit <<= 1) {
            if (old_gray & bit) {
                bits = bits ^ 1;
                for (Index j=0; j<ww; ++j)
                    column[j] -= more[i][j];
            } else {
                for (Index j=0; j<ww; ++j)
                    column[j] += more[i][j];
            }
        }

        for (Index i=0; i<n; ++i)
            for (Index k=0; k<l; k++)
                column[i][k] /= 2;
    } else {
        b = n > POW ? n - POW : 0;
        auto c = n-b;
        for (Index k=0; k<l; k++) {
            Index gray = k ^ k/2;
            for (Index j=0; j<c; ++j)
                if (gray & 1 << j)
                    for (Index i=0; i<n; ++i)
                        column[i][k] -= input[(b+j)*n+i];
        }

        for (Index k=1; k<l; k+=2)
            column[0][k] = -column[0][k];

        for (Index i=0; i<b; ++i, bit <<= 1) {
            if (old_gray & bit) {
                bits = bits ^ 1;
                for (Index j=0; j<ww; ++j)
                    column[j] -= more[i][j];
            }
        }
    }

    if (DEBUG) {
        for (Index i=0; i<ww; ++i) {
            cout << "Column[" << i << "]=";
            for (Index j=0; j<ELEMS; ++j) cout << " " << column[i][j];
            cout << "\n";
        }
    }

    --more;
    old_gray = (from ^ from/2) | UINT64_C(1) << b;
    Value total = 0;
    SumN accu[WIDTH];
    for (auto p=from; p<to; ++p) {
        uint64_t new_gray = p ^ p/2;
        uint64_t bit = old_gray ^ new_gray;
        Index i = __builtin_ffsl(bit);
        auto diff = more[i];
        auto c = column;
        if (new_gray > old_gray) {
            // Phase 1 add/multiply.
            // Uses floats until just before loss of precision
            for (Index i=0; i<WIDTH; ++i) accu[i] = *c++ -= *diff++;

            for (Index j=1; j < nn; ++j)
                for (Index i=0; i<WIDTH; ++i) accu[i] *= *c++ -= *diff++;
        } else {
            // Phase 1 add/multiply.
            // Uses floats until just before loss of precision
            for (Index i=0; i<WIDTH; ++i) accu[i] = *c++ += *diff++;

            for (Index j=1; j < nn; ++j)
                for (Index i=0; i<WIDTH; ++i) accu[i] *= *c++ += *diff++;
        }

        if (DEBUG) {
            cout << "p=" << p << "\n";
            for (Index i=0; i<ww; ++i) {
                cout << "Column[" << i << "]=";
                for (Index j=0; j<ELEMS; ++j) cout << " " << column[i][j];
                cout << "\n";
            }
        }

        // Convert floats to int32_t
        ProdN prod32[WIDTH] __attribute__((aligned (32)));
        for (Index i=0; i<WIDTH; ++i)
            // Unfortunately gcc doesn't recognize the static_cast<int32_t>
            // as a vector pattern, so force it with an intrinsic
#ifdef __AVX__
            //prod32[i] = static_cast<ProdN>(accu[i]);
            reinterpret_cast<__m256i&>(prod32[i]) = _mm256_cvttps_epi32(accu[i]);
#else   // __AVX__
            for (Index j=0; j<ELEMS; ++j)
                prod32[i][j] = static_cast<int32_t>(accu[i][j]);
#endif  // __AVX__

        // Phase 2 multiply. Uses int64_t until just before overflow
        int64_t prod64[3][ELEMS];
        for (Index i=0; i<3; ++i) {
            for (Index j=0; j<ELEMS; ++j)
                prod64[i][j] = static_cast<int64_t>(prod32[i][j]) * prod32[i+3][j] * prod32[i+6][j];
        }
        // Phase 3 multiply. Collect into __int128. For large matrices this will
        // actually overflow but that's ok as long as all 128 low bits are
        // correct. Terms will cancel and the final sum can fit into 128 bits
        // (This will start to fail at n=35 for the all 1 matrix)
        // Strictly speaking this needs the -fwrapv gcc option
        for (Index j=0; j<ELEMS; ++j) {
            auto value = static_cast<Value>(prod64[0][j]) * prod64[1][j] * prod64[2][j];
            if (DEBUG) cout << "value[" << j << "]=" << static_cast<double>(value) << "\n";
            total += value;
        }
        total = -total;

        old_gray = new_gray;
    }

    return bits ? Number{-total} : Number{total};
}

// Prepare datastructures, Assign work to threads
Number permanent(Index n) {
    Index nn = (n+WIDTH-1)/WIDTH;
    Index ww = nn*WIDTH;

    Index rows  = n > (POW+glynn) ? n-POW-glynn : 0;
    auto data = alloc<SumN>(ww*(rows+1));
    auto pointers = alloc<SumN *>(rows+1);
    auto more = &pointers[0];
    for (Index i=0; i<rows; ++i)
        more[i] = &data[ww*i];
    more[rows] = &data[ww*rows];
    for (Index j=0; j<ww; ++j)
        for (Index i=0; i<ELEMS; ++i)
            more[rows][j][i] = 0;

    Index loops = n >= POW+glynn ? ELEMS : 1 << (n-glynn);
    auto a = &input[0];
    for (Index r=0; r<rows; ++r) {
        for (Index j=0; j<n; ++j) {
            for (Index i=0; i<loops; ++i)
                more[r][j][i] = j == 0 && i %2 ? -*a : *a;
            for (Index i=loops; i<ELEMS; ++i)
                more[r][j][i] = 0;
            ++a;
        }
        for (Index j=n; j<ww; ++j)
            for (Index i=0; i<ELEMS; ++i)
                more[r][j][i] = 0;
    }

    if (DEBUG)
        for (Index r=0; r<=rows; ++r)
            for (Index j=0; j<ww; ++j) {
                cout << "more[" << r << "][" << j << "]=";
                for (Index i=0; i<ELEMS; ++i)
                    cout << " " << more[r][j][i];
                cout << "\n";
            }

    // Send work to threads...
    vector<future<Number>> results;
    for (auto i=1U; i < nr_threads; ++i)
        results.emplace_back(async(DEBUG ? launch::deferred: launch::async, permanent_part, n, i, more));
    // And collect results
    auto r = permanent_part(n, 0, more);
    for (auto& result: results)
        r += result.get();

    free(data);
    free(pointers);

    // For glynn we should double the result, but we will only do this during
    // the final print. This allows n=34 for an all 1 matrix to work
    // if (glynn) r *= 2;
    return r;
}

// Print 128 bit number
void Number::print(ostream& os, bool dbl) const {
    const UValue BILLION = 1000000000;

    UValue val;
    if (value < 0) {
        os << "-";
        val = -value;
    } else
        val = value;
    if (dbl) val *= 2;

    uint32_t output[5];
    for (int i=0; i<5; ++i) {
        output[i] = val % BILLION;
        val /= BILLION;
    }
    bool print = false;
    for (int i=4; i>=0; --i) {
        if (print) {
            os << setfill('0') << setw(9) << output[i];
        } else if (output[i] || i == 0) {
            print = true;
            os << output[i];
        }
    }
}

// Read matrix, check for sanity
void my_main() {
    Sum a;
    while (cin >> a)
        input.push_back(a);

    size_t n = sqrt(input.size());
    if (input.size() != n*n)
        throw(logic_error("Read " + to_string(input.size()) +
                          " elements which does not make a square matrix"));

    vector<double> columns_pos(n, 0);
    vector<double> columns_neg(n, 0);
    Sum *p = &input[0];
    for (size_t i=0; i<n; ++i)
        for (size_t j=0; j<n; ++j, ++p) {
            if (*p >= 0) columns_pos[j] += *p;
            else         columns_neg[j] -= *p;
        }
    std::array<double,WIDTH> prod;
    prod.fill(1);

    int32_t odd = 0;
    for (size_t j=0; j<n; ++j) {
        prod[j%WIDTH] *= max(columns_pos[j], columns_neg[j]);
        auto sum = static_cast<int32_t>(columns_pos[j] - columns_neg[j]);
        odd |= sum;
    }
    glynn = (odd & 1) ^ 1;
    for (Index i=0; i<WIDTH; ++i)
        // A float has an implicit 1. in front of the fraction so it can
        // represent 1 bit more than the mantissa size. And 1 << (mantissa+1)
        // itself is in fact representable
        if (prod[i] && log2(prod[i]) > FLOAT_MANTISSA+1)
            throw(range_error("Values in matrix are too large. A subproduct reaches " + to_string(prod[i]) + " which doesn't fit in a float without loss of precision"));

    for (Index i=0; i<3; ++i) {
        auto prod3 = prod[i] * prod[i+3] * prod[i+6];
        if (log2(prod3) >= CHAR_BIT*sizeof(int64_t)-1)
            throw(range_error("Values in matrix are too large. A subproduct reaches " + to_string(prod3) + " which doesn't fit in an int64"));
    }

    nr_threads = pow(2, ceil(log2(static_cast<float>(nr_threads))));
    uint64_t loops = UINT64_C(1) << n;
    if (glynn) loops /= 2;
    if (nr_threads * ELEMS > loops)
        nr_threads = max(loops / ELEMS, UINT64_C(1));
    // if (DEBUG) nr_threads = 1;

    cout << n << " x " << n << " matrix, method " << (glynn ? "Glynn" : "Ryser") << ", " << nr_threads << " threads" << endl;

    // Go for the actual calculation
    auto start = chrono::steady_clock::now();
    auto perm = permanent(n);
    auto end = chrono::steady_clock::now();
    auto elapsed = chrono::duration_cast<ms>(end-start).count();

    cout << "Permanent=";
    perm.print(cout, glynn);
    cout << " (" << elapsed / 1000. << " s)" << endl;
}

// Wrapper to print any exceptions
int main() {
    try {
        my_main();
    } catch(exception& e) {
        cerr << "Error: " << e.what() << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

— トン・ホスペル
ソース

フラグ：fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc ap_cq aq c_a f16c lahf_lm cmp_legacy SVM extapic cr8_legacy ABM SSE4A misalignsse 3dnowprefetch osvw IBS XOP SKINIT WDT LWP fma4 TCE nodeid_msr TBM topoext perfctr_core perfctr_nb CPB hw_pstate vmmcall BMI1 ARAT NPT lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

私はまだテストハーネスをデバッグしてコードを実行していますが、非常に高速に見えます、ありがとう！大きなintサイズが速度の問題を引き起こしているのではないかと思っていました（お勧め）。accu.org/index.php/articles/1849に興味がある場合は見ました。

テストハーネスでの使用が非常に困難になったため、quick_exitを削除するためにコードを修正する必要がありました。興味深いことに、wikiが他の1つは2倍高速であるべきだと主張しているように見えるのに、なぜRyserの式を使用しているのですか

@Lembik Ryserの式に切り替えました。他の式2 << (n-1)では、最後にスケールバックする必要があるため、int128アキュムレータがそのポイントのはるか前にオーバーフローしたことを意味します。

— トンホスペル

1

@Lembikはい:-)

— トンホスペル

7

C99、n≈33（35秒）

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE 12
#define NUM_THREADS 8

#define popcnt __builtin_popcountll
#define BILLION (1000 * 1000 * 1000)
#define UPDATE_ROW_PPROD() \
    update_row_pprod(row_pprod, row, rows, row_sums, mask, mask_popcnt)

typedef __int128 int128_t;

static inline int64_t update_row_pprod
(
    int64_t* row_pprod, int64_t row, int64_t* rows,
    int64_t* row_sums, int64_t mask, int64_t mask_popcnt
)
{
    int64_t temp = 2 * popcnt(rows[row] & mask) - mask_popcnt;

    row_pprod[0] *= temp;
    temp -= 1;
    row_pprod[1] *= temp;
    temp -= row_sums[row];
    row_pprod[2] *= temp;
    temp += 1;
    row_pprod[3] *= temp;

    return row + 1;
}

int main(int argc, char* argv[])
{
    int64_t size = argc - 1, rows[argc - 1];
    int64_t row_sums[argc - 1];
    int128_t permanent = 0, sign = size & 1 ? -1 : 1;

    if (argc == 2)
    {
        printf("%d\n", argv[1][0] == '-' ? -1 : 1);
        return 0;
    }

    for (int64_t row = 0; row < size; row++)
    {
        char positive = argv[row + 1][0] == '+' ? '-' : '+';

        sign *= ',' - positive;
        rows[row] = row_sums[row] = 0;

        for (char* p = &argv[row + 1][1]; *p; p++)
        {
            rows[row] <<= 1;
            rows[row] |= *p == positive;
            row_sums[row] += *p == positive;
        }

        row_sums[row] = 2 * row_sums[row] - size;
    }

    #pragma omp parallel for reduction(+:permanent) num_threads(NUM_THREADS)
    for (int64_t mask = 1; mask < 1LL << (size - 1); mask += 2)
    {
        int64_t mask_popcnt = popcnt(mask);
        int64_t row = 0;
        int128_t row_prod = 1 - 2 * (mask_popcnt & 1);
        int128_t row_prod_high = -row_prod;
        int128_t row_prod_inv = row_prod;
        int128_t row_prod_inv_high = -row_prod;

        for (int64_t chunk = 0; chunk < size / CHUNK_SIZE; chunk++)
        {
            int64_t row_pprod[4] = {1, 1, 1, 1};

            for (int64_t i = 0; i < CHUNK_SIZE; i++)
                row = UPDATE_ROW_PPROD();

            row_prod *= row_pprod[0], row_prod_high *= row_pprod[1];
            row_prod_inv *= row_pprod[3], row_prod_inv_high *= row_pprod[2];
        }

        int64_t row_pprod[4] = {1, 1, 1, 1};

        while (row < size)
            row = UPDATE_ROW_PPROD();

        row_prod *= row_pprod[0], row_prod_high *= row_pprod[1];
        row_prod_inv *= row_pprod[3], row_prod_inv_high *= row_pprod[2];
        permanent += row_prod + row_prod_high + row_prod_inv + row_prod_inv_high;
    }

    permanent *= sign;

    if (permanent < 0)
        printf("-"), permanent *= -1;

    int32_t output[5], print = 0;

    output[0] = permanent % BILLION, permanent /= BILLION;
    output[1] = permanent % BILLION, permanent /= BILLION;
    output[2] = permanent % BILLION, permanent /= BILLION;
    output[3] = permanent % BILLION, permanent /= BILLION;
    output[4] = permanent % BILLION;

    if (output[4])
        printf("%u", output[4]), print = 1;
    if (print)
        printf("%09u", output[3]);
    else if (output[3])
        printf("%u", output[3]), print = 1;
    if (print)
        printf("%09u", output[2]);
    else if (output[2])
        printf("%u", output[2]), print = 1;
    if (print)
        printf("%09u", output[1]);
    else if (output[1])
        printf("%u", output[1]), print = 1;
    if (print)
        printf("%09u\n", output[0]);
    else
        printf("%u\n", output[0]);
}

現在、入力は少し面倒です。コマンド行引数として行を使用します。各エントリは符号で表されます。つまり、+は1を示し、-は-1を示します。

試運転

$ gcc -Wall -std=c99 -march=native -Ofast -fopenmp -fwrapv -o permanent permanent.c
$ ./permanent +--+ ---+ -+-+ +--+
-4
$ ./permanent ---- -+-- +--- +-+-
0
$ ./permanent +-+----- --++-++- +----+++ ---+-+++ +--++++- -+-+-++- +-+-+-+- --+-++++
192
$ ./permanent +-+--+++----++++-++- +-+++++-+--+++--+++- --+++----+-+++---+-- ---++-++++++------+- -+++-+++---+-+-+++++ +-++--+-++++-++-+--- +--+---+-++++---+++- +--+-++-+++-+-+++-++ +-----+++-----++-++- --+-+-++-+-++++++-++ -------+----++++---- ++---++--+-++-++++++ -++-----++++-----+-+ ++---+-+----+-++-+-+ +++++---+++-+-+++-++ +--+----+--++-+----- -+++-++--+++--++--++ ++--++-++-+++-++-+-+ +++---+--++---+----+ -+++-------++-++-+--
1021509632
$ time ./permanent +++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}     # 31
8222838654177922817725562880000000

real    0m8.365s
user    1m6.504s
sys     0m0.000s
$ time ./permanent ++++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}   # 32
263130836933693530167218012160000000

real    0m17.013s
user    2m15.226s
sys     0m0.001s
$ time ./permanent +++++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} # 33
8683317618811886495518194401280000000

real    0m34.592s
user    4m35.354s
sys     0m0.001s

— デニス
ソース

改善のためのアイデアはありますか？

— xnor

@xnorいくつか。SSEでパックされた乗算を試して、大きなループを部分的に展開します（並列化を高速化できるかどうかを確認し、呼び出しを行わずに4つ以上の値を一度に計算しますpopcnt）。時間を節約できるなら、次の大きなハードルは整数型です。ランダムに生成されたマトリックスの場合、パーマネントは比較的小さくなります。実際の計算を行う前に境界を計算する簡単な方法を見つけることができれば、全体を大きな条件でラップすることができます。

— デニス

@Dennisループの展開について、可能な限り小さな最適化は、一番上の行をすべて+1にすることです。

— -xnor

@xnorええ、私はいくつかの点であることを試してみましたが、その後は（うまくいかなかった何かをしようとする変更を元に戻し、すべてで）。ボトルネックは整数乗算（64ビットでは遅く、128では本当に遅い）のようです。そのため、SSEが少し役立つと思います。

— デニス

1

@Dennisなるほど。境界について、1つの非自明な境界は、演算子ノルム| Per（M）| <= | M | ^ nの観点からのものです。参照してくださいarxiv.org/pdf/1606.07474v1.pdf

— XNOR

5

Python 2、n≈28

from operator import mul

def fast_glynn_perm(M):
    row_comb = [sum(c) for c in zip(*M)]
    n=len(M)

    total = 0
    old_grey = 0 
    sign = +1

    binary_power_dict = {2**i:i for i in range(n)}
    num_loops = 2**(n-1)

    for bin_index in xrange(1, num_loops + 1):  
        total += sign * reduce(mul, row_comb)

        new_grey = bin_index^(bin_index/2)
        grey_diff = old_grey ^ new_grey
        grey_diff_index = binary_power_dict[grey_diff]

        new_vector = M[grey_diff_index]
        direction = 2 * cmp(old_grey,new_grey)      

        for i in range(n):
            row_comb[i] += new_vector[i] * direction

        sign = -sign
        old_grey = new_grey

    return total/num_loops

更新には、Glynn式とグレーコードを使用します。n=23私のマシンで1分以内に実行されます。これをより高速な言語で、より良いデータ構造で確実に実装することができます。これは、マトリックスが±1値であることを使用しません。

Ryser式の実装は非常によく似ており、±1ベクトルではなく、係数のすべての0/1ベクトルを合計します。Glynnの式の約2倍の時間がかかるのは、2 ^ n個のそのようなベクトルをすべて追加するため+1です。

from operator import mul

def fast_ryser_perm(M):
    n=len(M)
    row_comb = [0] * n

    total = 0
    old_grey = 0 
    sign = +1

    binary_power_dict = {2**i:i for i in range(n)}
    num_loops = 2**n

    for bin_index in range(1, num_loops) + [0]: 
        total += sign * reduce(mul, row_comb)

        new_grey = bin_index^(bin_index/2)
        grey_diff = old_grey ^ new_grey
        grey_diff_index = binary_power_dict[grey_diff]

        new_vector = M[grey_diff_index]
        direction = cmp(old_grey, new_grey)

        for i in range(n):
            row_comb[i] += new_vector[i] * direction

        sign = -sign
        old_grey = new_grey

    return total * (-1)**n

— xnor
ソース

驚くばかり。あなたもテストするためにpypyを持っていますか？

@Lembikいいえ、あまりインストールしていません。

— xnor

私もそれをテストするときにpypyを使用します。他の高速式を実装する方法がわかりますか？わかりにくいです。

@Lembik他の高速式は何ですか？

— XNOR

1

参考として、pypyこれを搭載したマシンではn=28、44.6秒で簡単に計算できました。Lembikのシステムは、少し速くないとしても、速度の点ではかなり似ているようです。

— ケード

4

錆+ extprim

グレーコードを実装したこの簡単なRyserは、ラップトップでn = 31を実行するのに約65 90秒かかります。~~私はあなたのマシンが60代未満でそこに到達すると思います。~~私はextprim 1.1.1を使用していi128ます。

Rustを使用したことがなく、自分が何をしているのかわかりません。何でもする以外のコンパイラオプションはありませんcargo build --release。コメント/提案/最適化を歓迎します。

呼び出しは、デニスのプログラムと同じです。

use std::env;
use std::thread;
use std::sync::Arc;
use std::sync::mpsc;

extern crate extprim;
use extprim::i128::i128;

static THREADS : i64 = 8; // keep this a power of 2.

fn main() {
  // Read command line args for the matrix, specified like
  // "++- --- -+-" for [[1, 1, -1], [-1, -1, -1], [-1, 1, -1]].
  let mut args = env::args();
  args.next();

  let mat : Arc<Vec<Vec<i64>>> = Arc::new(args.map( |ss|
    ss.trim().bytes().map( |cc| if cc == b'+' {1} else {-1}).collect()
  ).collect());

  // Figure how many iterations each thread has to do.
  let size = 2i64.pow(mat.len() as u32);
  let slice_size = size / THREADS; // Assumes divisibility.

  let mut accumulator : i128;
  if slice_size >= 4 { // permanent() requires 4 divides slice_size
    let (tx, rx) = mpsc::channel();

    // Launch threads.
    for ii in 0..THREADS {
      let mat = mat.clone();
      let tx = tx.clone();
      thread::spawn(move ||
        tx.send(permanent(&mat, ii * slice_size, (ii+1) * slice_size))
      );
    }

    // Accumulate results.
    accumulator = extprim::i128::ZERO;
    for _ in 0..THREADS {
      accumulator += rx.recv().unwrap();
    }
  }
  else { // Small matrix, don't bother threading.
    accumulator = permanent(&mat, 0, size);
  }
  println!("{}", accumulator);
}

fn permanent(mat: &Vec<Vec<i64>>, start: i64, end: i64) -> i128 {
  let size = mat.len();
  let sentinel = std::i64::MAX / size as i64;

  let mut bits : Vec<bool> = Vec::with_capacity(size);
  let mut sums : Vec<i64> = Vec::with_capacity(size);

  // Initialize gray code bits.
  let gray_number = start ^ (start / 2);

  for row in 0..size {
    bits.push((gray_number >> row) % 2 == 1);
    sums.push(0);
  }

  // Initialize column sums
  for row in 0..size {
    if bits[row] {
      for column in 0..size {
        sums[column] += mat[row][column];
      }
    }
  }

  // Do first two iterations with initial sums
  let mut total = product(&sums, sentinel);
  for column in 0..size {
    sums[column] += mat[0][column];
  }
  bits[0] = true;

  total -= product(&sums, sentinel);

  // Do rest of iterations updating gray code bits incrementally
  let mut gray_bit : usize;
  let mut idx = start + 2;
  while idx < end {
    gray_bit = idx.trailing_zeros() as usize;

    if bits[gray_bit] {
      for column in 0..size {
        sums[column] -= mat[gray_bit][column];
      }
      bits[gray_bit] = false;
    }
    else {
      for column in 0..size {
        sums[column] += mat[gray_bit][column];
      }
      bits[gray_bit] = true;
    }

    total += product(&sums, sentinel);

    if bits[0] {
      for column in 0..size {
        sums[column] -= mat[0][column];
      }
      bits[0] = false;
    }
    else {
      for column in 0..size {
        sums[column] += mat[0][column];
      }
      bits[0] = true;
    }

    total -= product(&sums, sentinel);
    idx += 2;
  }
  return if size % 2 == 0 {total} else {-total};
}

#[inline]
fn product(sums : &Vec<i64>, sentinel : i64) -> i128 {
  let mut ret : Option<i128> = None;
  let mut tally = sums[0];
  for ii in 1..sums.len() {
    if tally.abs() >= sentinel {
      ret = Some(ret.map_or(i128::new(tally), |n| n * i128::new(tally)));
      tally = sums[ii];
    }
    else {
      tally *= sums[ii];
    }
  }
  if ret.is_none() {
    return i128::new(tally);
  }
  return ret.unwrap() * i128::new(tally);
}

— エズラスト
ソース

extprimをインストールしてコードをコンパイルするためのコピーおよび貼り付け可能なコマンドラインを教えてください。

出力は「i128！（-2）」のようになります。-2が正解です。これは予想されるもので、パーマネントを出力するためだけに変更できますか？

1

@Lembik：出力を修正する必要があります。あなたはコンパイルを理解したように見えgit clone https://gitlab.com/ezrast/permanent.git; cd permanent; cargo build --releaseますが、私と同じセットアップを確実にしたい場合にできるように、Gitでそれを投げました。貨物は依存関係を処理します。バイナリが入りtarget/releaseます。

— ezrast

残念ながら、これはn = 29に対して間違った答えを与えます。bpaste.net

1

@Lembik gah、申し訳ありませんが、中間値は思ったより早くあふれていました。これは修正されましたが、プログラムは現在かなり遅くなっています。

— ezrast

4

ハスケル、n = 31（54s）

@Angsによる貴重な貢献が多数あります：use Vector、短絡回路製品を使用し、奇数nを見てください。

import Control.Parallel.Strategies
import qualified Data.Vector.Unboxed as V
import Data.Int

type Row = V.Vector Int8

x :: Row -> [Row] -> Integer -> Int -> Integer
x p (v:vs) m c = let c' = c - 1
                     r = if c>0 then parTuple2 rseq rseq else r0
                     (a,b) = ( x p                  vs m    c' ,
                               x (V.zipWith(-) p v) vs (-m) c' )
                             `using` r
                 in a+b
x p _      m _ = prod m p

prod :: Integer -> Row -> Integer
prod m p = if 0 `V.elem` p then 0 
                           else V.foldl' (\a b->a*fromIntegral b) m p

p, pt :: [Row] -> Integer
p (v:vs) = x (foldl (V.zipWith (+)) v vs) (map (V.map (2*)) vs) 1 11
           `div` 2^(length vs)
p [] = 1 -- handle 0x0 matrices too  :-)

pt (v:vs) | even (length vs) = p ((V.map (2*) v) : vs ) `div` 2
pt mat                       = p mat

main = getContents >>= print . pt . map V.fromList . read

Haskellでの並列処理の最初の試み。改訂履歴を通じて多くの最適化手順を確認できます。驚くべきことに、それはほとんど非常に小さな変更でした。コードは、パーマネントの計算に関するウィキペディアの記事の「Balasubramanian-Bax / Franklin-Glynn公式」セクションの公式に基づいています。

pパーマネントを計算します。経由ptで呼び出され、常に有効な方法でマトリックスを変換しますが、ここで得られるマトリックスに特に役立ちます。

でコンパイルしghc -O2 -threaded -fllvm -feager-blackholing -o <name> <name>.hsます。並列化で実行するには、次のようなランタイムパラメータを指定します./<name> +RTS -N。入力は[[1,2],[3,4]]、最後の例のように括弧で囲まれたネストされたコンマ区切りのリストを含むstdinからのものです（どこでも改行が許可されます）。

— クリスチャン・シーヴァーズ
ソース

1

を差し込むことで20-25％の速度改善を得ることができましたData.Vector。変更された関数型を除く変更：import qualified Data.Vector as V、x (V.zipWith(-) p v) vs (-m) c' )、p (v:vs) = x (foldl (V.zipWith (+)) v vs) (map (V.map (2*)) vs) 1 11、main = getContents >>= print . p . map V.fromList . read

— Angs

1

@Angsどうもありがとう！私は、より適切なデータ型を検討する気がしませんでした。少し変更する必要がある（また使用する必要があったV.product）のは驚くべきことです。それは私に〜10％しか与えませんでした。ベクトルにInts のみが含まれるようにコードを変更しました。加算されるだけなので、それは大丈夫です。大きな数字は乗算に由来します。それから約20％でした。私は古いコードで同じ変更を試みましたが、そのときは遅くなりました。ボックス化されていないベクターを使用できるので、もう一度試しました。

— クリスチャンシーバーズ

1

@ christian-sievers glab私は助けになるかもしれません。ここに私が見つけた別の面白い幸運ベースの最適化があります：x p _ m _ = m * (sum $ V.foldM' (\a b -> if b==0 then Nothing else Just $ a*fromIntegral b) 1 p)-0が特別なケースであるモナド倍としての積頻繁に有益であるようです。

— アンス

1

@Angs素晴らしい！Debianの安定版などのghcを必要としない形式に変更しましたTransversable（producteatlierを変更しないことは間違いではありません...）。入力の形式を使用していますが、それは問題ないようです。私たちはそれに依存せず、最適化するだけです。タイミングがずっとエキサイティングになります。ランダムな30x30マトリックスは29x29よりわずかに高速ですが、31x31は4x時間かかります。-そのINLINEは私には機能しないようです。私の知る限り、再帰関数では無視されます。

— クリスチャンシーバーズ

1

@ christian-sieversええ、私はそれについて何かを言おうとしていましたproduct が、忘れていました。偶数の長さだけにゼロがあるように見えるpので、奇数の長さの場合は、短絡の代わりに通常の製品を使用して両方の長所を得る必要があります。

— アンズ

3

Mathematica、n≈20

p[m_] := Last[Fold[Take[ListConvolve[##, {1, -1}, 0], 2^Length[m]]&,
  Table[If[IntegerQ[Log2[k]], m[[j, Log2[k] + 1]], 0], {j, n}, {k, 0, 2^Length[m] - 1}]]]

このTimingコマンドを使用すると、20x20マトリックスはシステム上で約48秒かかります。これは、パーマネントが行列の各行からの多項式の積の係数として見つけることができるという事実に依存しているため、他のアルゴリズムほど正確ではありません。係数リストを作成し、を使用して畳み込みを実行することにより、効率的な多項式乗算が実行されますListConvolve。これには、高速フーリエ変換またはO（n log n）時間を必要とする同様の方法を使用して畳み込みが実行されると仮定すると、約O（2 ⁿ n ²）時間を必要とします。

— マイル
ソース

3

Python 2、n = 22 [参照]

これは私が昨日Lembikと共有した「リファレンス」実装n=23であり、彼のマシンでは数秒で実現できず、私のマシンでは約52秒で実現します。これらの速度を実現するには、PyPyでこれを実行する必要があります。

最初の関数は、基本ルールを適用できる2x2が残るまで各サブマトリックスを調べることにより、行列式の計算方法と同様にパーマネントを計算します。それは信じられないほど遅いです。

2番目の関数は、Ryser関数（ウィキペディアにリストされている2番目の式）を実装する関数です。このセットSは、本質的に数値の累乗です{1,...,n}（s_listコード内の変数）。

from random import *
from time import time
from itertools import*

def perm(a): # naive method, recurses over submatrices, slow 
    if len(a) == 1:
        return a[0][0]
    elif len(a) == 2:
        return a[0][0]*a[1][1]+a[1][0]*a[0][1]
    else:
        tsum = 0
        for i in xrange(len(a)):
            transposed = [zip(*a)[j] for j in xrange(len(a)) if j != i]
            tsum += a[0][i] * perm(zip(*transposed)[1:])
        return tsum

def perm_ryser(a): # Ryser's formula, using matrix entries
    maxn = len(a)
    n_list = range(1,maxn+1)
    s_list = chain.from_iterable(combinations(n_list,i) for i in range(maxn+1))
    total = 0
    for st in s_list:
        stotal = (-1)**len(st)
        for i in xrange(maxn):
            stotal *= sum(a[i][j-1] for j in st)
        total += stotal
    return total*((-1)**maxn)


def genmatrix(d):
    mat = []
    for x in xrange(d):
        row = []
        for y in xrange(d):
            row.append([-1,1][randrange(0,2)])
        mat.append(row)
    return mat

def main():
    for i in xrange(1,24):
        k = genmatrix(i)
        print 'Matrix: (%dx%d)'%(i,i)
        print '\n'.join('['+', '.join(`j`.rjust(2) for j in a)+']' for a in k)
        print 'Permanent:',
        t = time()
        p = perm_ryser(k)
        print p,'(took',time()-t,'seconds)'

if __name__ == '__main__':
    main()

— カデ
ソース

「行列式の計算方法に似た」説明を言い換えるべきだと思います。行列式の方法がパーマネントにとって遅いというわけではありませんが、行列式の遅いメソッドの1つは、パーマネントに対して同様に（そしてゆっくりと）動作します。

— クリスチャンシーバーズ

1

@ChristianSievers良い点、私はそれを変更しました。

— ケード

2

RPython 5.4.1、n≈32（37秒）

from rpython.rlib.rtime import time
from rpython.rlib.rarithmetic import r_int, r_uint
from rpython.rlib.rrandom import Random
from rpython.rlib.rposix import pipe, close, read, write, fork, waitpid
from rpython.rlib.rbigint import rbigint

from math import log, ceil
from struct import pack

bitsize = len(pack('l', 1)) * 8 - 1

bitcounts = bytearray([0])
for i in range(16):
  b = bytearray([j+1 for j in bitcounts])
  bitcounts += b


def bitcount(n):
  bits = 0
  while n:
    bits += bitcounts[n & 65535]
    n >>= 16
  return bits


def main(argv):
  if len(argv) < 2:
    write(2, 'Usage: %s NUM_THREADS [N]'%argv[0])
    return 1
  threads = int(argv[1])

  if len(argv) > 2:
    n = int(argv[2])
    rnd = Random(r_uint(time()*1000))
    m = []
    for i in range(n):
      row = []
      for j in range(n):
        row.append(1 - r_int(rnd.genrand32() & 2))
      m.append(row)
  else:
    m = []
    strm = ""
    while True:
      buf = read(0, 4096)
      if len(buf) == 0:
        break
      strm += buf
    rows = strm.split("\n")
    for row in rows:
      r = []
      for val in row.split(' '):
        r.append(int(val))
      m.append(r)
    n = len(m)

  a = []
  for row in m:
    val = 0
    for v in row:
      val = (val << 1) | -(v >> 1)
    a.append(val)

  batches = int(ceil(n * log(n) / (bitsize * log(2))))

  pids = []
  handles = []
  total = rbigint.fromint(0)
  for i in range(threads):
    r, w = pipe()
    pid = fork()
    if pid:
      close(w)
      pids.append(pid)
      handles.append(r)
    else:
      close(r)
      total = run(n, a, i, threads, batches)
      write(w, total.str())
      close(w)
      return 0

  for pid in pids:
    waitpid(pid, 0)

  for handle in handles:
    strval = read(handle, 256)
    total = total.add(rbigint.fromdecimalstr(strval))
    close(handle)

  print total.rshift(n-1).str()

  return 0


def run(n, a, mynum, threads, batches):
  start = (1 << n-1) * mynum / threads
  end = (1 << n-1) * (mynum+1) / threads

  dtotal = rbigint.fromint(0)
  for delta in range(start, end):
    pdelta = rbigint.fromint(1 - ((bitcount(delta) & 1) << 1))
    for i in range(batches):
      pbatch = 1
      for j in range(i, n, batches):
        pbatch *= n - (bitcount(delta ^ a[j]) << 1)
      pdelta = pdelta.int_mul(pbatch)
    dtotal = dtotal.add(pdelta)

  return dtotal


def target(*args):
  return main

コンパイルするには、最新のPyPyソースをダウンロードして、次を実行します。

pypy /path/to/pypy-src/rpython/bin/rpython matrix-permanent.py

結果の実行可能ファイルはmatrix-permanent-c、現在の作業ディレクトリで名前が付けられるか、類似します。

PyPy 5.0の時点では、RPythonのスレッドプリミティブは以前よりもはるかに少ないプリミティブです。新しく生成されたスレッドにはGILが必要です。GILは、並列計算には多かれ少なかれ役に立ちません。fork代わりに使用したので、Windowsで期待どおりに動作しない場合がありますが、~~テスト~~し~~ていませ~~んがコンパイルに失敗します（unresolved external symbol _fork）。

実行可能ファイルは、最大2つのコマンドラインパラメーターを受け入れます。最初はスレッドの数で、2番目のオプションのパラメーターはnです。指定された場合、ランダム行列が生成され、そうでない場合は標準入力から読み込まれます。各行は改行で区切られ（末尾の改行なし）、各値スペースが区切られている必要があります。3番目の入力例は次のようになります。

1 -1 1 -1 -1 1 1 1 -1 -1 -1 -1 1 1 1 1 -1 1 1 -1
1 -1 1 1 1 1 1 -1 1 -1 -1 1 1 1 -1 -1 1 1 1 -1
-1 -1 1 1 1 -1 -1 -1 -1 1 -1 1 1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 1 -1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 -1
-1 1 1 1 -1 1 1 1 -1 -1 -1 1 -1 1 -1 1 1 1 1 1
1 -1 1 1 -1 -1 1 -1 1 1 1 1 -1 1 1 -1 1 -1 -1 -1
1 -1 -1 1 -1 -1 -1 1 -1 1 1 1 1 -1 -1 -1 1 1 1 -1
1 -1 -1 1 -1 1 1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1
1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1
-1 -1 1 -1 1 -1 1 1 -1 1 -1 1 1 1 1 1 1 -1 1 1
-1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1
1 1 -1 -1 -1 1 1 -1 -1 1 -1 1 1 -1 1 1 1 1 1 1
-1 1 1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1 -1 1 -1 1
1 1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 1 -1 1 1 -1 1 -1 1
1 1 1 1 1 -1 -1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1
1 -1 -1 1 -1 -1 -1 -1 1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1
-1 1 1 1 -1 1 1 -1 -1 1 1 1 -1 -1 1 1 -1 -1 1 1
1 1 -1 -1 1 1 -1 1 1 -1 1 1 1 -1 1 1 -1 1 -1 1
1 1 1 -1 -1 -1 1 -1 -1 1 1 -1 -1 -1 1 -1 -1 -1 -1 1
-1 1 1 1 -1 -1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1 1 -1 -1

サンプルの使用法

$ time ./matrix-permanent-c 8 30
8395059644858368

real    0m8.582s
user    1m8.656s
sys     0m0.000s

方法

実行時の複雑さをO（2 ⁿ n）に設定したBalasubramanian-Bax / Franklin-Glynn式を使用しました。ただし、δをグレーコードの順序で繰り返す代わりに、ベクトル行の乗算を1つのxor演算に置き換えました（マッピング（1、-1）→（0、1））。同様に、ベクトルの合計は、nからポップカウントの2倍を引くことにより、1回の操作で見つけることができます。

— プリモ
ソース

残念ながら、コードが間違った答え与えbpaste.net/show/8690251167e7

@Lembikが更新されました。好奇心から、次のコードの結果を教えてください。bpaste.net/show/76ec65e1b533

— primo

"True 18446744073709551615"が得られます。コードに非常に良い結果を追加しました。

@Lembikありがとう。63ビットがオーバーフローしないように、すでに乗算を分割していました。リストされた結果は8スレッドで取得されましたか？2または4は違いがありますか？25で30が終了した場合、31は1分未満であるようです。

— プリモ

-1

ラケット84バイト

次の簡単な関数は、小さな行列では機能しますが、大きな行列では私のマシンでハングします

(for/sum((p(permutations(range(length l)))))(for/product((k l)(c p))(list-ref k c)))

ゴルフをしていない：

(define (f ll) 
  (for/sum ((p (permutations (range (length ll))))) 
    (for/product ((l ll)(c p)) 
      (list-ref l c))))

コードは、行と列の数が等しくないように簡単に変更できます。

テスト：

(f '[[ 1 -1 -1  1]
     [-1 -1 -1  1]
     [-1  1 -1  1]
     [ 1 -1 -1  1]])

(f '[[ 1 -1  1 -1 -1 -1 -1 -1]
 [-1 -1  1  1 -1  1  1 -1]
 [ 1 -1 -1 -1 -1  1  1  1]
 [-1 -1 -1  1 -1  1  1  1]
 [ 1 -1 -1  1  1  1  1 -1]
 [-1  1 -1  1 -1  1  1 -1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [-1 -1  1 -1  1  1  1  1]])

出力：

-4
192

前述したように、次のテストでハングします。

(f '[[1 -1 1 -1 -1 1 1 1 -1 -1 -1 -1 1 1 1 1 -1 1 1 -1]
 [1 -1 1 1 1 1 1 -1 1 -1 -1 1 1 1 -1 -1 1 1 1 -1]
 [-1 -1 1 1 1 -1 -1 -1 -1 1 -1 1 1 1 -1 -1 -1 1 -1 -1]
 [-1 -1 -1 1 1 -1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 -1]
 [-1 1 1 1 -1 1 1 1 -1 -1 -1 1 -1 1 -1 1 1 1 1 1]
 [1 -1 1 1 -1 -1 1 -1 1 1 1 1 -1 1 1 -1 1 -1 -1 -1]
 [1 -1 -1 1 -1 -1 -1 1 -1 1 1 1 1 -1 -1 -1 1 1 1 -1]
 [1 -1 -1 1 -1 1 1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1]
 [1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1]
 [-1 -1 1 -1 1 -1 1 1 -1 1 -1 1 1 1 1 1 1 -1 1 1]
 [-1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1]
 [1 1 -1 -1 -1 1 1 -1 -1 1 -1 1 1 -1 1 1 1 1 1 1]
 [-1 1 1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1 -1 1 -1 1]
 [1 1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 1 -1 1 1 -1 1 -1 1]
 [1 1 1 1 1 -1 -1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1]
 [1 -1 -1 1 -1 -1 -1 -1 1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1]
 [-1 1 1 1 -1 1 1 -1 -1 1 1 1 -1 -1 1 1 -1 -1 1 1]
 [1 1 -1 -1 1 1 -1 1 1 -1 1 1 1 -1 1 1 -1 1 -1 1]
 [1 1 1 -1 -1 -1 1 -1 -1 1 1 -1 -1 -1 1 -1 -1 -1 -1 1]
 [-1 1 1 1 -1 -1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1 1 -1 -1]])

— rnso
ソース

3

この質問のスピード版よりもコードゴルフ版の方がこの答えは良いですか？

できるだけ早くパーマネントを計算します

gcc C ++ n≈36（システムでは57秒）

C99、n≈33（35秒）

試運転

Python 2、n≈28

錆+ extprim

ハスケル、n = 31（54s）

Mathematica、n≈20

Python 2、n = 22 [参照]

RPython 5.4.1、n≈32（37秒）

ラケット84バイト