17

You are given a list of $n$ bitvectors of width $k$. The goal is to return two bitvectors from the list that have no 1s in common, or else to report that no such pair exists.

For example, given $[00110, 01100, 11000]$, the only solution is $\{00110, 11000\}$. Alternatively, the input $[111, 011, 110, 101]$ has no solution. And any list that contains the all-zero bitvector $000...0$ and another element $e$ has the trivial solution $\{e, 000...0\}$.

Here is a slightly harder example with no solution (each row is a bitvector; black squares are 1s and white squares are 0s):

■ ■ ■ ■ □ □ □ □ □ □ □ □ □
■ □ □ □ ■ ■ ■ □ □ □ □ □ □ 
■ □ □ □ □ □ □ ■ ■ ■ □ □ □
■ □ □ □ □ □ □ □ □ □ ■ ■ ■
□ ■ □ □ □ ■ □ □ □ ■ ■ □ □
□ ■ □ □ ■ □ □ □ ■ □ □ □ ■
□ ■ □ □ □ □ ■ ■ □ □ □ ■ □ <-- All row pairs share a black square
□ □ ■ □ □ □ ■ □ ■ □ ■ □ □
□ □ ■ □ □ ■ □ ■ □ □ □ □ ■
□ □ ■ □ ■ □ □ □ □ ■ □ ■ □
□ □ □ ■ ■ □ □ ■ □ □ ■ □ □
□ □ □ ■ □ □ ■ □ □ ■ □ □ ■
□ □ □ ■ □ ■ □ □ ■ □ □ ■ □

How efficiently can two non-overlapping bitvectors be found, or be shown not to exist?

The naive algorithm, which compares every possible pair, takes $O(n^2 k)$ time. Can we do better?
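As a baseline, the naive pairwise comparison might look like the following minimal sketch. It assumes each bitvector is stored as a Python `int`, so the overlap test is a single bitwise AND; the function name `find_disjoint_pair_naive` is made up for illustration.

```python
from itertools import combinations

def find_disjoint_pair_naive(vectors):
    """Return indices (i, j) of two bitvectors that share no 1, or None."""
    for (i, a), (j, b) in combinations(enumerate(vectors), 2):
        if a & b == 0:  # no position where both vectors have a 1
            return i, j
    return None

vectors = [int(s, 2) for s in ["00110", "01100", "11000"]]
print(find_disjoint_pair_naive(vectors))  # (0, 2)
```

With machine-word or big-integer ANDs the factor of $k$ shrinks to roughly $k$ divided by the word width, but the quadratic dependence on $n$ remains.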

algorithms search-algorithms

— Craig Gidney

Possible reduction: you have a graph $G$ with one vertex for each vector and an edge between two vertices whenever the two corresponding vectors have a 1 in common. You want to know whether the graph diameter is $\geq 2$. But it seems hard to go faster than $O(n^2k)$.

— François

@FrançoisGodi A connected graph component with three nodes and a missing edge has diameter at least 2. With an adjacency list representation, checking for this takes $O(V)$ time.

— Craig Gidney

@Strilanc Sure, if there is no solution the graph is complete (which is clearer than diameter = 1), but computing the adjacency list representation could take a long time.

— François

Is $k$ smaller than the word width of your machine?

— Raphael

1

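@TomvanderZanden That sounds like it would violate invariants that the data structure relies on. In particular, that equality should be transitive. I've already been thinking about using a trie, and I don't see how to avoid a factor-of-2 blowup every time the query bitmask has a 0.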

— Craig Gidney

10

Warmup: random bitvectors

As a warmup, we can start with the case where each bitvector is chosen uniformly at random. Then it turns out that the problem can be solved in $O(n^{1.6} \min(k, \lg n))$ time (more precisely, the $1.6$ can be replaced with $\lg 3$).

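We'll consider the following two-set variant of the problem:

Given sets $S,T \subseteq \{0,1\}^k$ of bitvectors, determine whether there exists a non-overlapping pair $s \in S, t \in T$.

The basic technique for solving this is divide-and-conquer. Here is an $O(n^{1.6} k)$ time algorithm using divide-and-conquer: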

Split $S$ and $T$ based upon the first bit position. In other words, form $S_0 = \{s \in S : s_0=0\}$, $S_1 = \{s \in S : s_0 = 1\}$, $T_0 = \{t \in T : t_0 = 0\}$, $T_1 = \{t \in T : t_0 = 1\}$.
Now recursively look for a non-overlapping pair from $S_0,T_0$, from $S_0,T_1$, and from $S_1,T_0$. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
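The recursion can be sketched as follows, under the assumption that bitvectors are Python `int`s with bit 0 as the first position. The function name `find_pair_dc` and the simple base cases are illustrative choices, not part of the answer. The sketch recurses on $(S_0,T_0)$, $(S_0,T_1)$ and $(S_1,T_0)$, the three combinations that can still contain a non-overlapping pair, and skips $(S_1,T_1)$.

```python
def find_pair_dc(S, T, k, bit=0):
    """Search for s in S, t in T with s & t == 0 by splitting on bit positions."""
    if not S or not T:
        return None
    if len(S) == 1 or len(T) == 1:
        # One side is a single vector: a linear scan settles it.
        return next(((s, t) for s in S for t in T if s & t == 0), None)
    if bit == k:
        # Every position was split on without forcing a shared 1.
        return S[0], T[0]
    mask = 1 << bit
    S0 = [s for s in S if not s & mask]
    S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]
    T1 = [t for t in T if t & mask]
    # (S1, T1) is skipped: all of its pairs share a 1 at this position.
    for A, B in ((S0, T0), (S0, T1), (S1, T0)):
        found = find_pair_dc(A, B, k, bit + 1)
        if found is not None:
            return found
    return None
```

For the single-list problem one can call it with `S` and `T` both set to the input list; a vector is then only ever paired with itself if it is all zeros.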

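Since all bitvectors are chosen at random, we can expect $|S_b| \approx |S|/2$ and $|T_b| \approx |T|/2$. Thus, we have three recursive calls, and we've reduced the size of the problem by a factor of two (both sets shrink by a factor of two). After $\lg \min(|S|,|T|)$ splits, one of the two sets is down to size 1, and the problem can be solved in linear time. We get a recurrence relation along the lines of $T(n) = 3T(n/2) + O(nk)$, whose solution is $T(n) = O(n^{1.6} k)$. Accounting for the running time more precisely in the two-set case, we see that the running time is $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) k)$.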

This can be further improved by noting that if $k \ge 5\lg n+250$, then the probability that a non-overlapping pair exists is exponentially small. In particular, if $x,y$ are two random vectors, the probability that they're non-overlapping is $(3/4)^k$. If $|S|=|T|=n$, there are $n^2$ such pairs, so by a union bound, the probability that a non-overlapping pair exists is at most $n^2 (3/4)^k = 2^{2\lg n - k\lg(4/3)}$, where $\lg(4/3) \approx 0.415$. When $k \ge 5 \lg n+250$, this is $\le 1/2^{100}$. So, as a pre-processing step, if $k \ge 5 \lg n + 250$, then we can immediately return "No non-overlapping pair exists" (the probability that this is incorrect is negligibly small); otherwise we run the above algorithm.
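To get a feel for how quickly the union bound $n^2 (3/4)^k$ shrinks as $k$ grows, it can be evaluated in log form. This is a small sketch; the helper name `log2_union_bound` is hypothetical.

```python
import math

def log2_union_bound(n, k):
    """log2 of n^2 * (3/4)^k, the union bound on P(some non-overlapping pair exists)."""
    return 2 * math.log2(n) + k * math.log2(3 / 4)

# A result of -100 or below means the bound is at most 2^-100.
for k in (50, 200, 400):
    print(k, log2_union_bound(1024, k))
```

Each extra bit of width multiplies the bound by $3/4$, that is, lowers its base-2 logarithm by about $0.415$.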

Thus, we achieve a running time of $O(n^{1.6} \min(k, \lg n))$ (or $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) \min(k, \lg n))$ for the two-set variant proposed above), for the special case where the bitvectors are chosen uniformly at random.

Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case -- but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.

Lessons from the warmup

We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many $1$ 's in that position as possible; the more $0$ 's there are, the less reduction in subproblem size you get.

Third, this suggests that the problem gets harder as the density of $1$'s gets smaller: if there are very few $1$'s among the bitvectors (they are mostly $0$'s), the problem looks quite hard, as each split reduces the size of the subproblems only a little bit. So, define the density $\Delta$ to be the fraction of bits that are $1$ (i.e., out of all $nk$ bits), and the density $\Delta(i)$ of bit position $i$ to be the fraction of bitvectors that are $1$ at position $i$.
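Both densities are cheap to compute in $O(nk)$ time. A minimal sketch, assuming `int`-encoded bitvectors with bit $i$ standing for position $i$ (the name `densities` is made up):

```python
def densities(vectors, k):
    """Return (overall density, list of per-position densities)."""
    n = len(vectors)
    # Fraction of vectors that have a 1 at each position i.
    per_position = [sum((v >> i) & 1 for v in vectors) / n for i in range(k)]
    # Fraction of all n*k bits that are 1 is the mean of the per-position values.
    overall = sum(per_position) / k
    return overall, per_position
```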

Handling very low density

As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than $1/\sqrt{k}$ , we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) existence argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.

Why is this the case? Let's say that a pair of bitvectors $x,y$ is covered by bit position $i$ if $x_i=y_i=1$. Note that every pair of overlapping bitvectors must be covered by some bit position. Now, if we fix a particular bit position $i$, the number of pairs that can be covered by that bit position is at most $(n \Delta(i))^2 < n^2/k$. Summing across all $k$ of the bit positions, we find that the total number of pairs that are covered by some bit position is $< n^2$. This means there must exist some pair that's not covered by any bit position, which implies that this pair is non-overlapping. So if the density is sufficiently low in every bit position, then a non-overlapping pair surely exists.
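The hypothesis of this counting argument is easy to test on a concrete input. The sketch below uses hypothetical helper names and integer arithmetic to avoid rounding: it checks whether every position has density below $1/\sqrt{k}$, and computes the quantity $\sum_i (n\Delta(i))^2$ that the argument compares against $n^2$.

```python
def ones_per_position(vectors, k):
    """Number of vectors with a 1 at each position, i.e. n * Delta(i)."""
    return [sum((v >> i) & 1 for v in vectors) for i in range(k)]

def low_density_guarantee(vectors, k):
    """True if Delta(i) < 1/sqrt(k) for every i, so a non-overlapping pair must exist."""
    n = len(vectors)
    # c / n < 1 / sqrt(k)  is equivalent to  c * c * k < n * n
    return all(c * c * k < n * n for c in ones_per_position(vectors, k))

def covered_pairs_bound(vectors, k):
    """Upper bound on the number of ordered pairs covered by some position."""
    return sum(c * c for c in ones_per_position(vectors, k))
```

When `low_density_guarantee` returns `True`, `covered_pairs_bound` is below $n^2$, which is the inequality the existence argument relies on.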

However, I'm at a loss to identify a fast algorithm to find such a non-overlapping pair in this regime, even though one is guaranteed to exist. I don't immediately see any techniques that would yield a running time that has a sub-quadratic dependence on $n$. So, this is a nice special case to focus on, if you want to spend some time thinking about this problem.

Towards a general-case algorithm

In the general case, a natural heuristic seems to be: pick the bit position $i$ with the largest number of $1$'s (i.e., with the highest density), and split on it. In other words:

Find a bit position $i$ that maximizes $\Delta(i)$ .
Split $S$ and $T$ based upon bit position $i$ . In other words, form $S_0 = \{s \in S : s_i=0\}$ , $S_1 = \{s \in S : s_i = 1\}$ , $T_0 = \{t \in T : t_i = 0\}$ , $T_1 = \{t \in T : t_i = 1\}$ .
Now recursively look for a non-overlapping pair from $S_0,T_0$, from $S_0,T_1$, and from $S_1,T_0$. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
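A minimal sketch of this heuristic, again assuming `int`-encoded bitvectors. The name `find_pair_densest` is hypothetical, and for simplicity the sketch recounts the 1's at every call instead of maintaining the densities incrementally.

```python
def find_pair_densest(S, T, positions):
    """Recursive search that always splits on the densest remaining position."""
    if not S or not T:
        return None
    if not positions:
        # No position is left that could force a shared 1.
        return S[0], T[0]

    def ones(i):
        return sum((v >> i) & 1 for v in S + T)

    i = max(positions, key=ones)
    rest = [p for p in positions if p != i]
    S0 = [s for s in S if not (s >> i) & 1]
    S1 = [s for s in S if (s >> i) & 1]
    T0 = [t for t in T if not (t >> i) & 1]
    T1 = [t for t in T if (t >> i) & 1]
    # (S1, T1) is never explored: those pairs overlap at position i.
    for A, B in ((S0, T0), (S0, T1), (S1, T0)):
        found = find_pair_densest(A, B, rest)
        if found is not None:
            return found
    return None
```

A call such as `find_pair_densest(vectors, vectors, list(range(k)))` covers the single-list version of the problem.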

The challenge is to analyze its performance in the worst case.

Let's assume that as a pre-processing step we first compute the density of every bit position. Also, if $\Delta(i) < 1/\sqrt{k}$ for every $i$, assume that the pre-processing step outputs "A non-overlapping pair exists" (I realize that this doesn't exhibit an example of a non-overlapping pair, but let's set that aside as a separate challenge). All this can be done in $O(nk)$ time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to running time.

What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Each level of recursion reduces the problem size by about $n/\sqrt{k}$ bitvectors (e.g., from $n$ bitvectors to $n-n/\sqrt{k}$ bitvectors). Therefore, the recursion can only go about $\sqrt{k}$ levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are a lot less than $3^{\sqrt{k}}$ leaves), so I'm not sure what running time this should lead to.

— D.W.

ad low density: this seems to be some kind of pigeon-hole argument. Maybe if we use your general idea (split w.r.t. the column with the most ones), we get better bounds because the $(S_1, T_1)$-case (we don't recurse to) already gets rid of "most" ones?

— Raphael

The total number of ones may be a useful parameter. You have already shown a lower bound we can use for cutting off the tree; can we show upper bounds, too? For example, if there are more than $ck$ ones, we have at least $c$ overlaps.

— Raphael

By the way, how do you propose we do the first split; arbitrarily? Why not just split the whole input set w.r.t. some column $i$? We only need to recurse in the $0$-case (there is no solution among those that share a one at $i$). In expectation, that gives via $T(n) = T(n/2) + O(nk)$ a bound of $O(nk)$ (if $k$ fixed). For a general bound, you have shown (assuming the lower-bound cutoff you propose) that we get rid of at least $n/\sqrt{k}$ elements with every split, which seems to imply an $O(nk)$ worst-case bound. Or am I missing something?

— Raphael

Ah, that's wrong, of course, since it does not consider 0-1-mismatches. That's what I get for trying to think before breakfast, I guess.

— Raphael

@Raphael, there are two issues: (a) the vectors might be mostly zeros, so you can't count on getting a 50-50 split; the recurrence would be something more like $T(n) = T((n-n/\sqrt{k})k)+O(nk)$, (b) more importantly, it's not enough to just recurse on the 0-subset; you also need to examine pairings between a vector from the 0-subset and a vector from the 1-subset, so there's an additional recursion or two to do. (I think? I hope I got that right.)

— D.W.

8

Faster solution when $n \approx k$ , using matrix multiplication

Suppose that $n = k$ . Our goal is to do better than an $O(n^2k) = O(n^3)$ running time.

We can think of the bitvectors and bit positions as nodes in a graph. There is an edge between a bitvector node and a bit position node when the bitvector has a 1 in that position. The resulting graph is bipartite (with the bitvector-representing nodes on one side and the bitposition-representing nodes on the other), and has $n + k = 2n$ nodes.

Given the adjacency matrix $M$ of a graph, we can tell if there is a two-hop path between two vertices by squaring $M$ and checking if the resulting matrix has an "edge" between those two vertices (i.e. the edge's entry in the squared matrix is non-zero). For our purposes, a zero entry in the squared adjacency matrix corresponds to a non-overlapping pair of bitvectors (i.e. a solution). A lack of any zeroes means there's no solution.

Squaring an $n \times n$ matrix can be done in $O(n^\omega)$ time, where $\omega$ is known to be under $2.373$ and conjectured to be $2$.

So the algorithm is:

Convert the bitvectors and bit positions into a bipartite graph with $n+k$ nodes and at most $nk$ edges. This takes $O(nk)$ time.
Compute the adjacency matrix of the graph. This takes $O((n+k)^2)$ time and space.
Square the adjacency matrix. This takes $O((n+k)^\omega)$ time.
Search the bitvector section of the adjacency matrix for zero entries. This takes $O(n^2)$ time.

The most expensive step is squaring the adjacency matrix. If $n=k$ then the overall algorithm takes $O((n+k)^\omega) = O(n^\omega)$ time, which is better than the naive $O(n^3)$ time.
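The four steps can be sketched as below. This is only an illustration: it uses schoolbook squaring, which costs $O((n+k)^3)$, as a stand-in for a fast matrix-multiplication routine, and the function name is hypothetical.

```python
def find_pair_via_adjacency_square(vectors, k):
    """Build the bipartite adjacency matrix, square it, and look for a zero entry."""
    n = len(vectors)
    size = n + k
    # Nodes 0..n-1 are bitvectors; nodes n..n+k-1 are bit positions.
    A = [[0] * size for _ in range(size)]
    for v, bits in enumerate(vectors):
        for p in range(k):
            if (bits >> p) & 1:
                A[v][n + p] = A[n + p][v] = 1
    # Schoolbook squaring; a sub-cubic algorithm would replace this step.
    A2 = [[sum(A[r][m] * A[m][c] for m in range(size)) for c in range(size)]
          for r in range(size)]
    # Entry (i, j) of the bitvector block counts two-hop paths, i.e. shared 1's.
    for i in range(n):
        for j in range(i + 1, n):
            if A2[i][j] == 0:
                return i, j
    return None
```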

This solution is also faster when $k$ grows not-too-much-slower and not-too-much-faster than $n$. As long as $k \in \Omega(n^{\omega-2})$ and $k \in O(n^\frac{2}{\omega-1})$, then $(n+k)^\omega$ is better than $n^2 k$. For $\omega \approx 2.373$ that translates to $n^{0.373} \leq k \leq n^{1.457}$ (asymptotically). If $\omega$ limits to 2, then the bounds widen towards $n^\epsilon \leq k \leq n^{2-\epsilon}$.
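The two exponents follow from comparing $(n+k)^\omega$ with $n^2 k$ in each regime: for $k \le n$ the left side is $\Theta(n^\omega)$, which wins once $k$ outgrows $n^{\omega-2}$; for $k \ge n$ it is $\Theta(k^\omega)$, which wins while $k^{\omega-1}$ stays below $n^2$. A tiny sketch (hypothetical helper name) evaluates both exponents for a given $\omega$:

```python
def speedup_range(omega):
    """Exponents (lo, hi): squaring beats n^2 k roughly when n^lo <= k <= n^hi."""
    lo = omega - 2        # from n^omega < n^2 * k   (case k <= n)
    hi = 2 / (omega - 1)  # from k^omega < n^2 * k   (case k >= n)
    return lo, hi

print(speedup_range(2.373))
```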

— Craig Gidney

1. This is also better than the naive solution if $k=\Omega(n)$ but $k=o(n^{1.457})$. 2. If $k \ge n$, a heuristic could be: pick a random subset of $n$ bit positions, restrict to those bit positions and use matrix multiplication to enumerate all pairs that don't overlap in those $n$ bit positions; for each such pair, check if it solves the original problem. If there aren't many pairs that don't overlap in those $n$ bit positions, this provides a speedup over the naive algorithm. However I don't know a good upper bound on the number of such pairs.

— D.W.

4

This is equivalent to finding a bit vector which is a subset of the complement of another vector; ie its 1's occur only where 0's occur in the other.

If k (or the number of 1's) is small, you can get $O(n2^k)$ time by simply generating all the subsets of the complement of each bitvector and putting them in a trie (using backtracking). If a bitvector is found in the trie (we can check each before complement-subset insertion) then we have a non-overlapping pair.
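A minimal sketch of this idea, with one deliberate substitution: a Python `dict` stands in for the trie, since only membership lookups are needed here. Subsets of each complement are enumerated with the standard submask trick `sub = (sub - 1) & mask`; the function name is hypothetical.

```python
def find_pair_subset_index(vectors, k):
    """Index all subsets of each vector's complement; probe before inserting."""
    full = (1 << k) - 1
    index = {}  # subset of an earlier vector's complement -> index of that vector
    for j, v in enumerate(vectors):
        if v in index:
            # v fits inside the complement of an earlier vector: they are disjoint.
            return index[v], j
        comp = full & ~v
        sub = comp
        while True:
            index.setdefault(sub, j)
            if sub == 0:
                break
            sub = (sub - 1) & comp  # next smaller submask of comp
    return None
```

Time and space are $O(n 2^k)$ in the worst case, so this only pays off when $k$ (or the number of 0's per vector) is small.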

If the number of 1's or 0's is bounded to an even lower number than k, then the exponent can be replaced by that. The subset-indexing can be on either each vector or its complement, so long as probing uses the opposite.

There's also a scheme for superset-finding in a trie that stores each vector only once, but does bit-skipping during probes, for what I believe is similar aggregate complexity; i.e., it has $O(k)$ insertion but $O(2^k)$ searches.

— KWillets

thanks. The complexity of your solution is $\sim n 2^{(1-p)k}$, where $p$ is the probability of 1's in the bitvector. A couple of implementation details: though this is a slight improvement, there's no need to compute and store the complements in the trie. Just following the complementary branches when checking for a non-overlapping match is enough. And, taking the 0's directly as wildcards, no special wildcard is needed, either.

— Mauro Lacy
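The variant described in this comment can be sketched with a plain binary trie that stores each vector once. During a probe, a 1 in the query forces the 0-branch, while a 0 in the query acts as a wildcard and both branches are tried. The nested-`dict` layout and the function names are assumptions made for illustration.

```python
def trie_insert(root, v, k, label):
    """Store vector v (one bit per level, position 0 first) with the given label."""
    node = root
    for p in range(k):
        node = node.setdefault((v >> p) & 1, {})
    node["label"] = label

def trie_probe(node, q, k, p=0):
    """Return the label of a stored vector that shares no 1 with q, or None."""
    if p == k:
        return node.get("label")
    # A 1 in q allows only the 0-branch; a 0 in q allows both branches.
    branches = (0,) if (q >> p) & 1 else (0, 1)
    for b in branches:
        child = node.get(b)
        if child is not None:
            found = trie_probe(child, q, k, p + 1)
            if found is not None:
                return found
    return None

def find_pair_trie(vectors, k):
    root = {}
    for j, v in enumerate(vectors):
        i = trie_probe(root, v, k)
        if i is not None:
            return i, j
        trie_insert(root, v, k, j)
    return None
```

Insertion touches $k$ nodes, while a probe can branch at every 0 of the query, which is where the exponential worst case comes from.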

2

Represent the bit vectors as an $n\times k$ matrix $M$ . Take $i$ and $j$ between 1 and $n$ .

$(MM^T)_{ij} = \sum_l M_{il} M_{jl}.$

$(MM^T)_{ij}$ , the dot product of the $i$ th and $j$ th vector, is non-zero if, and only if, vectors $i$ and $j$ share a common 1. So, to find a solution, compute $MM^T$ and return the position of a zero entry, if such an entry exists.
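As a small illustration, the product $MM^T$ can be formed directly from the rows of $M$. The sketch below uses schoolbook multiplication, so it is no faster than the naive algorithm; it only shows what a fast multiplication routine would need to compute. The function name is hypothetical.

```python
def find_pair_gram(M):
    """M is a list of n rows of k bits. Return (i, j), i < j, with (M M^T)[i][j] == 0."""
    n = len(M)
    # G[i][j] is the dot product of rows i and j, i.e. the number of shared 1's.
    G = [[sum(a * b for a, b in zip(M[i], M[j])) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if G[i][j] == 0:
                return i, j
    return None

M = [[0, 0, 1, 1, 0],
     [0, 1, 1, 0, 0],
     [1, 1, 0, 0, 0]]
print(find_pair_gram(M))  # (0, 2)
```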

Complexity

Using naive multiplication, this requires $O(n^2k)$ arithmetic operations. If $n=k$ , it takes $O(n^{2.37})$ operations using the utterly impractical Coppersmith-Winograd algorithm, or $O(n^{2.8})$ using the Strassen algorithm. If $k=O(n^{0.302})$ , then the problem may be solved using $n^{2 + o(1)}$ operations.

— Ben

How is this different from Strilanc's answer?

— D.W.

1

@D.W. Using an $n$-by-$k$ matrix instead of an $(n+k)$-by-$(n+k)$ matrix is an improvement. Also it mentions a way to cut off the factor of k when k << n, so that might be useful.

— Craig Gidney

Finding a pair of non-overlapping bitvectors
