Finding a pair of non-overlapping bit vectors


17

You are given a list of $n$ bit vectors of width $k$. The goal is to return two bit vectors from the list that have no 1 in common, or to report that no such pair exists.

For example, given [00110, 01100, 11000], the only solution is {00110, 11000}. The input [111, 011, 110, 101], on the other hand, has no solution. And any list that contains the all-zeros bit vector 000...0 and another element $e$ has the trivial solution {e, 000...0}.

Here is a slightly harder example with no solution (each row is a bit vector; a black square is a 1 and a white square is a 0):

■ ■ ■ ■ □ □ □ □ □ □ □ □ □
■ □ □ □ ■ ■ ■ □ □ □ □ □ □ 
■ □ □ □ □ □ □ ■ ■ ■ □ □ □
■ □ □ □ □ □ □ □ □ □ ■ ■ ■
□ ■ □ □ □ ■ □ □ □ ■ ■ □ □
□ ■ □ □ ■ □ □ □ ■ □ □ □ ■
□ ■ □ □ □ □ ■ ■ □ □ □ ■ □ <-- All row pairs share a black square
□ □ ■ □ □ □ ■ □ ■ □ ■ □ □
□ □ ■ □ □ ■ □ ■ □ □ □ □ ■
□ □ ■ □ ■ □ □ □ □ ■ □ ■ □
□ □ □ ■ ■ □ □ ■ □ □ ■ □ □
□ □ □ ■ □ □ ■ □ □ ■ □ □ ■
□ □ □ ■ □ ■ □ □ ■ □ □ ■ □

How efficiently can we find two non-overlapping bit vectors, or show that none exist?

The naive algorithm of comparing all possible pairs takes $O(n^2 k)$ time. Can we do better?
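For concreteness, the naive algorithm can be sketched as follows (an illustrative Python sketch; modeling each bit vector as a Python integer is my assumption, not part of the question):

```python
# Naive pairwise check: compare every pair of bit vectors, O(n^2 * k).
# Each bit vector is modeled as a Python int, so the "do they share a 1?"
# test is a single bitwise AND.

def naive_disjoint_pair(vectors):
    """Return indices (i, j) with vectors[i] & vectors[j] == 0, or None."""
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if vectors[i] & vectors[j] == 0:
                return (i, j)
    return None
```

On the question's first example, `naive_disjoint_pair([0b00110, 0b01100, 0b11000])` returns `(0, 2)`, matching the unique solution {00110, 11000}.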


Possible reduction: build a graph $G$ with one vertex per vector and an edge between two vertices whenever the corresponding vectors have a 1 in common. We want to know whether the diameter of the graph is 2. But it seems hard to do this faster than $O(n^2 k)$.
François

@FrançoisGodi A connected graph component with at least 3 nodes and a missing edge has diameter at least 2. With an adjacency-list representation, checking for that takes $O(V)$ time.
Craig Gidney

@Strilanc Of course, if there is no solution then the graph is complete (diameter = 1, to be precise), but computing the adjacency-list representation may already take too long.
François

Is $k$ smaller than the machine word width?
Raphael

1
@TomvanderZanden That sounds like it would violate an invariant the data structure relies on; in particular, equality would have to be transitive. I was already thinking about using a trie, and I didn't see how to avoid a factor-of-2 blowup every time the query bitmask has a 0.
Craig Gidney

Answers:


10

Warmup: random bit vectors

As a warmup, let's start with the case where each bit vector is chosen independently and uniformly at random. It turns out the problem can then be solved in $O(n^{1.6} \min(k, \lg n))$ time (more precisely, the exponent 1.6 can be replaced by $\lg 3$).

I'll consider the following two-set variant of the problem:

Given sets $S, T \subseteq \{0,1\}^k$ of bit vectors, determine whether there is a non-overlapping pair $s \in S$, $t \in T$.

The basic technique for solving this is divide-and-conquer. Here is an $O(n^{1.6} k)$-time divide-and-conquer algorithm:

  1. Split $S$ and $T$ based on the first bit position. In other words, form $S_0 = \{s \in S : s_0 = 0\}$, $S_1 = \{s \in S : s_0 = 1\}$, $T_0 = \{t \in T : t_0 = 0\}$, $T_1 = \{t \in T : t_0 = 1\}$.

  2. Now recursively look for a non-overlapping pair from $S_0, T_0$, from $S_0, T_1$, and from $S_1, T_0$. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists". (A pair from $S_1, T_1$ would overlap in the first bit position, so that subproblem can be skipped.)

Since all the bit vectors are chosen at random, we can expect $|S_b| \approx |S|/2$ and $|T_b| \approx |T|/2$. So we make three recursive calls and have reduced the size of the problem by a factor of two (each set is halved). After $\lg \min(|S|, |T|)$ splits, one of the two sets is down to size 1, and the problem can be solved in linear time. We get a recurrence relation along the lines of $T(n) = 3T(n/2) + O(nk)$, whose solution is $T(n) = O(n^{1.6} k)$. Accounting for the running time more precisely in the two-set case, we see that it is $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) \, k)$.

This can be further improved by noting that if $k \ge 2.5 \lg n + 100$, then the probability that a non-overlapping pair exists is exponentially small. In particular, if $x, y$ are two random vectors, the probability that they are non-overlapping is $(3/4)^k$. If $|S| = |T| = n$, there are $n^2$ such pairs, so by a union bound, the probability that a non-overlapping pair exists is at most $n^2 (3/4)^k$. When $k \ge 2.5 \lg n + 100$, this is at most $1/2^{100}$. So, as a pre-processing step, if $k \ge 2.5 \lg n + 100$, we can immediately return "No non-overlapping pair exists" (the probability this is incorrect is negligibly small); otherwise we run the above algorithm.

Thus, for the special case where the bit vectors are chosen at random, we achieve a running time of $O(n^{1.6} \min(k, \lg n))$ (or $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) \min(k, \lg n))$ for the two-set variant presented above).
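The warmup algorithm, including the pre-processing cutoff, can be sketched as follows (my own illustrative code, with bit vectors as Python ints; the function names are made up):

```python
import math

def find_disjoint(S, T, k, b=0):
    """Recursive split on bit position b; returns (s, t) with s & t == 0, or None."""
    if not S or not T:
        return None
    if len(S) == 1 or len(T) == 1 or b >= k:
        # One set is down to size 1 (or all bits are split on): linear scan.
        return next(((s, t) for s in S for t in T if s & t == 0), None)
    mask = 1 << b
    S0 = [s for s in S if not s & mask]; S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]; T1 = [t for t in T if t & mask]
    # (S1, T1) is skipped: those pairs overlap at bit b by construction.
    return (find_disjoint(S0, T0, k, b + 1)
            or find_disjoint(S0, T1, k, b + 1)
            or find_disjoint(S1, T0, k, b + 1))

def solve_random_case(S, T, k):
    """Adds the pre-processing cutoff for uniformly random inputs."""
    if not S or not T:
        return None
    n = max(len(S), len(T))
    if k >= 2.5 * math.log2(n) + 100:
        return None  # w.h.p. no non-overlapping pair exists
    return find_disjoint(S, T, k)
```

The cutoff means the recursion only ever runs when $k = O(\lg n)$, which is where the $\min(k, \lg n)$ factor in the stated bound comes from.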

Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case -- but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.

Lessons from the warmup

We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many 1's in that position as possible; the more 0's there are, the less reduction in subproblem size you get.

Third, this suggests that the problem gets harder as the density of 1's gets smaller -- if there are very few 1's among the bitvectors (they are mostly 0's), the problem looks quite hard, as each split reduces the size of the subproblems a little bit. So, define the density Δ to be the fraction of bits that are 1 (i.e., out of all nk bits), and the density of bit position i to be the fraction of bitvectors that are 1 at position i.

Handling very low density

As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than $1/\sqrt{k}$, we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) existence argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.

Why is this the case? Say that a pair of bit vectors $x, y$ is covered by bit position $i$ if $x_i = y_i = 1$. Every overlapping pair of bit vectors must be covered by some bit position. Now, if we fix a particular bit position $i$, the number of pairs that can be covered by it is at most $(n \Delta(i))^2 < n^2/k$. Summing over all $k$ bit positions, we find that the total number of pairs that are covered by some bit position is $< n^2$. This means there must exist some pair that's not covered by any bit position, which implies that this pair is non-overlapping. So if the density is sufficiently low in every bit position, then a non-overlapping pair surely exists.
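The counting argument can be checked empirically. The sketch below (my own, not from the answer) generates random low-density instances, keeps the ones where every column density is below $1/\sqrt{k}$, and brute-forces them to confirm a non-overlapping pair always exists:

```python
import random

def column_densities(vectors, k):
    """Fraction of vectors with a 1 at each bit position."""
    return [sum((v >> i) & 1 for v in vectors) / len(vectors) for i in range(k)]

def has_disjoint_pair(vectors):
    """Brute-force check for a non-overlapping pair."""
    return any(a & b == 0
               for idx, a in enumerate(vectors)
               for b in vectors[idx + 1:])

random.seed(0)
k, n, checked = 16, 12, 0
while checked < 20:
    # AND of three random words gives expected column density 1/8 < 1/sqrt(16).
    vs = [random.getrandbits(k) & random.getrandbits(k) & random.getrandbits(k)
          for _ in range(n)]
    if max(column_densities(vs, k)) < 1 / k ** 0.5:
        checked += 1
        assert has_disjoint_pair(vs)  # guaranteed by the covering argument
print("verified on", checked, "low-density instances")
```

This only confirms existence on sampled instances, of course; it does not give the fast search algorithm asked for below.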

However, I'm at a loss to identify a fast algorithm to find such a non-overlapping pair in this regime, even though one is guaranteed to exist. I don't immediately see any techniques that would yield a running time with a sub-quadratic dependence on $n$. So, this is a nice special case to focus on, if you want to spend some time thinking about this problem.

Towards a general-case algorithm

In the general case, a natural heuristic seems to be: pick the bit position $i$ with the greatest number of 1's (i.e., with the highest density), and split on it. In other words:

  1. Find a bit position $i$ that maximizes $\Delta(i)$.

  2. Split $S$ and $T$ based upon bit position $i$. In other words, form $S_0 = \{s \in S : s_i = 0\}$, $S_1 = \{s \in S : s_i = 1\}$, $T_0 = \{t \in T : t_i = 0\}$, $T_1 = \{t \in T : t_i = 1\}$.

  3. Now recursively look for a non-overlapping pair from $S_0, T_0$, from $S_0, T_1$, and from $S_1, T_0$. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
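One level of this split might look like the following sketch (illustrative; the helper name is mine):

```python
def densest_split(S, T, k):
    """Pick the bit position with the highest density and partition on it."""
    counts = [sum((v >> i) & 1 for v in S + T) for i in range(k)]
    i = max(range(k), key=counts.__getitem__)  # argmax of Delta(i)
    mask = 1 << i
    S0 = [s for s in S if not s & mask]; S1 = [s for s in S if s & mask]
    T0 = [t for t in T if not t & mask]; T1 = [t for t in T if t & mask]
    # The caller recurses on (S0, T0), (S0, T1) and (S1, T0);
    # (S1, T1) needs no search, since those pairs overlap at position i.
    return i, (S0, T0), (S0, T1), (S1, T0)
```

The denser the chosen column, the more vectors land on the 1-side and the smaller the $S_0, T_0$ subproblem becomes, which is exactly the intuition from the warmup.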

The challenge is to analyze its performance in the worst case.

Let's assume that as a pre-processing step we first compute the density of every bit position. Also, if $\Delta(i) < 1/\sqrt{k}$ for every $i$, assume that the pre-processing step outputs "A non-overlapping pair exists" (I realize that this doesn't exhibit an example of a non-overlapping pair, but let's set that aside as a separate challenge). All this can be done in $O(nk)$ time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to the running time.

What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Each level of recursion reduces the problem size by about $n/k$ bit vectors (e.g., from $n$ bit vectors to $n - n/k$ bit vectors). Therefore, the recursion can only go about $k$ levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are far fewer than $3^k$ leaves), so I'm not sure what running time this should lead to.


ad low density: this seems to be some kind of pigeon-hole argument. Maybe if we use your general idea (split w.r.t. the column with the most ones), we get better bounds because the (S1,T1)-case (we don't recurse to) already gets rid of "most" ones?
Raphael

The total number of ones may be a useful parameter. You have already shown a lower bound we can use for cutting off the tree; can we show upper bounds, too? For example, if there are more than $ck$ ones, we have at least $c$ overlaps.
Raphael

By the way, how do you propose we do the first split; arbitrarily? Why not just split the whole input set w.r.t. some column $i$? We only need to recurse in the 0-case (there is no solution among those that share a one at $i$). In expectation, that gives via $T(n) = T(n/2) + O(nk)$ a bound of $O(nk)$ (if $k$ is fixed). For a general bound, you have shown (assuming the lower-bound cutoff you propose) that we get rid of at least $n/k$ elements with every split, which seems to imply an $O(nk)$ worst-case bound. Or am I missing something?
Raphael

Ah, that's wrong, of course, since it does not consider 0-1-mismatches. That's what I get for trying to think before breakfast, I guess.
Raphael

@Raphael, there are two issues: (a) the vectors might be mostly zeros, so you can't count on getting a 50-50 split; the recurrence would be something more like $T(n) = T(n - n/k) + O(nk)$, (b) more importantly, it's not enough to just recurse on the 0-subset; you also need to examine pairings between a vector from the 0-subset and a vector from the 1-subset, so there's an additional recursion or two to do. (I think? I hope I got that right.)
D.W.

8

Faster solution when $n \approx k$, using matrix multiplication

Suppose that $n = k$. Our goal is to do better than an $O(n^2 k) = O(n^3)$ running time.

We can think of the bitvectors and bit positions as nodes in a graph. There is an edge between a bitvector node and a bit-position node when the bitvector has a 1 in that position. The resulting graph is bipartite (with the bitvector-representing nodes on one side and the bit-position-representing nodes on the other), and has $n + k = 2n$ nodes.

Given the adjacency matrix M of a graph, we can tell if there is a two-hop path between two vertices by squaring M and checking if the resulting matrix has an "edge" between those two vertices (i.e. the edge's entry in the squared matrix is non-zero). For our purposes, a zero entry in the squared adjacency matrix corresponds to a non-overlapping pair of bitvectors (i.e. a solution). A lack of any zeroes means there's no solution.

Squaring an $n \times n$ matrix can be done in $O(n^\omega)$ time, where $\omega$ is known to be under 2.373 and conjectured to be 2.

So the algorithm is:

  • Convert the bitvectors and bit positions into a bipartite graph with $n + k$ nodes and at most $nk$ edges. This takes $O(nk)$ time.
  • Compute the adjacency matrix of the graph. This takes $O((n+k)^2)$ time and space.
  • Square the adjacency matrix. This takes $O((n+k)^\omega)$ time.
  • Search the bitvector section of the squared matrix for zero entries. This takes $O(n^2)$ time.

The most expensive step is squaring the adjacency matrix. If $n = k$, then the overall algorithm takes $O((n+k)^\omega) = O(n^\omega)$ time, which is better than the naive $O(n^3)$ time.
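The four steps above can be sketched as follows (my own illustrative code, using NumPy for the matrix arithmetic; a practical version would plug in a fast matrix-multiplication routine):

```python
import numpy as np

# Node order: first the n bitvector nodes, then the k bit-position nodes.

def disjoint_pair_via_squaring(vectors, k):
    """Return indices (u, w) of a non-overlapping pair, or None."""
    n = len(vectors)
    # Biadjacency: M[v][i] = 1 iff vector v has a 1 at bit position i.
    M = np.array([[(v >> i) & 1 for i in range(k)] for v in vectors])
    A = np.zeros((n + k, n + k), dtype=np.int64)
    A[:n, n:] = M
    A[n:, :n] = M.T
    A2 = A @ A  # A2[u][w] counts two-hop paths between nodes u and w
    # A zero entry between two distinct bitvector nodes means no shared 1.
    for u in range(n):
        for w in range(u + 1, n):
            if A2[u, w] == 0:
                return (u, w)
    return None
```

Note that the top-left $n \times n$ block of $A^2$ is exactly $MM^T$, which is why the final scan only needs to look at the bitvector section.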

This solution is also faster when $k$ grows not-too-much-slower and not-too-much-faster than $n$. As long as $k \in \Omega(n^{\omega-2})$ and $k \in O(n^{2/(\omega-1)})$, $(n+k)^\omega$ is better than $n^2 k$. For $\omega \approx 2.373$ that translates to $n^{0.731} \lesssim k \lesssim n^{1.373}$ (asymptotically). If $\omega$ limits to 2, the bounds widen towards $n^{\epsilon} \lesssim k \lesssim n^{2-\epsilon}$.


1. This is also better than the naive solution if $k = \Omega(n)$ but $k = o(n^{1.457})$. 2. If $k \gg n$, a heuristic could be: pick a random subset of $n$ bit positions, restrict to those bit positions, and use matrix multiplication to enumerate all pairs that don't overlap in those $n$ bit positions; for each such pair, check whether it solves the original problem. If there aren't many pairs that don't overlap in those $n$ bit positions, this provides a speedup over the naive algorithm. However, I don't know a good upper bound on the number of such pairs.
D.W.

4

This is equivalent to finding a bit vector that is a subset of the complement of another vector; i.e., its 1's occur only where 0's occur in the other.

If $k$ (or the number of 1's) is small, you can get $O(n 2^k)$ time by simply generating all the subsets of the complement of each bitvector and putting them in a trie (using backtracking). If a bitvector is found in the trie (we can check for each before its complement-subset insertion), then we have a non-overlapping pair.

If the number of 1's or 0's is bounded to an even lower number than k, then the exponent can be replaced by that. The subset-indexing can be on either each vector or its complement, so long as probing uses the opposite.

There's also a scheme for superset-finding in a trie that stores each vector only once, but does bit-skipping during probes, for what I believe is a similar aggregate complexity; i.e., it has $O(k)$ insertion but $O(2^k)$ searches.
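A sketch of the subset-indexing idea (illustrative; a flat Python dict stands in for the trie, and the enumeration loop is the standard `(sub - 1) & mask` submask trick):

```python
# For each vector v (in order), first probe: if v was stored as a submask of
# some earlier vector's complement, that earlier vector and v don't overlap.
# Then insert every submask of ~v, so later vectors can find v the same way.
# Total work is O(n * 2^k) in the worst case, as in the answer.

def disjoint_pair_via_subsets(vectors, k):
    """Return indices (i, j) of a non-overlapping pair, or None."""
    full = (1 << k) - 1
    seen = {}  # submask of an earlier complement -> index of that vector
    for idx, v in enumerate(vectors):
        if v in seen:
            return (seen[v], idx)
        comp = full & ~v
        sub = comp
        while True:
            seen.setdefault(sub, idx)  # keep the earliest owner of each submask
            if sub == 0:
                break
            sub = (sub - 1) & comp
    return None
```

`seen.setdefault` keeps the first vector whose complement produced each submask, so the returned pair always has the smaller index first.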


Thanks. The complexity of your solution is $n 2^{(1-p)k}$, where $p$ is the probability of 1's in the bitvectors. A couple of implementation details: though this is a slight improvement, there's no need to compute and store the complements in the trie. Just following the complementary branches when checking for a non-overlapping match is enough. And, taking the 0's directly as wildcards, no special wildcard is needed, either.
Mauro Lacy

2

Represent the bit vectors as an $n \times k$ matrix $M$. Take $i$ and $j$ between 1 and $n$.

$(MM^T)_{ij} = \sum_l M_{il} M_{jl}.$

$(MM^T)_{ij}$, the dot product of the $i$th and $j$th vectors, is non-zero if, and only if, vectors $i$ and $j$ share a common 1. So, to find a solution, compute $MM^T$ and return the position of a zero entry, if such an entry exists.

Complexity

Using naive multiplication, this requires $O(n^2 k)$ arithmetic operations. If $n = k$, it takes $O(n^{2.37})$ operations using the utterly impractical Coppersmith-Winograd algorithm, or $O(n^{2.8})$ using the Strassen algorithm. If $k = O(n^{0.302})$, then the problem may be solved using $n^{2+o(1)}$ operations.
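In code, this answer's approach reduces to a few lines (my own illustrative NumPy sketch; a fast matrix-multiplication library call plays the role of the $O(n^\omega)$ algorithm):

```python
import numpy as np

def disjoint_pair_via_gram(vectors, k):
    """A zero entry of M @ M.T above the diagonal is a non-overlapping pair."""
    M = np.array([[(v >> i) & 1 for i in range(k)] for v in vectors])
    G = M @ M.T  # G[i, j] = number of positions where vectors i and j share a 1
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if G[i, j] == 0:
                return (i, j)
    return None
```

Compared to the bipartite-graph answer, this multiplies an $n \times k$ matrix by its transpose directly instead of squaring an $(n+k) \times (n+k)$ matrix.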


How is this different from Strilanc's answer?
D.W.

1
@D.W. Using an n-by-k matrix instead of an (n+k)-by-(n+k) matrix is an improvement. Also it mentions a way to cut off the factor of k when k << n, so that might be useful.
Craig Gidney
Licensed under cc by-sa 3.0 with attribution required.