칼럼니스트 존 벤틀리 (Jon Bentley)가 도널드 크 누스 (Donald Knuth)에게 파일에서 k 개의 가장 빈번한 단어를 찾을 수있는 프로그램을 작성해달라고 요청했을 때 이것은 아마도 1986 년에 어떤 공명을 불러 일으켰던 고전적인 코딩 과제 중 하나 일 것입니다. Knuth 는 8 페이지짜리 프로그램에서 해시를 사용하여 문맹 프로그래밍 기술을 설명 하는 빠른 솔루션 을 구현했습니다 . Bell Labs의 Douglas McIlroy는 Knuth의 솔루션이 성경의 전체 텍스트를 처리 할 수 없다고 비판하고 한 줄짜리로 답했습니다.

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q

1987 년 에 프린스턴 교수가 또 다른 해결책 으로 후속 기사 를 출판했습니다. 그러나 그것은 또한 하나의 성경에 대한 결과를 반환 할 수도 없었습니다!

문제 설명

원래 문제 설명 :

텍스트 파일과 정수 k가 주어지면, 파일에서 가장 일반적인 k 개의 단어 (및 그 수)를 줄이면서 인쇄해야합니다.

추가 문제 설명 :

크 누스는 단어를 라틴 문자의 문자열로 정의했습니다. [A-Za-z]+
다른 모든 문자는 무시됩니다
대문자와 소문자는 동등한 것으로 간주됩니다 ( WoRd== word)
파일 크기 나 단어 길이 제한 없음
연속 단어 사이의 거리는 임의로 클 수 있습니다
가장 빠른 프로그램은 총 CPU 시간을 최소로 사용하는 프로그램입니다 (멀티 스레딩은 도움이되지 않습니다)

샘플 테스트 사례

테스트 1 : James Joyce의 Ulysses 가 64 번 연결되었습니다 (96MB 파일).

Project Gutenberg에서 Ulysses 를 다운로드하십시오 .wget http://www.gutenberg.org/files/4300/4300-0.txt
그것을 64 번 연결하십시오 : for i in {1..64}; do cat 4300-0.txt >> ulysses64; done
가장 빈번한 단어는 968832 모양의 "the"입니다.

테스트 2 : 특수하게 생성 된 임의의 텍스트 giganovel(약 1GB)

Python 3 생성기 스크립트는 여기에 있습니다 .
본문에는 자연어와 유사하게 나타나는 148391 개의 독특한 단어가 포함되어 있습니다.
가장 빈번한 단어 :“e”(11309 모양) 및“ihit”(11290 모양).

일반성 테스트 : 임의로 큰 간격이있는 임의로 큰 단어.

참조 구현

이 문제에 대해 Rosetta Code 를 살펴보고 많은 구현이 엄청나게 느리다는 것을 깨닫고 (쉘 스크립트보다 느림) 여기 에서 몇 가지 좋은 구현을 테스트했습니다 . 아래는 ulysses64시간 복잡성과 함께 성능을 나타냅니다 .

                                     ulysses64      Time complexity
C++ (prefix trie + heap)             4.145          O((N + k) log k)
Python (Counter)                     10.547         O(N + k log Q)
AWK + sort                           20.606         O(N + Q log Q)
McIlroy (tr + sort + uniq)           43.554         O(N log N)

당신은 그것을 이길 수 있습니까?

테스팅

성능은 표준 Unix time명령 ( "사용자"시간) 과 함께 2017 13 "MacBook Pro를 사용하여 평가됩니다 . 가능하면 최신 컴파일러를 사용하십시오 (예 : 기존 버전이 아닌 최신 Haskell 버전 사용).

지금까지의 순위

참조 프로그램을 포함한 타이밍 :

                                              k=10                  k=100K
                                     ulysses64      giganovel      giganovel
C++ (trie) by ShreevatsaR            0.671          4.227          4.276
C (trie + bins) by Moogie            0.704          9.568          9.459
C (trie + list) by Moogie            0.767          6.051          82.306
C++ (hash trie) by ShreevatsaR       0.788          5.283          5.390
C (trie + sorted list) by Moogie     0.804          7.076          x
Rust (trie) by Anders Kaseorg        0.842          6.932          7.503
J by miles                           1.273          22.365         22.637
C# (trie) by recursive               3.722          25.378         24.771
C++ (trie + heap)                    4.145          42.631         72.138
APL (Dyalog Unicode) by Adám         7.680          x              x
Python (dict) by movatica            9.387          99.118         100.859
Python (Counter)                     10.547         102.822        103.930
Ruby (tally) by daniero              15.139         171.095        171.551
AWK + sort                           20.606         213.366        222.782
McIlroy (tr + sort + uniq)           43.554         715.602        750.420

누적 순위 * (%, 최고 점수 – 300) :

#     Program                         Score  Generality
 1  C++ (trie) by ShreevatsaR           300     Yes
 2  C++ (hash trie) by ShreevatsaR      368      x
 3  Rust (trie) by Anders Kaseorg       465     Yes
 4  C (trie + bins) by Moogie           552      x
 5  J by miles                         1248     Yes
 6  C# (trie) by recursive             1734      x
 7  C (trie + list) by Moogie          2182      x
 8  C++ (trie + heap)                  3313      x
 9  Python (dict) by movatica          6103     Yes
10  Python (Counter)                   6435     Yes
11  Ruby (tally) by daniero           10316     Yes
12  AWK + sort                        13329     Yes
13  McIlroy (tr + sort + uniq)        40970     Yes

* 3 가지 테스트 각각에서 최고의 프로그램에 비해 시간 성능이 우수합니다.

지금까지 최고의 프로그램 : here (두 번째 솔루션)

fastest-code

— 안드리 마 쿠카
소스

점수는 율리시스의 시간입니까? 그것은 묵시적 것 같다 있지만 명시 적으로 말했다하지 않습니다

— 포스트 록 Garf 헌터

@ SriotchilismO'Zaic, 지금은 그렇습니다. 그러나 더 큰 테스트 사례가 뒤따를 수 있으므로 첫 번째 테스트 사례에 의존해서는 안됩니다. ulysses64는 반복적이라는 명백한 단점이 있습니다. 파일의 1/64 이후에는 새로운 단어가 나타나지 않습니다. 따라서 매우 좋은 테스트 사례는 아니지만 배포 (또는 재생산)가 쉽습니다.

— Andriy Makukha

3

이전에 얘기했던 숨겨진 테스트 사례를 의미했습니다. 실제 텍스트를 공개 할 때 해시를 게시하면 답변에 공평하고 왕이 아닌 것을 확인할 수 있습니다. Ulysses의 해시가 다소 유용하다고 생각하지만.

— Post Rock Garf Hunter

1

@tsh 그것은 나의 이해입니다 : 예를 들어 e와 g라는 두 단어

— Moogie

1

@AndriyMakukha 아, 감사합니다. 그것들은 단지 버그 일뿐입니다. 나는 그들을 고쳤다.

— Anders Kaseorg

5

[씨]

다음은 2.8Ghz Xeon W3530에서 테스트 1 에 대해 1.6 초 이내에 실행됩니다 . Windows 7에서 MinGW.org GCC-6.3.0-1을 사용하여 빌드 :

입력으로 두 개의 인수가 필요합니다 (텍스트 파일 경로 및 k 개의 가장 빈번한 단어 나열)

단순히 단어의 글자로 분기 된 트리를 만든 다음 잎 글자에서 카운터를 증가시킵니다. 그런 다음 현재 리프 카운터가 가장 빈번한 단어 목록에서 가장 작은 단어보다 큰지 확인합니다. (목록 크기는 명령 행 인수를 통해 결정된 숫자입니다.) 그렇다면 리프 문자로 표시되는 단어를 가장 자주 사용하도록 승격하십시오. 이것은 더 이상 문자를 읽을 때까지 반복됩니다. 그 후 가장 빈번한 단어 목록은 가장 빈번한 단어 목록에서 가장 빈번한 단어에 대한 비효율적 인 반복 검색을 통해 출력됩니다.

현재 처리 시간을 출력하도록 기본 설정되어 있지만 다른 제출과의 일관성을 위해 소스 코드에서 TIMING 정의를 비활성화하십시오.

또한 업무용 컴퓨터에서 제출했으며 테스트 2 텍스트를 다운로드 할 수 없습니다. 수정없이이 테스트 2에서 작동해야하지만 MAX_LETTER_INSTANCES 값을 늘려야 할 수도 있습니다.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

// increase this if needing to output more top frequent words
#define MAX_TOP_FREQUENT_WORDS 1000

#define false 0
#define true 1
#define null 0

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    char mostFrequentWord;
    struct Letter* parent;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0 || k> MAX_TOP_FREQUENT_WORDS)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n");
        printf("NOTE: upto %d most frequent words can be requested\n\n",MAX_TOP_FREQUENT_WORDS);
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
    fptr = fopen(argv[1], "rb");
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    if (fptr) fclose(fptr);

    unsigned int chr;
    unsigned int i;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], 0, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;

    // pesudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;
    root->mostFrequentWord = false;
    root->count = 0;

    // the next letter to be processed
    Letter* nextLetter = null;

    // store of the top most frequent words
    Letter* topWords[MAX_TOP_FREQUENT_WORDS];

    // initialise the top most frequent words
    for (i = 0; i<k; i++)
    {
        topWords[i]=root;
    }

    unsigned int lowestWordCount = 0;
    unsigned int lowestWordIndex = 0;
    unsigned int highestWordCount = 0;
    unsigned int highestWordIndex = 0;

    // main loop
    for (int j=0;j<dataLength;j++)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, intialise the new letter
            if (nextLetter == null)
            {
                nextLetter = &letters[letterMasterIndex++];
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        // not a letter so this means the current letter is the last letter of a word (if any letters)
        else if (currentLetter!=root)
        {

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // ignore this word if already identified as a most frequent word
            if (!currentLetter->mostFrequentWord)
            {
                // update the list of most frequent words
                // by replacing the most infrequent top word if this word is more frequent
                if (currentLetter->count> lowestWordCount)
                {
                    currentLetter->mostFrequentWord = true;
                    topWords[lowestWordIndex]->mostFrequentWord = false;
                    topWords[lowestWordIndex] = currentLetter;
                    lowestWordCount = currentLetter->count;

                    // update the index and count of the next most infrequent top word
                    for (i=0;i<k; i++)
                    {
                        // if the topword  is root then it can immediately be replaced by this current word, otherwise test
                        // whether the top word is less than the lowest word count
                        if (topWords[i]==root || topWords[i]->count<lowestWordCount)
                        {
                            lowestWordCount = topWords[i]->count;
                            lowestWordIndex = i;
                        }
                    }
                }
            }

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

    // print out the top frequent words and counts
    char string[256];
    char tmp[256];

    while (k > 0 )
    {
        highestWordCount = 0;
        string[0]=0;
        tmp[0]=0;

        // find next most frequent word
        for (i=0;i<k; i++)
        {
            if (topWords[i]->count>highestWordCount)
            {
                highestWordCount = topWords[i]->count;
                highestWordIndex = i;
            }
        }

        Letter* letter = topWords[highestWordIndex];

        // swap the end top word with the found word and decrement the number of top words
        topWords[highestWordIndex] = topWords[--k];

        if (highestWordCount > 0)
        {
            // construct string of letters to form the word
            while (letter != root)
            {
                memmove(&tmp[1],&string[0],255);
                tmp[0]=letter->asciiCode+97;
                memmove(&string[0],&tmp[0],255);
                letter=letter->parent;
            }

            printf("%u %s\n",highestWordCount,string);
        }
    }

    free( data );
    free( letters );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}

테스트 1과 상위 10 개의 빈번한 단어 및 타이밍이 활성화 된 경우 다음을 인쇄해야합니다.

 968832 the
 528960 of
 466432 and
 421184 a
 322624 to
 320512 in
 270528 he
 213120 his
 191808 i
 182144 s

 Time Taken: 1.549155 seconds

— 무기
소스

감동적인! list를 사용하면 최악의 경우 O (Nk)가되므로 k = 100,000의 giganovel에 대한 참조 C ++ 프로그램보다 느리게 실행됩니다. 그러나 k << N의 경우 확실한 승자입니다.

— Andriy Makukha

1

@AndriyMakukha 감사합니다! 나는 그런 간단한 구현이 큰 속도를 냈다는 것에 약간 놀랐다. 목록을 정렬하여 더 큰 k 값을 더 잘 만들 수 있습니다. (목록 순서가 느리게 변경되므로 정렬이 너무 비싸지 않아야 함) 복잡성이 추가되고 k 값이 낮을수록 속도에 영향을 줄 수 있습니다. 실험해야 할 것

— Moogie

그래, 나도 놀랐어. 참조 프로그램이 많은 함수 호출을 사용하고 컴파일러가이를 올바르게 최적화하지 못하기 때문일 수 있습니다.

— Andriy Makukha

또 다른 성능 이점은 letters배열 의 반 정적 할당에서 비롯된 반면 참조 구현에서는 트리 노드를 동적으로 할당합니다.

— Andriy Makukha

mmap-ing는해야 더 빨리 (~ 내 리눅스 노트북에 5 %) : #include<sys/mman.h>, <sys/stat.h>, <fcntl.h>,로 파일 읽기 대체 int d=open(argv[1],0);struct stat s;fstat(d,&s);dataLength=s.st_size;data=mmap(0,dataLength,1,1,d,0);및 주석으로free(data);

— NGN

4

녹

내 컴퓨터에서 이것은 Moogie의 C "접두사 트리 + 쓰레기통"C 솔루션보다 약 42 % 더 빠른 giganovel 100000을 실행합니다 (10.64 초 대 18.24 초). 또한 단어 길이, 고유 단어, 반복 단어 등에 대한 제한을 미리 정의하는 C 솔루션과 달리 미리 정의 된 제한이 없습니다.

`src/main.rs`

use memmap::MmapOptions;
use pdqselect::select_by_key;
use std::cmp::Reverse;
use std::default::Default;
use std::env::args;
use std::fs::File;
use std::io::{self, Write};
use typed_arena::Arena;

#[derive(Default)]
struct Trie<'a> {
    nodes: [Option<&'a mut Trie<'a>>; 26],
    count: u64,
}

fn main() -> io::Result<()> {
    // Parse arguments
    let mut args = args();
    args.next().unwrap();
    let filename = args.next().unwrap();
    let size = args.next().unwrap().parse().unwrap();

    // Open input
    let file = File::open(filename)?;
    let mmap = unsafe { MmapOptions::new().map(&file)? };

    // Build trie
    let arena = Arena::new();
    let mut num_words = 0;
    let mut root = Trie::default();
    {
        let mut node = &mut root;
        for byte in &mmap[..] {
            let letter = (byte | 32).wrapping_sub(b'a');
            if let Some(child) = node.nodes.get_mut(letter as usize) {
                node = child.get_or_insert_with(|| {
                    num_words += 1;
                    arena.alloc(Default::default())
                });
            } else {
                node.count += 1;
                node = &mut root;
            }
        }
        node.count += 1;
    }

    // Extract all counts
    let mut index = 0;
    let mut counts = Vec::with_capacity(num_words);
    let mut stack = vec![root.nodes.iter()];
    'a: while let Some(frame) = stack.last_mut() {
        while let Some(child) = frame.next() {
            if let Some(child) = child {
                if child.count != 0 {
                    counts.push((child.count, index));
                    index += 1;
                }
                stack.push(child.nodes.iter());
                continue 'a;
            }
        }
        stack.pop();
    }

    // Find frequent counts
    select_by_key(&mut counts, size, |&(count, _)| Reverse(count));
    // Or, in nightly Rust:
    //counts.partition_at_index_by_key(size, |&(count, _)| Reverse(count));

    // Extract frequent words
    let size = size.min(counts.len());
    counts[0..size].sort_by_key(|&(_, index)| index);
    let mut out = Vec::with_capacity(size);
    let mut it = counts[0..size].iter();
    if let Some(mut next) = it.next() {
        index = 0;
        stack.push(root.nodes.iter());
        let mut word = vec![b'a' - 1];
        'b: while let Some(frame) = stack.last_mut() {
            while let Some(child) = frame.next() {
                *word.last_mut().unwrap() += 1;
                if let Some(child) = child {
                    if child.count != 0 {
                        if index == next.1 {
                            out.push((word.to_vec(), next.0));
                            if let Some(next1) = it.next() {
                                next = next1;
                            } else {
                                break 'b;
                            }
                        }
                        index += 1;
                    }
                    stack.push(child.nodes.iter());
                    word.push(b'a' - 1);
                    continue 'b;
                }
            }
            stack.pop();
            word.pop();
        }
    }
    out.sort_by_key(|&(_, count)| Reverse(count));

    // Print results
    let stdout = io::stdout();
    let mut stdout = io::BufWriter::new(stdout.lock());
    for (word, count) in out {
        stdout.write_all(&word)?;
        writeln!(stdout, " {}", count)?;
    }

    Ok(())
}

`Cargo.toml`

[package]
name = "frequent"
version = "0.1.0"
authors = ["Anders Kaseorg <andersk@mit.edu>"]
edition = "2018"

[dependencies]
memmap = "0.7.0"
typed-arena = "1.4.1"
pdqselect = "0.1.0"

[profile.release]
lto = true
opt-level = 3

용법

cargo build --release
time target/release/frequent ulysses64 10

— 앤더스 카 소르 그
소스

1

훌륭한! 세 가지 설정 모두에서 매우 우수한 성능. 나는 문자 그대로 Carol Nichols가 Rust에 대해 최근에 이야기 한 것을 보았습니다. :) 다소 특이한 구문이지만 언어를 배우게되어 기쁩니다. 개발자의 삶을 훨씬 쉽게 만드는 동시에 많은 성능을 희생하십시오.

— Andriy Makukha

아주 빨리! 내가 감동! C (tree + bin)에 대한 더 나은 컴파일러 옵션이 비슷한 결과를 줄지 궁금합니다.

— Moogie

@Moogie 나는 이미와 당신을 테스트했다 -O3, 그리고 -Ofast측정 가능한 차이를하지 않습니다.

— Anders Kaseorg

@Moogie, 나는 같은 코드를 컴파일하고있었습니다 gcc -O3 -march=native -mtune=native program.c.

— Andriy Makukha

@ 안드리 Makukha 아. 그것은 당신이 얻는 결과와 내 결과 사이의 속도의 큰 차이를 설명합니다 : 이미 최적화 플래그를 적용하고 있습니다. 큰 코드 최적화가 많이 남아 있다고 생각하지 않습니다. mingw는 구현이 없기 때문에 다른 사람들이 제안한대로 map을 사용하여 테스트 할 수 없습니다 ... 그리고 5 % 만 증가시킵니다. 앤더스의 멋진 입장에 항복해야 할 것 같습니다. 잘 했어!

— Moogie

3

APL (Dyalog Unicode)

다음은 Windows 10에서 64 비트 Dyalog APL 17.0을 사용하는 2.6Ghz i7-4720HQ에서 8 초 안에 실행됩니다.

⎕{m[⍺↑⍒⊢/m←{(⊂⎕UCS⊃⍺),≢⍵}⌸(⊢⊆⍨96∘<∧<∘123)83⎕DR 819⌶80 ¯1⎕MAP⍵;]}⍞

먼저 파일 이름을 묻는 메시지를 표시 한 다음 k를 입력하라는 메시지를 표시합니다. 실행 시간의 상당 부분 (약 1 초)은 파일을 읽는 것입니다.

시간을 내려면 다음을 dyalog실행 파일 로 파이프 할 수 있어야합니다 (가장 빈번한 단어 10 개).

⎕{m[⍺↑⍒⊢/m←{(⊂⎕UCS⊃⍺),≢⍵}⌸(⊢⊆⍨96∘<∧<∘123)83⎕DR 819⌶80 ¯1⎕MAP⍵;]}⍞
/tmp/ulysses64
10
⎕OFF

인쇄해야합니다.

 the  968832
 of   528960
 and  466432
 a    421184
 to   322624
 in   320512
 he   270528
 his  213120
 i    191808
 s    182144

— 아담
소스

아주 좋아요! 그것은 파이썬을 이깁니다. 후에 가장 잘 작동했습니다 export MAXWS=4096M. 해시 테이블을 사용합니까? 작업 공간 크기를 2GB로 줄이면 전체 2 초가 느려집니다.

— Andriy Makukha

@AndriyMakukha 예 ∊에 따라 해시 테이블을 사용하는 이 , 나는 확신 ⌸도 내부적으로한다.

— Adám

왜 O (N log N)입니까? Python (모든 고유 단어의 힙을 복원하는 k 배) 또는 AWK (고유 단어 만 정렬) 솔루션처럼 보입니다. McIlroy의 쉘 스크립트에서와 같이 모든 단어를 정렬하지 않으면 O (N log N)가되어서는 안됩니다.

— Andriy Makukha

@AndriyMakukha 그것은 모든 카운트를 등급 . 성능 담당자 가 저에게 쓴 내용은 다음과 같습니다 . 해시 테이블에 대해 이론적으로 모호한 사항을 믿지 않는 한 시간 복잡도는 O (N log N)입니다.이 경우 O (N)입니다.

— Adám

글쎄, 8, 16 및 32 Ulysses에 대해 코드를 실행하면 정확히 선형으로 느려집니다. 아마도 성능 담당자가 해시 테이블의 시간 복잡성에 대한 견해를 재고해야 할 수도 있습니다.) 또한이 코드는 더 큰 테스트 사례에서는 작동하지 않습니다. WS FULL작업 공간을 6GB로 늘려도을 반환합니다 .

— Andriy Makukha

2

[C] 접두사 트리 + 저장소

참고 : 사용 된 컴파일러는 프로그램 실행 속도에 큰 영향을줍니다! gcc (MinGW.org GCC-8.2.0-3) 8.2.0을 사용했습니다. 사용하는 경우 -Ofast 스위치를, 프로그램은 거의 50 % 더 빨리 일반적으로 컴파일 된 프로그램에 비해 실행됩니다.

알고리즘 복잡성

나는 내가 수행중인 Bin 정렬 이 Pigeonhost 정렬 의 한 형태 라는 것을 깨달았습니다. 이것은이 솔루션의 Big O 복잡성을 유도 할 수 있음을 의미합니다.

나는 그것을 다음과 같이 계산합니다.

Worst Time complexity: O(1 + N + k)
Worst Space complexity: O(26*M + N + n) = O(M + N + n)

Where N is the number of words of the data
and M is the number of letters of the data
and n is the range of pigeon holes
and k is the desired number of sorted words to return
and N<=M

트리 구조의 복잡성은 트리 순회와 동일하므로 어떤 레벨에서 순회해야하는 올바른 노드가 O (1)이기 때문에 (각 문자가 노드에 직접 매핑되고 각 문자에 대해 항상 한 레벨의 트리 만 순회하므로)

비둘기 구멍 정렬은 O (N + n)입니다. 여기서 n은 키 값의 범위입니다. 그러나이 문제의 경우 모든 값을 정렬 할 필요는 없습니다. k 숫자 만 있으므로 최악의 경우 O (N + k)가됩니다.

함께 결합하면 O (1 + N + k)가 생성됩니다.

트리 구성을위한 공간 복잡도는 데이터가 M 개의 문자로 구성된 하나의 단어로 구성되고 각 노드에 26 개의 노드가있는 경우 (즉, 알파벳 문자의 경우) 최악의 경우 26 * M 노드이기 때문입니다. 따라서 O (26 * M) = O (M)

비둘기 구멍 정렬의 경우 공간 복잡도가 O (N + n)입니다.

함께 결합하면 O (26 * M + N + n) = O (M + N + n)

연산

입력으로 두 개의 인수가 필요합니다 (텍스트 파일 경로 및 k 개의 가장 빈번한 단어 나열)

내 다른 항목을 기반 으로이 버전은 다른 솔루션에 비해 k 값이 증가하여 시간 비용이 매우 적습니다. 그러나 k 값이 낮을수록 눈에 띄게 느리지 만 k 값이 클수록 훨씬 빠릅니다.

단어 글자로 분기 된 트리를 만든 다음 리프 글자에서 카운터를 증가시킵니다. 그런 다음 단어를 같은 크기의 단어 묶음에 추가합니다 (먼저 단어가 이미 있던 bin에서 제거한 후). 이것은 더 이상 문자를 읽을 때까지 반복됩니다. 그 후 빈은 가장 큰 빈부터 시작하여 k 번 역 반복되고 각 빈의 단어가 출력됩니다.

현재 처리 시간을 출력하도록 기본 설정되어 있지만 다른 제출과의 일관성을 위해 소스 코드에서 TIMING 정의를 비활성화하십시오.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

// may need to increase if the source text has many repeated words
#define MAX_BINS 1000000

// assume maximum of 20 letters in a word... adjust accordingly
#define MAX_LETTERS_IN_A_WORD 20

// assume maximum of 10 letters for the string representation of the bin number... adjust accordingly
#define MAX_LETTERS_FOR_BIN_NAME 10

// maximum number of bytes of the output results
#define MAX_OUTPUT_SIZE 10000000

#define false 0
#define true 1
#define null 0
#define SPACE_ASCII_CODE 32

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    //char isAWord;
    struct Letter* parent;
    struct Letter* binElementNext;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

struct Bin
{
  struct Letter* word;
};
typedef struct Bin Bin;


int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n\n");
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
    fptr = fopen(argv[1], "rb");
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    if (fptr) fclose(fptr);

    unsigned int chr;
    unsigned int i, j;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], null, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the memory for bins
    Bin* bins = (Bin*) malloc(sizeof(Bin) * MAX_BINS);
    memset(&bins[0], null, sizeof( Bin) * MAX_BINS);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;
    Letter *nextFreeLetter = &letters[0];

    // pesudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;

    // the next letter to be processed
    Letter* nextLetter = null;

    unsigned int sortedListSize = 0;

    // the count of the most frequent word
    unsigned int maxCount = 0;

    // the count of the current word
    unsigned int wordCount = 0;

////////////////////////////////////////////////////////////////////////////////////////////
// CREATING PREFIX TREE
    j=dataLength;
    while (--j>0)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, intialise the new letter
            if (nextLetter == null)
            {
                ++letterMasterIndex;
                nextLetter = ++nextFreeLetter;
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        else
        {
            //currentLetter->isAWord = true;

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

////////////////////////////////////////////////////////////////////////////////////////////
// ADDING TO BINS

    j = letterMasterIndex;
    currentLetter=&letters[j-1];
    while (--j>0)
    {

      // is the letter the leaf letter of word?
      if (currentLetter->count>0)
      {
        i = currentLetter->count;
        if (maxCount < i) maxCount = i;

        // add to bin
        currentLetter->binElementNext = bins[i].word;
        bins[i].word = currentLetter;
      }
      --currentLetter;
    }

////////////////////////////////////////////////////////////////////////////////////////////
// PRINTING OUTPUT

    // the memory for output
    char* output = (char*) malloc(sizeof(char) * MAX_OUTPUT_SIZE);
    memset(&output[0], SPACE_ASCII_CODE, sizeof( char) * MAX_OUTPUT_SIZE);
    unsigned int outputIndex = 0;

    // string representation of the current bin number
    char binName[MAX_LETTERS_FOR_BIN_NAME];
    memset(&binName[0], SPACE_ASCII_CODE, MAX_LETTERS_FOR_BIN_NAME);


    Letter* letter;
    Letter* binElement;

    // starting at the bin representing the most frequent word(s) and then iterating backwards...
    for ( i=maxCount;i>0 && k>0;i--)
    {
      // check to ensure that the bin has at least one word
      if ((binElement = bins[i].word) != null)
      {
        // update the bin name
        sprintf(binName,"%u",i);

        // iterate of the words in the bin
        while (binElement !=null && k>0)
        {
          // stop if we have reached the desired number of outputed words
          if (k-- > 0)
          {
              letter = binElement;

              // add the bin name to the output
              memcpy(&output[outputIndex],&binName[0],MAX_LETTERS_FOR_BIN_NAME);
              outputIndex+=MAX_LETTERS_FOR_BIN_NAME;

              // construct string of letters to form the word
               while (letter != root)
              {
                // output the letter to the output
                output[outputIndex++] = letter->asciiCode+97;
                letter=letter->parent;
              }

              output[outputIndex++] = '\n';

              // go to the next word in the bin
              binElement = binElement->binElementNext;
          }
        }
      }
    }

    // write the output to std out
    fwrite(output, 1, outputIndex, stdout);
   // fflush(stdout);

   // free( data );
   // free( letters );
   // free( bins );
   // free( output );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}

편집 : 이제 나무가 구성되고 출력 구성이 최적화 될 때까지 채우기 빈을 지연시킵니다.

EDIT2 : 속도 최적화를 위해 어레이 액세스 대신 포인터 산술을 사용합니다.

— 무기
소스

와! 11 초 만에 1GB 파일에서 10 만 개의 가장 빈번한 단어 ... 일종의 마술처럼 보입니다.

— Andriy Makukha

트릭 없음 ... 효율적인 메모리 사용을 위해 CPU 시간을 교환하십시오. 나는 당신의 결과에 놀랐습니다 ... 내 오래된 PC에서는 60 초 이상 걸립니다. II는 불필요한 비교를 수행하고 파일이 처리 될 때까지 비닝을 연기 할 수 있음을 알았습니다. 더 빨라져야합니다. 곧 시도하고 답변을 업데이트하겠습니다.

— Moogie

@AndriyMakukha 나는 이제 모든 단어가 처리되고 나무가 구성 될 때까지 쓰레기통 채우기를 연기했습니다. 이것은 불필요한 비교와 빈 요소 조작을 피합니다. 또한 인쇄하는 데 상당한 시간이 걸리는 것으로 출력이 구성되는 방식을 변경했습니다!

— Moogie

내 컴퓨터에서이 업데이트는 눈에 띄는 차이를 만들지 않습니다. 그러나 ulysses64한 번에 매우 빠르게 수행 되었으므로 현재의 리더입니다.

— Andriy Makukha

내 PC의 고유 한 문제 여야합니다. :)이 새로운 출력 알고리즘을 사용할 때 5 초 속도가

— 빨라짐

2

제이

9!:37 ] 0 _ _ _

'input k' =: _2 {. ARGV
k =: ". k

lower =: a. {~ 97 + i. 26
words =: ((lower , ' ') {~ lower i. ]) (32&OR)&.(a.&i.) fread input
words =: ' ' , words
words =: -.&(s: a:) s: words
uniq =: ~. words
res =: (k <. # uniq) {. \:~ (# , {.)/.~ uniq&i. words
echo@(,&": ' ' , [: }.@": {&uniq)/"1 res

exit 0

로 스크립트를 실행하십시오 jconsole <script> <input> <k>. 예를 들어 giganovelwith 의 출력은 다음 과 k=100K같습니다.

$ time jconsole solve.ijs giganovel 100000 | head 
11309 e
11290 ihit
11285 ah
11260 ist
11255 aa
11202 aiv
11201 al
11188 an
11187 o
11186 ansa

real    0m13.765s
user    0m11.872s
sys     0m1.786s

사용 가능한 시스템 메모리의 양을 제외하고는 제한이 없습니다.

— 마일
소스

더 작은 테스트 케이스에 매우 빠릅니다! 좋은! 그러나 임의로 큰 단어의 경우 출력에서 단어를 자릅니다. 단어의 문자 수에 제한이 있는지 또는 출력을보다 간결하게 만드는지 확실하지 않습니다.

— Andriy Makukha

@AndriyMakukha 그래, ...줄 당 출력 잘림으로 인해 발생합니다. 시작 부분에 한 줄을 추가하여 모든 잘림을 비활성화했습니다. 고유 한 단어가 많을수록 더 많은 메모리를 사용하므로 giganovel에서 속도가 느려집니다.

— 마일

큰! 이제 일반성 테스트를 통과했습니다. 그리고 내 컴퓨터 속도가 느려지지 않았습니다. 실제로 약간의 속도 향상이있었습니다.

— Andriy Makukha

2

C ++ (라 크 누스)

Knuth의 프로그램이 어떻게 작동하는지 궁금해서 그의 원래의 Pascal 프로그램을 C ++로 번역했습니다.

Knuth의 주요 목표는 속도가 아니라 웹 프로그래밍 시스템의 WEB 시스템을 설명하는 것이었지만,이 프로그램은 놀랍도록 경쟁력이 있으며 지금까지의 답변보다 빠른 솔루션을 제공합니다. 다음은 그의 프로그램을 번역 한 것입니다 (웹 프로그램의 해당 "섹션"번호는 " {§24}" 와 같은 주석에 언급되어 있습니다 ).

#include <iostream>
#include <cassert>

// Adjust these parameters based on input size.
const int TRIE_SIZE = 800 * 1000; // Size of the hash table used for the trie.
const int ALPHA = 494441;  // An integer that's approximately (0.61803 * TRIE_SIZE), and relatively prime to T = TRIE_SIZE - 52.
const int kTolerance = TRIE_SIZE / 100;  // How many places to try, to find a new place for a "family" (=bunch of children).

typedef int32_t Pointer;  // [0..TRIE_SIZE), an index into the array of Nodes
typedef int8_t Char;  // We only care about 1..26 (plus two values), but there's no "int5_t".
typedef int32_t Count;  // The number of times a word has been encountered.
// These are 4 separate arrays in Knuth's implementation.
struct Node {
  Pointer link;  // From a parent node to its children's "header", or from a header back to parent.
  Pointer sibling;  // Previous sibling, cyclically. (From smallest child to header, and header to largest child.)
  Count count;  // The number of times this word has been encountered.
  Char ch;  // EMPTY, or 1..26, or HEADER. (For nodes with ch=EMPTY, the link/sibling/count fields mean nothing.)
} node[TRIE_SIZE + 1];
// Special values for `ch`: EMPTY (free, can insert child there) and HEADER (start of family).
const Char EMPTY = 0, HEADER = 27;

const Pointer T = TRIE_SIZE - 52;
Pointer x;  // The `n`th time we need a node, we'll start trying at x_n = (alpha * n) mod T. This holds current `x_n`.
// A header can only be in T (=TRIE_SIZE-52) positions namely [27..TRIE_SIZE-26].
// This transforms a "h" from range [0..T) to the above range namely [27..T+27).
Pointer rerange(Pointer n) {
  n = (n % T) + 27;
  // assert(27 <= n && n <= TRIE_SIZE - 26);
  return n;
}

// Convert trie node to string, by walking up the trie.
std::string word_for(Pointer p) {
  std::string word;
  while (p != 0) {
    Char c = node[p].ch;  // assert(1 <= c && c <= 26);
    word = static_cast<char>('a' - 1 + c) + word;
    // assert(node[p - c].ch == HEADER);
    p = (p - c) ? node[p - c].link : 0;
  }
  return word;
}

// Increment `x`, and declare `h` (the first position to try) and `last_h` (the last position to try). {§24}
#define PREPARE_X_H_LAST_H x = (x + ALPHA) % T; Pointer h = rerange(x); Pointer last_h = rerange(x + kTolerance);
// Increment `h`, being careful to account for `last_h` and wraparound. {§25}
#define INCR_H { if (h == last_h) { std::cerr << "Hit tolerance limit unfortunately" << std::endl; exit(1); } h = (h == TRIE_SIZE - 26) ? 27 : h + 1; }

// `p` has no children. Create `p`s family of children, with only child `c`. {§27}
Pointer create_child(Pointer p, int8_t c) {
  // Find `h` such that there's room for both header and child c.
  PREPARE_X_H_LAST_H;
  while (!(node[h].ch == EMPTY and node[h + c].ch == EMPTY)) INCR_H;
  // Now create the family, with header at h and child at h + c.
  node[h]     = {.link = p, .sibling = h + c, .count = 0, .ch = HEADER};
  node[h + c] = {.link = 0, .sibling = h,     .count = 0, .ch = c};
  node[p].link = h;
  return h + c;
}

// Move `p`'s family of children to a place where child `c` will also fit. {§29}
void move_family_for(const Pointer p, Char c) {
  // Part 1: Find such a place: need room for `c` and also all existing children. {§31}
  PREPARE_X_H_LAST_H;
  while (true) {
    INCR_H;
    if (node[h + c].ch != EMPTY) continue;
    Pointer r = node[p].link;
    int delta = h - r;  // We'd like to move each child by `delta`
    while (node[r + delta].ch == EMPTY and node[r].sibling != node[p].link) {
      r = node[r].sibling;
    }
    if (node[r + delta].ch == EMPTY) break;  // There's now space for everyone.
  }

  // Part 2: Now actually move the whole family to start at the new `h`.
  Pointer r = node[p].link;
  int delta = h - r;
  do {
    Pointer sibling = node[r].sibling;
    // Move node from current position (r) to new position (r + delta), and free up old position (r).
    node[r + delta] = {.ch = node[r].ch, .count = node[r].count, .link = node[r].link, .sibling = node[r].sibling + delta};
    if (node[r].link != 0) node[node[r].link].link = r + delta;
    node[r].ch = EMPTY;
    r = sibling;
  } while (node[r].ch != EMPTY);
}

// Advance `p` to its `c`th child. If necessary, add the child, or even move `p`'s family. {§21}
Pointer find_child(Pointer p, Char c) {
  // assert(1 <= c && c <= 26);
  if (p == 0) return c;  // Special case for first char.
  if (node[p].link == 0) return create_child(p, c);  // If `p` currently has *no* children.
  Pointer q = node[p].link + c;
  if (node[q].ch == c) return q;  // Easiest case: `p` already has a `c`th child.
  // Make sure we have room to insert a `c`th child for `p`, by moving its family if necessary.
  if (node[q].ch != EMPTY) {
    move_family_for(p, c);
    q = node[p].link + c;
  }
  // Insert child `c` into `p`'s family of children (at `q`), with correct siblings. {§28}
  Pointer h = node[p].link;
  while (node[h].sibling > q) h = node[h].sibling;
  node[q] = {.ch = c, .count = 0, .link = 0, .sibling = node[h].sibling};
  node[h].sibling = q;
  return q;
}

// Largest descendant. {§18}
Pointer last_suffix(Pointer p) {
  while (node[p].link != 0) p = node[node[p].link].sibling;
  return p;
}

// The largest count beyond which we'll put all words in the same (last) bucket.
// We do an insertion sort (potentially slow) in last bucket, so increase this if the program takes a long time to walk trie.
const int MAX_BUCKET = 10000;
Pointer sorted[MAX_BUCKET + 1];  // The head of each list.

// Records the count `n` of `p`, by inserting `p` in the list that starts at `sorted[n]`.
// Overwrites the value of node[p].sibling (uses the field to mean its successor in the `sorted` list).
void record_count(Pointer p) {
  // assert(node[p].ch != HEADER);
  // assert(node[p].ch != EMPTY);
  Count f = node[p].count;
  if (f == 0) return;
  if (f < MAX_BUCKET) {
    // Insert at head of list.
    node[p].sibling = sorted[f];
    sorted[f] = p;
  } else {
    Pointer r = sorted[MAX_BUCKET];
    if (node[p].count >= node[r].count) {
      // Insert at head of list
      node[p].sibling = r;
      sorted[MAX_BUCKET] = p;
    } else {
      // Find right place by count. This step can be SLOW if there are too many words with count >= MAX_BUCKET
      while (node[p].count < node[node[r].sibling].count) r = node[r].sibling;
      node[p].sibling = node[r].sibling;
      node[r].sibling = p;
    }
  }
}

// Walk the trie, going over all words in reverse-alphabetical order. {§37}
// Calls "record_count" for each word found.
void walk_trie() {
  // assert(node[0].ch == HEADER);
  Pointer p = node[0].sibling;
  while (p != 0) {
    Pointer q = node[p].sibling;  // Saving this, as `record_count(p)` will overwrite it.
    record_count(p);
    // Move down to last descendant of `q` if any, else up to parent of `q`.
    p = (node[q].ch == HEADER) ? node[q].link : last_suffix(q);
  }
}

int main(int, char** argv) {
  // Program startup
  std::ios::sync_with_stdio(false);

  // Set initial values {§19}
  for (Char i = 1; i <= 26; ++i) node[i] = {.ch = i, .count = 0, .link = 0, .sibling = i - 1};
  node[0] = {.ch = HEADER, .count = 0, .link = 0, .sibling = 26};

  // read in file contents
  FILE *fptr = fopen(argv[1], "rb");
  fseek(fptr, 0L, SEEK_END);
  long dataLength = ftell(fptr);
  rewind(fptr);
  char* data = (char*)malloc(dataLength);
  fread(data, 1, dataLength, fptr);
  if (fptr) fclose(fptr);

  // Loop over file contents: the bulk of the time is spent here.
  Pointer p = 0;
  for (int i = 0; i < dataLength; ++i) {
    Char c = (data[i] | 32) - 'a' + 1;  // 1 to 26, for 'a' to 'z' or 'A' to 'Z'
    if (1 <= c && c <= 26) {
      p = find_child(p, c);
    } else {
      ++node[p].count;
      p = 0;
    }
  }
  node[0].count = 0;

  walk_trie();

  const int max_words_to_print = atoi(argv[2]);
  int num_printed = 0;
  for (Count f = MAX_BUCKET; f >= 0 && num_printed <= max_words_to_print; --f) {
    for (Pointer p = sorted[f]; p != 0 && num_printed < max_words_to_print; p = node[p].sibling) {
      std::cout << word_for(p) << " " << node[p].count << std::endl;
      ++num_printed;
    }
  }

  return 0;
}

크 누스 프로그램과의 차이점 :

내가 크 누스의 4 개 배열을 결합 link, sibling, count및 ch(A)의 배열로struct Node (쉽게이 방법을 이해 찾기).
나는 문단의 프로그래밍 언어 (WEB- 스타일) 텍스트 번역을보다 일반적인 함수 호출 (및 두 개의 매크로)로 변경했다.
우리가 사용하므로, 표준 파스칼의 이상한 I / O 규칙 / 제한을 사용할 필요가 없습니다 fread및data[i] | 32 - 'a' 대신 파스칼 해결 방법으로, 여기에 다른 답변에서와 같이.
프로그램이 실행되는 동안 한계가 초과되면 (공간 부족) Knuth의 원래 프로그램은 이후 단어를 삭제하고 마지막에 메시지를 인쇄하여 우아하게 처리합니다. McIlroy가 "성경의 전체 텍스트를 처리 할 수 없다고 Knuth의 해결책을 비판했다"고 말하는 것은 옳지 않은 일입니다. "성경에서 오류 상태는 무해하지 않습니다.) 나는 단순히 프로그램을 종료하는 더 시끄러운 접근 방식을 취했습니다.
이 프로그램은 상수 TRIE_SIZE를 선언하여 메모리 사용량을 제어합니다. (32767의 상수는 원래 요구 사항에 따라 선택되었습니다. "사용자는 20 페이지 기술 논문 (대략 50K 바이트 파일)에서 가장 자주 100 개의 단어를 찾을 수 있어야합니다.") 테스트 입력이 현재 2 천만 배 더 커짐에 따라 25 배에서 800,000 개로 늘려야했습니다.)
줄의 마지막 인쇄를 위해, 우리는 단지 trie를 걷고 멍청한 (2 차적인) 문자열 추가를 할 수 있습니다.

그 외에도, 이것은 해시 트라이 / 팩 트리 데이터 구조 및 버킷 정렬을 사용하는 Knuth의 프로그램과 거의 동일하며 입력의 모든 문자를 반복하면서 Knuth의 파스칼 프로그램과 거의 동일한 작업을 수행합니다. 외부 알고리즘이나 데이터 구조 라이브러리를 사용하지 않으며 동일한 빈도의 단어가 알파벳 순서로 인쇄됩니다.

타이밍

로 컴파일

clang++ -std=c++17 -O2 ptrie-walktrie.cc

여기에서 가장 큰 테스트 사례 ( giganovel10 만 단어 요청)에서 실행하고 지금까지 게시 된 가장 빠른 프로그램과 비교할 때 약간 있지만 지속적으로 빠릅니다.

target/release/frequent:   4.809 ±   0.263 [ 4.45.. 5.62]        [... 4.63 ...  4.75 ...  4.88...]
ptrie-walktrie:            4.547 ±   0.164 [ 4.35.. 4.99]        [... 4.42 ...   4.5 ...  4.68...]

(맨 위 줄은 Anders Kaseorg의 Rust 솔루션입니다. 맨 아래 줄은 위의 프로그램입니다. 이들은 평균, 최소, 최대, 중앙값 및 사 분위수를 포함한 100 회 실행의 타이밍입니다.)

분석

왜 이것이 더 빠릅니까? C ++이 Rust보다 빠르거나 Knuth의 프로그램이 가장 빠르다는 것은 아닙니다. 실제로 Knuth의 프로그램은 (메모리를 보존하기 위해) trie-packing으로 인해 삽입이 느립니다. 그 이유는 Knuth 가 2008 년에 불평 한 것과 관련이 있다고 생각합니다 .

64 비트 포인터에 대한 불꽃

4 기가 바이트 미만의 RAM을 사용하는 프로그램을 컴파일 할 때 64 비트 포인터를 갖는 것은 절대적으로 바보입니다. 이러한 포인터 값이 구조체 내부에 나타나면 메모리 절반을 낭비 할뿐만 아니라 캐시의 절반을 효과적으로 버립니다.

위의 프로그램은 32 비트 배열 인덱스 (64 비트 포인터가 아님)를 사용하므로 "Node"구조체는 메모리를 덜 차지하므로 스택에 더 많은 노드가 있고 캐시 누락이 줄어 듭니다. (사실, x32 ABI 로서 이것 에 대한 연구 가 있었지만 , 좋은 상태 가 아닌 것 같습니다. 아이디어는 예를 참조 분명 유용에도 불구하고 최근 발표 의 V8에서 포인터 압축을 . 아 글쎄.) 등giganovel ,이 프로그램은 (포장 된) 트라이에 12.8MB를 사용하고 Rust 프로그램의 트라이에 32.18MB를 사용합니다 (giganovel ). "giganovel"에서 "teranovel"로 1000 배까지 확장 할 수 있지만 여전히 32 비트 인덱스를 초과 할 수 없으므로 이는 합리적인 선택입니다.

더 빠른 변형

속도를 최적화하고 패킹을 포기할 수 있으므로 Rust 솔루션에서와 같이 (포장되지 않은) 트라이를 실제로 포인터 대신 인덱스와 함께 사용할 수 있습니다. 이것은 더 빠른 것을 제공합니다 하고 별개의 단어, 문자 등의 수에 미리 고정 된 제한이 없습니다 :

#include <iostream>
#include <cassert>
#include <vector>
#include <algorithm>

typedef int32_t Pointer;  // [0..node.size()), an index into the array of Nodes
typedef int32_t Count;
typedef int8_t Char;  // We'll usually just have 1 to 26.
struct Node {
  Pointer link;  // From a parent node to its children's "header", or from a header back to parent.
  Count count;  // The number of times this word has been encountered. Undefined for header nodes.
};
std::vector<Node> node; // Our "arena" for Node allocation.

std::string word_for(Pointer p) {
  std::vector<char> drow;  // The word backwards
  while (p != 0) {
    Char c = p % 27;
    drow.push_back('a' - 1 + c);
    p = (p - c) ? node[p - c].link : 0;
  }
  return std::string(drow.rbegin(), drow.rend());
}

// `p` has no children. Create `p`s family of children, with only child `c`.
Pointer create_child(Pointer p, Char c) {
  Pointer h = node.size();
  node.resize(node.size() + 27);
  node[h] = {.link = p, .count = -1};
  node[p].link = h;
  return h + c;
}

// Advance `p` to its `c`th child. If necessary, add the child.
Pointer find_child(Pointer p, Char c) {
  assert(1 <= c && c <= 26);
  if (p == 0) return c;  // Special case for first char.
  if (node[p].link == 0) return create_child(p, c);  // Case 1: `p` currently has *no* children.
  return node[p].link + c;  // Case 2 (easiest case): Already have the child c.
}

int main(int, char** argv) {
  auto start_c = std::clock();

  // Program startup
  std::ios::sync_with_stdio(false);

  // read in file contents
  FILE *fptr = fopen(argv[1], "rb");
  fseek(fptr, 0, SEEK_END);
  long dataLength = ftell(fptr);
  rewind(fptr);
  char* data = (char*)malloc(dataLength);
  fread(data, 1, dataLength, fptr);
  fclose(fptr);

  node.reserve(dataLength / 600);  // Heuristic based on test data. OK to be wrong.
  node.push_back({0, 0});
  for (Char i = 1; i <= 26; ++i) node.push_back({0, 0});

  // Loop over file contents: the bulk of the time is spent here.
  Pointer p = 0;
  for (long i = 0; i < dataLength; ++i) {
    Char c = (data[i] | 32) - 'a' + 1;  // 1 to 26, for 'a' to 'z' or 'A' to 'Z'
    if (1 <= c && c <= 26) {
      p = find_child(p, c);
    } else {
      ++node[p].count;
      p = 0;
    }
  }
  ++node[p].count;
  node[0].count = 0;

  // Brute-force: Accumulate all words and their counts, then sort by frequency and print.
  std::vector<std::pair<int, std::string>> counts_words;
  for (Pointer i = 1; i < static_cast<Pointer>(node.size()); ++i) {
    int count = node[i].count;
    if (count == 0 || i % 27 == 0) continue;
    counts_words.push_back({count, word_for(i)});
  }
  auto cmp = [](auto x, auto y) {
    if (x.first != y.first) return x.first > y.first;
    return x.second < y.second;
  };
  std::sort(counts_words.begin(), counts_words.end(), cmp);
  const int max_words_to_print = std::min<int>(counts_words.size(), atoi(argv[2]));
  for (int i = 0; i < max_words_to_print; ++i) {
    auto [count, word] = counts_words[i];
    std::cout << word << " " << count << std::endl;
  }

  return 0;
}

이 프로그램은 여기에있는 솔루션보다 정렬을 위해 많은 어려움을 겪고 있지만 ( giganovel 어려움을 겪고 12.2MB 만 하고 더 빠릅니다. 앞서 언급 한 타이밍과 비교 한이 프로그램의 타이밍 (마지막 라인) :

target/release/frequent:   4.809 ±   0.263 [ 4.45.. 5.62]        [... 4.63 ...  4.75 ...  4.88...]
ptrie-walktrie:            4.547 ±   0.164 [ 4.35.. 4.99]        [... 4.42 ...   4.5 ...  4.68...]
itrie-nolimit:             3.907 ±   0.127 [ 3.69.. 4.23]        [... 3.81 ...   3.9 ...   4.0...]

Rust로 번역 하면이 (또는 해시 트리 프로그램)가 무엇을 원하는지보고 싶어합니다. . :-)

자세한 내용

여기에 사용 된 데이터 구조 : "패킹"시도에 대한 설명은 TAOCP 3 권 6.3 절 (디지털 검색, 즉 시도)의 연습 4와 TeX의 하이픈에 대한 Knuth의 학생 Frank Liang의 논문에서 간결하게 설명됩니다. : 컴퓨터로 단어 하이 펜션 .
벤틀리 칼럼, 크 누스 프로그램, 맥 일로이의 리뷰 (유닉스 철학에 대한 부분 만)는 이전과 이후의 칼럼 과 컴파일러, TAOCP, TeX를 포함한 Knuth의 이전 경험 에 비추어 더 명확 합니다.
전체 도서있다 프로그래밍 스타일에 연습이 특정 프로그램에 대한 다양한 접근 방식을 보여주는 .

위의 사항에 대해 자세히 설명하지 않은 블로그 게시물이 있습니다. 완료되면이 답변을 편집 할 수 있습니다. 한편, Knuth의 생일 행사 (1 월 10 일)에이 답변을 여기에 게시하십시오. :-)

— 슈 리바트 사
소스

대박! 누군가가 Knuth의 솔루션을 게시했을뿐 아니라 (파스칼로), 이전의 최고의 게시물 중 일부를 능가하는 뛰어난 분석 및 성능을 제공했을뿐만 아니라 다른 C ++ 프로그램으로 속도에 대한 새로운 기록을 세웠습니다! 훌륭한.

— Andriy Makukha

내가 가지고있는 유일한 두 가지 의견 : 1) 현재 두 번째 프로그램 Segmentation fault: 11이 임의로 큰 단어와 간격이있는 테스트 사례 에서 실패합니다 . 2) McIlroy의 "비평"을 동정한다고 생각할지라도 Knuth의 의도는 그의 문맹 프로그래밍 기술을 과시하는 것임을 잘 알고 있지만 McIlroy는 공학적 관점에서 그것을 비판했습니다. McIlroy 자신은 나중에 그것이 공정한 일이 아니라고 인정했습니다.

— 안드리 마 쿠카

@AndriyMakukha 아 죄송합니다, 그것은 재귀 적이었습니다 word_for. 지금 고쳤다. 예 Unix 파이프의 발명가 인 McIlroy는 작은 도구를 구성 하는 Unix 철학 을 복음화 할 기회를 얻었습니다 . Knuth의 실망스럽게 (프로그램을 읽으려고한다면) 모 놀리 식 접근법과 비교할 때 좋은 철학이지만, 문맥 상 그것은 약간 불공평합니다. 또 다른 이유로도 있습니다 : 오늘날 Unix 방식은 널리 이용 가능하지만 1986 년에 국한되었습니다. Bell Labs, Berkeley 등에 게 (“그 회사는 사업에서 최고의 조립식을 만든다”)

— ShreevatsaR

공장! 새로운 왕에게 축하를 전한다 : -P 유닉스와 크 누스에 관해서는, 그는 다른 도구들 사이에 통일성이 거의없고 (예를 들어, 많은 도구들이 정규식을 다르게 정의하기 때문에) 시스템을 좋아하지 않는 것처럼 보였다.

— Andriy Makukha

1

파이썬 3

간단한 사전 으로이 구현은 Counter내 시스템에서 사용하는 것보다 약간 빠릅니다 .

def words_from_file(filename):
    import re

    pattern = re.compile('[a-z]+')

    for line in open(filename):
        yield from pattern.findall(line.lower())


def freq(textfile, k):
    frequencies = {}

    for word in words_from_file(textfile):
        frequencies[word] = frequencies.get(word, 0) + 1

    most_frequent = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

    for i, (word, frequency) in enumerate(most_frequent):
        if i == k:
            break

        yield word, frequency


from time import time

start = time()
print('\n'.join('{}:\t{}'.format(f, w) for w,f in freq('giganovel', 10)))
end = time()
print(end - start)

— 모바 티카
소스

1

시스템에서 giganovel으로 만 테스트 할 수 있었고 꽤 오랜 시간이 걸립니다 (~ 90 초). gutenbergproject는 법적 이유로 독일에서 차단되었습니다 ...

— movatica

흥미 롭군 그것은 하나의 heapq받는 모든 성능을 추가하지 않는 Counter.most_common방법, 또는 enumerate(sorted(...))도 사용 heapq내부적.

— Andriy Makukha

필자는 Python 2로 테스트했으며 성능이 비슷했기 때문에 정렬이 ~만큼 빠르다고 생각 Counter.most_common합니다.

— Andriy Makukha

네, 아마도 내 시스템에 지터가 있었을 것입니다 ... 적어도 그것은 느리지 않습니다 :) 그러나 정규 표현식 검색은 문자를 반복하는 것보다 훨씬 빠릅니다. 꽤 성능이 좋은 것으로 보입니다.

— movatica

1

[C] 접두사 트리 + 정렬 된 연결 목록

입력으로 두 개의 인수가 필요합니다 (텍스트 파일 경로 및 k 개의 가장 빈번한 단어 나열)

내 다른 항목을 기준 으로이 버전은 k 값이 클수록 훨씬 빠르지 만 k 값이 낮을 때 성능이 약간 떨어집니다.

단어 글자로 분기 된 트리를 만든 다음 리프 글자에서 카운터를 증가시킵니다. 그런 다음 현재 리프 카운터가 가장 빈번한 단어 목록에서 가장 작은 단어보다 큰지 확인합니다. (목록 크기는 명령 행 인수를 통해 결정된 숫자입니다.) 그렇다면 리프 문자로 표시되는 단어를 가장 자주 사용하도록 승격하십시오. 이미 가장 빈번한 단어 인 경우 단어 수가 더 많으면 가장 자주 다음 단어로 바꾸어 목록을 정렬하십시오. 더 이상 글자를 읽을 때까지이 과정이 모두 반복됩니다. 그 후 가장 빈번한 단어 목록이 출력됩니다.

현재 처리 시간을 출력하도록 기본 설정되어 있지만 다른 제출과의 일관성을 위해 소스 코드에서 TIMING 정의를 비활성화하십시오.

// comment out TIMING if using external program timing mechanism
#define TIMING 1

// may need to increase if the source text has many unique words
#define MAX_LETTER_INSTANCES 1000000

#define false 0
#define true 1
#define null 0

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef TIMING
#include <sys/time.h>
#endif

struct Letter
{
    char isTopWord;
    struct Letter* parent;
    struct Letter* higher;
    struct Letter* lower;
    char asciiCode;
    unsigned int count;
    struct Letter* nextLetters[26];
};
typedef struct Letter Letter;

int main(int argc, char *argv[]) 
{
#ifdef TIMING
    struct timeval tv1, tv2;
    gettimeofday(&tv1, null);
#endif

    int k;
    if (argc !=3 || (k = atoi(argv[2])) <= 0)
    {
        printf("Usage:\n");
        printf("      WordCount <input file path> <number of most frequent words to find>\n\n");
        return -1;
    }

    long  file_size;
    long dataLength;
    char* data;

    // read in file contents
    FILE *fptr;
    size_t read_s = 0;  
    fptr = fopen(argv[1], "rb");
    fseek(fptr, 0L, SEEK_END);
    dataLength = ftell(fptr);
    rewind(fptr);
    data = (char*)malloc((dataLength));
    read_s = fread(data, 1, dataLength, fptr);
    if (fptr) fclose(fptr);

    unsigned int chr;
    unsigned int i;

    // working memory of letters
    Letter* letters = (Letter*) malloc(sizeof(Letter) * MAX_LETTER_INSTANCES);
    memset(&letters[0], 0, sizeof( Letter) * MAX_LETTER_INSTANCES);

    // the index of the next unused letter
    unsigned int letterMasterIndex=0;

    // pesudo letter representing the starting point of any word
    Letter* root = &letters[letterMasterIndex++];

    // the current letter in the word being processed
    Letter* currentLetter = root;

    // the next letter to be processed
    Letter* nextLetter = null;
    Letter* sortedWordsStart = null;
    Letter* sortedWordsEnd = null;
    Letter* A;
    Letter* B;
    Letter* C;
    Letter* D;

    unsigned int sortedListSize = 0;


    unsigned int lowestWordCount = 0;
    unsigned int lowestWordIndex = 0;
    unsigned int highestWordCount = 0;
    unsigned int highestWordIndex = 0;

    // main loop
    for (int j=0;j<dataLength;j++)
    {
        chr = data[j]|0x20; // convert to lower case

        // is a letter?
        if (chr > 96 && chr < 123)
        {
            chr-=97; // translate to be zero indexed
            nextLetter = currentLetter->nextLetters[chr];

            // this is a new letter at this word length, intialise the new letter
            if (nextLetter == null)
            {
                nextLetter = &letters[letterMasterIndex++];
                nextLetter->parent = currentLetter;
                nextLetter->asciiCode = chr;
                currentLetter->nextLetters[chr] = nextLetter;
            }

            currentLetter = nextLetter;
        }
        // not a letter so this means the current letter is the last letter of a word (if any letters)
        else if (currentLetter!=root)
        {

            // increment the count of the full word that this letter represents
            ++currentLetter->count;

            // is this word not in the top word list?
            if (!currentLetter->isTopWord)
            {
                // first word becomes the sorted list
                if (sortedWordsStart == null)
                {
                  sortedWordsStart = currentLetter;
                  sortedWordsEnd = currentLetter;
                  currentLetter->isTopWord = true;
                  ++sortedListSize;
                }
                // always add words until list is at desired size, or 
                // swap the current word with the end of the sorted word list if current word count is larger
                else if (sortedListSize < k || currentLetter->count> sortedWordsEnd->count)
                {
                    // replace sortedWordsEnd entry with current word
                    if (sortedListSize == k)
                    {
                      currentLetter->higher = sortedWordsEnd->higher;
                      currentLetter->higher->lower = currentLetter;
                      sortedWordsEnd->isTopWord = false;
                    }
                    // add current word to the sorted list as the sortedWordsEnd entry
                    else
                    {
                      ++sortedListSize;
                      sortedWordsEnd->lower = currentLetter;
                      currentLetter->higher = sortedWordsEnd;
                    }

                    currentLetter->lower = null;
                    sortedWordsEnd = currentLetter;
                    currentLetter->isTopWord = true;
                }
            }
            // word is in top list
            else
            {
                // check to see whether the current word count is greater than the supposedly next highest word in the list
                // we ignore the word that is sortedWordsStart (i.e. most frequent)
                while (currentLetter != sortedWordsStart && currentLetter->count> currentLetter->higher->count)
                {
                    B = currentLetter->higher;
                    C = currentLetter;
                    A = B != null ? currentLetter->higher->higher : null;
                    D = currentLetter->lower;

                    if (A !=null) A->lower = C;
                    if (D !=null) D->higher = B;
                    B->higher = C;
                    C->higher = A;
                    B->lower = D;
                    C->lower = B;

                    if (B == sortedWordsStart)
                    {
                      sortedWordsStart = C;
                    }

                    if (C == sortedWordsEnd)
                    {
                      sortedWordsEnd = B;
                    }
                }
            }

            // reset the letter path representing the word
            currentLetter = root;
        }
    }

    // print out the top frequent words and counts
    char string[256];
    char tmp[256];

    Letter* letter;
    while (sortedWordsStart != null )
    {
        letter = sortedWordsStart;
        highestWordCount = letter->count;
        string[0]=0;
        tmp[0]=0;

        if (highestWordCount > 0)
        {
            // construct string of letters to form the word
            while (letter != root)
            {
                memmove(&tmp[1],&string[0],255);
                tmp[0]=letter->asciiCode+97;
                memmove(&string[0],&tmp[0],255);
                letter=letter->parent;
            }

            printf("%u %s\n",highestWordCount,string);
        }
        sortedWordsStart = sortedWordsStart->lower;
    }

    free( data );
    free( letters );

#ifdef TIMING   
    gettimeofday(&tv2, null);
    printf("\nTime Taken: %f seconds\n", (double) (tv2.tv_usec - tv1.tv_usec)/1000000 + (double) (tv2.tv_sec - tv1.tv_sec));
#endif
    return 0;
}

— 무기
소스

k = 100,000 :에 대해 정렬되지 않은 출력을 리턴합니다 12 eroilk 111 iennoa 10 yttelen 110 engyt.

— Andriy Makukha

나는 그 이유에 대한 아이디어가 있다고 생각합니다. 내 생각은 현재 단어의 다음으로 높은 단어를 확인할 때 목록에서 스왑 단어를 반복해야한다는 것입니다. 시간이 있으면 확인합니다

— Moogie

흠 잘하면 if를 변경하는 간단한 수정은 효과가 있지만 k 값이 클수록 알고리즘 속도가 크게 느려집니다. 더 영리한 해결책을 생각해야 할 수도 있습니다.

— Moogie

1

씨#

이것은 최신 .net SDK 에서 작동합니다 .

using System;
using System.IO;
using System.Diagnostics;
using System.Collections.Generic;
using System.Linq;
using static System.Console;

class Node {
    public Node Parent;
    public Node[] Nodes;
    public int Index;
    public int Count;

    public static readonly List<Node> AllNodes = new List<Node>();

    public Node(Node parent, int index) {
        this.Parent = parent;
        this.Index = index;
        AllNodes.Add(this);
    }

    public Node Traverse(uint u) {
        int b = (int)u;
        if (this.Nodes is null) {
            this.Nodes = new Node[26];
            return this.Nodes[b] = new Node(this, b);
        }
        if (this.Nodes[b] is null) return this.Nodes[b] = new Node(this, b);
        return this.Nodes[b];
    }

    public string GetWord() => this.Index >= 0 
        ? this.Parent.GetWord() + (char)(this.Index + 97)
        : "";
}

class Freq {
    const int DefaultBufferSize = 0x10000;

    public static void Main(string[] args) {
        var sw = Stopwatch.StartNew();

        if (args.Length < 2) {
            WriteLine("Usage: freq.exe {filename} {k} [{buffersize}]");
            return;
        }

        string file = args[0];
        int k = int.Parse(args[1]);
        int bufferSize = args.Length >= 3 ? int.Parse(args[2]) : DefaultBufferSize;

        Node root = new Node(null, -1) { Nodes = new Node[26] }, current = root;
        int b;
        uint u;

        using (var fr = new FileStream(file, FileMode.Open))
        using (var br = new BufferedStream(fr, bufferSize)) {
            outword:
                b = br.ReadByte() | 32;
                if ((u = (uint)(b - 97)) >= 26) {
                    if (b == -1) goto done; 
                    else goto outword;
                }
                else current = root.Traverse(u);
            inword:
                b = br.ReadByte() | 32;
                if ((u = (uint)(b - 97)) >= 26) {
                    if (b == -1) goto done;
                    ++current.Count;
                    goto outword;
                }
                else {
                    current = current.Traverse(u);
                    goto inword;
                }
            done:;
        }

        WriteLine(string.Join("\n", Node.AllNodes
            .OrderByDescending(count => count.Count)
            .Take(k)
            .Select(node => node.GetWord())));

        WriteLine("Self-measured milliseconds: {0}", sw.ElapsedMilliseconds);
    }
}

다음은 샘플 출력입니다.

C:\dev\freq>csc -o -nologo freq-trie.cs && freq-trie.exe giganovel 100000
e
ihit
ah
ist
 [... omitted for sanity ...]
omaah
aanhele
okaistai
akaanio
Self-measured milliseconds: 13619

처음에는 문자열 키가있는 사전을 사용하려고했지만 너무 느 렸습니다. .net 문자열은 내부적으로 2 바이트 인코딩으로 표현되어 있기 때문에이 응용 프로그램에 낭비가되기 때문입니다. 그런 다음 방금 순수한 바이트와 추악한 goto 스타일 상태 시스템으로 전환했습니다. 대소 문자 변환은 비트 연산자입니다. 빼기 후 단일 비교에서 문자 범위 검사가 수행됩니다. 런타임의 0.1 % 미만을 사용하고 있기 때문에 최종 정렬을 최적화하는 데 노력을 기울이지 않았습니다.

수정 : 알고리즘은 본질적으로 정확했지만 단어의 모든 접두사를 세어 총 단어를 과도하게보고했습니다. 총 단어 수는 문제의 요구 사항이 아니기 때문에 해당 출력을 제거했습니다. 모든 k 단어를 출력하기 위해 출력도 조정했습니다. 결국 string.Join()에는 전체 목록 을 사용 하고 한 번에 작성하기로 결정했습니다. 놀랍게도 이것은 100k 동안 각 단어를 개별적으로 쓰는 내 컴퓨터에서 약 1 초 더 빠릅니다.

— 재귀
소스

1

매우 인상적! 나는 당신의 비트 tolower및 단일 비교 트릭을 좋아합니다 . 그러나 귀하의 프로그램이 왜 예상보다 더 명확한 단어를보고하는지 이해가 가지 않습니다. 또한 원래 문제 설명에 따르면 프로그램은 모든 k 단어를 빈도의 감소 순서로 출력해야하므로 10 개의 가장 빈번한 단어를 출력 해야하는 마지막 테스트에 프로그램을 세지 않았습니다.

— Andriy Makukha

@AndriyMakukha : 나는 또한 최종 카운트에서 결코 발생하지 않은 단어 접두사를 세고 있음을 알 수 있습니다. 콘솔 출력이 Windows에서 매우 느리기 때문에 모든 출력을 쓰지 않았습니다. 출력을 파일에 쓸 수 있습니까?

— 재귀

표준 출력으로 인쇄하십시오. k = 10의 경우 모든 시스템에서 빠릅니다. 명령 행에서 출력을 파일로 경로 재 지정할 수도 있습니다. 이것처럼 .

— Andriy Makukha

@AndriyMakukha : 모든 문제를 해결했다고 생각합니다. 많은 런타임 비용없이 필요한 모든 출력을 생성하는 방법을 찾았습니다.

— 재귀

이 출력은 빠르게 타 오르고 있습니다! 아주 좋아요 다른 솔루션과 마찬가지로 빈도 수를 인쇄하도록 프로그램을 수정했습니다.

— Andriy Makukha

1

루비 2.7.0-preview1 with `tally`

최신 버전의 Ruby에는라는 새로운 메소드가 tally있습니다. 로부터 릴리스 노트 :

Enumerable#tally추가됩니다. 각 요소의 발생 횟수를 계산합니다.

["a", "b", "c", "b"].tally
#=> {"a"=>1, "b"=>2, "c"=>1}

이것은 우리를 위해 전체 작업을 거의 해결합니다. 먼저 파일을 읽고 나중에 최대 값을 찾아야합니다.

모든 것이 여기 있습니다 :

k = ARGV.shift.to_i

pp ARGF
  .each_line
  .lazy
  .flat_map { @1.scan(/[A-Za-z]+/).map(&:downcase) }
  .tally
  .max_by(k, &:last)

편집 : k명령 줄 인수로 추가

ruby k filename.rb input.txtRuby 2.7.0-preview1 버전을 사용하여 실행할 수 있습니다 . 릴리스 정보 페이지의 다양한 링크에서 다운로드하거나을 사용하여 rbenv와 함께 설치할 수 있습니다 rbenv install 2.7.0-dev.

내 자신의 구타 오래된 컴퓨터에서 실행 예 :

$ time ruby bentley.rb 10 ulysses64 
[["the", 968832],
 ["of", 528960],
 ["and", 466432],
 ["a", 421184],
 ["to", 322624],
 ["in", 320512],
 ["he", 270528],
 ["his", 213120],
 ["i", 191808],
 ["s", 182144]]

real    0m17.884s
user    0m17.720s
sys 0m0.142s

— 다니에로
소스

1

소스에서 Ruby를 설치했습니다. 그것은 당신의 컴퓨터만큼이나 빠르게 실행됩니다 (15 초 대 17).

— Andriy Makukha

벤틀리의 코딩 과제 : 가장 빈번한 단어

문제 설명

샘플 테스트 사례

참조 구현

테스팅

지금까지의 순위

[씨]

녹

src/main.rs

Cargo.toml

용법

APL (Dyalog Unicode)

[C] 접두사 트리 + 저장소

제이

C ++ (라 크 누스)

타이밍

분석

더 빠른 변형

자세한 내용

파이썬 3

[C] 접두사 트리 + 정렬 된 연결 목록

씨#

루비 2.7.0-preview1 with tally

`src/main.rs`

`Cargo.toml`

루비 2.7.0-preview1 with `tally`