Telegram group chat proalgorithms, page 3228

Given a file with text: just poems, short stories, and novels in English. I want to count word frequencies in the most brute-force way, as an equivalent of this:

LC_ALL="C" awk -F '[^A-Za-z]+' '{ for(i = 1; i <= NF; ++i) if ($i) ++w[tolower($i)] } END { for(i in w) print w[i], i }' $1 | sort -k1gr,2

(i.e. everything other than a-zA-Z counts as whitespace and the rest forms words; the output has two columns: how many times each word occurred, in descending order, and the word itself in lower case)
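For reference, the awk pipeline above amounts to a hash-map count followed by a sort. A minimal sketch of the same idea in Python, assuming the input fits in memory; the names `count_words` and `report` are made up for illustration, and breaking ties alphabetically is a choice made here, not something the chat specifies:

```python
import re
from collections import Counter

# Same token rule as the awk field separator: a word is a run of ASCII letters.
WORD = re.compile(rb"[A-Za-z]+")

def count_words(data: bytes) -> Counter:
    """Count lower-cased runs of ASCII letters in a byte string."""
    return Counter(w.lower() for w in WORD.findall(data))

def report(counts: Counter) -> list[str]:
    """Format rows as 'count word', highest count first."""
    # Ties are broken alphabetically (an assumption of this sketch).
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{n} {w.decode('ascii')}" for w, n in rows]
```

Reading the file in binary mode and passing its contents to `count_words`, then printing the lines returned by `report`, would give output in the same two-column shape as the awk version.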
What approaches are faster than a counting trie? Assume the longest "words" are 70 characters, but only a handful are that long. The top of the output looks like this:

3343241 the
1852717 and
1715705 of
1560152 to
1324244 a
956926 in
933954 i
781286 he
713514 that
690876 was
665710 it
...

but I need not just the top entries, I need all of them. The complete information.
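The chat does not show the counting trie used as the baseline. One plausible reading, sketched here under that assumption, is a trie over the 26 lower-case letters with an occurrence counter on each node; the class name `CountingTrie` and its layout are hypothetical:

```python
class CountingTrie:
    """Trie over 'a'..'z' that stores an occurrence count per node."""

    def __init__(self):
        self.children = [[-1] * 26]  # children[node][letter] -> node index, or -1
        self.count = [0]             # count[node] = times a word ended at this node

    def add(self, word: str) -> None:
        """Insert one lower-case ASCII word and bump its counter."""
        node = 0
        for ch in word:
            c = ord(ch) - ord("a")
            nxt = self.children[node][c]
            if nxt == -1:
                # Allocate a new node at the end of the flat arrays.
                nxt = len(self.count)
                self.children[node][c] = nxt
                self.children.append([-1] * 26)
                self.count.append(0)
            node = nxt
        self.count[node] += 1

    def items(self):
        """Yield (word, count) pairs in alphabetical order (iterative DFS)."""
        stack = [(0, "")]
        while stack:
            node, prefix = stack.pop()
            if self.count[node]:
                yield prefix, self.count[node]
            # Push children in reverse so that 'a' is popped first.
            for c in range(25, -1, -1):
                nxt = self.children[node][c]
                if nxt != -1:
                    stack.append((nxt, prefix + chr(ord("a") + c)))
```

Each insertion costs one step per letter, so the work is proportional to the total number of letters in the text, independent of how many distinct words there are.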

23:13 #11

Anatoly Tomilov in pro.algorithms

Here is the distribution by word length

23:25 #12

Anatoly Tomilov in pro.algorithms

So far I have found this: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96.2143

CiteSeerX — TRASH A dynamic LC-trie and hash data structure

CiteSeerX - Document Details (Isaac Councill, Lee Giles, Pradeep Teregowda): A dynamic LC-trie is currently used in the Linux kernel to implement address lookup in the IP routing table [6, 9]. The main virtue of this data structure is that it supports both fast address lookups and frequent updates of the table. Also, it has an efficient memory management scheme and supports multi-processor architectures using the RCU locking mechanism. The structure scales nicely: the expected number of memory accesses for one lookup is O(log log n), where n is the number of entries in the lookup table. In particular, the time does not depend on the length of the keys, 32-bit IPv4 addresses and 128-bit addresses does not make a difference in this respect. In this article we introduce TRASH, a combination of a dynamic LC-trie and a hash function. TRASH is a general purpose data structure supporting fast lookup, insert and delete operations for arbitrarily long bit strings. TRASH enhances the level-compression part of the LC-trie…

23:42 #13

2020 April 12

lbh in pro.algorithms

Anatoly Tomilov

LC_ALL="C" awk -F '[^A-Za-z]+' '{ for(i = 1; i <= NF; ++i) if ($i) ++w[tolower($i)] } END { for(i in w) print w[i], i }' $1 | sort -k1gr,2

3343241 the
1852717 and
1715705 of
1560152 to
1324244 a
956926 in
933954 i
781286 he
713514 that
690876 was
665710 it
...

but I need not just the top entries, I need all of them. The complete information.

uniq -c ?

00:16 #14

Anatoly Tomilov in pro.algorithms

lbh

uniq -c ?

Cryptic. What did you mean by that?

00:16 #15

lbh in pro.algorithms

That it will count everything for you. Do you need the result or the process itself?

00:17 #16

lbh in pro.algorithms

Besides sort, Linux also has uniq. See man uniq.
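Since uniq -c only collapses adjacent equal lines, this suggestion corresponds to a sort-then-count-runs approach: split the text into one word per item, sort, then count the length of each run. A minimal sketch of that idea in Python, assuming the same a-zA-Z tokenisation as in the question; `count_by_sorting` is a made-up name and the alphabetical tie-break is an assumption:

```python
import re
from itertools import groupby

def count_by_sorting(text: str) -> list[tuple[int, str]]:
    """Count words by sorting them and measuring runs of equal neighbours."""
    # Step 1: tokenise and normalise case (what the tr/awk stage would do).
    words = sorted(w.lower() for w in re.findall(r"[A-Za-z]+", text))
    # Step 2: after sorting, equal words are adjacent; count each run (uniq -c).
    runs = [(sum(1 for _ in grp), word) for word, grp in groupby(words)]
    # Step 3: order by count descending, then by word (tie-break chosen here).
    runs.sort(key=lambda r: (-r[0], r[1]))
    return runs
```

The sort over all word occurrences dominates the cost, which is the main difference from the hash-map count that the awk one-liner performs.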

00:18 #17

Anatoly Tomilov in pro.algorithms

I need an algorithm.

00:22 #18

Anatoly Tomilov in pro.algorithms

As for stuff like uniq -c: even when I don't know a tool (in this case I do; read the question carefully), I can guess that one exists, and I know how to google.

00:23 #19

Anatoly Tomilov in pro.algorithms

lbh

That it will count everything for you. Do you need the result or the process itself?

What kind of question is that in a chat with this name?

00:24 #20