This is a text-only version of the following page on

---
Title  : Word occurrence counter and analyzer
Author : Remy van Elst
Date   : 07-03-2013
URL    :
Format : Markdown/HTML
---

With these commands you can analyze a text file. They count the occurrences of every word and print the statistics. This is useful for song lyrics, books, notes and any other text. It helps me analyze my writing style: which words I use most often, where my spelling errors are, and so on. It is also nice for winning an argument with someone over a Dragonforce song. This page uses lyrics as the example, but the commands apply to any text file.

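As a rough preview of the steps described on this page, the whole procedure can be sketched as one shell function. This is only a sketch under a few assumptions: the name `wordfreq` and its optional second argument are made up for illustration, and lowercasing is done with `tr` instead of the GNU sed `\L` escape used further down, so it should also run where GNU sed is not available. It also drops empty lines, so no count without a word shows up.

```shell
# Hypothetical helper, not part of the original commands.
# $1: text file to analyze
# $2: how many of the most frequent words to show (defaults to 20)
wordfreq() {
    # 1. keep only letters, digits and whitespace
    # 2. lowercase everything
    # 3. put every word on its own line
    # 4. drop empty lines
    # 5. count identical words and sort by count, highest first
    tr -cd '[:alnum:][:space:]' < "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -s '[:space:]' '\n' \
        | grep -v '^$' \
        | sort | uniq -c | sort -nr \
        | head -n "${2:-20}"
}
```

With the lyrics file from this page it would be called as `wordfreq df1.txt`, or `wordfreq df1.txt 5` for only the five most frequent words.
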
##### Get the Lyrics (text)

First get the lyrics, or whatever text you want to analyze, into a text file. I've heard nano, vi(m) and emacs are quite good with text. In this example I will use a song by Dragonforce. It does not matter which one, because they're all full of the same words. My lyrics file is named `df1.txt`.

##### Sanitize them

The tools we are going to use do not like all those commas, colons, exclamation marks and other non-alphanumeric characters. So sanitize the file like this:

    cat df1.txt | tr -cd '[:alnum:] [:space:]' > df1san.txt

This pumps the file through the `tr` command, which (with these arguments) strips every character that is not a letter, a digit or whitespace. Exactly what we want.

##### Analyze it

Now we do the magic:

    sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' df1san.txt | sort | uniq -c | sort -nr | head -n 20

The `sed` part lowercases every line (`\L` is a GNU sed extension) and puts each word on its own line. Then `sort | uniq -c` counts identical words, `sort -nr` orders them by count with the highest first, and `head -n 20` keeps the top 20. The count without a word next to it is for empty lines.

    remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' df1san.txt | sort | uniq -c | sort -nr | head -n 20
    72 the
    32
    25 and
    22 of
    20 in
    17 we
    16 on
    14 our
    13 a
    8 were
    8 lost
    8 for
    7 will
    7 still
    7 light
    6 to
    6 so
    6 fire
    6 far
    5 through

### Other Example

#### On my class notes about blood and the immune system

    remy@vps8:~$ cat afweer.txt | tr -cd '[:alnum:] [:space:]' > afweersan.txt
    remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' afweersan.txt | sort | uniq -c | sort -nr | head -n 20
    195
    108 de
    80 een
    72 van
    65 het
    51 in
    46 is
    40 en
    24 zijn
    24 op
    24 afweer
    22 die
    20 vraag
    20 deze
    19 worden
    18 kan
    17 bij
    16 dit
    15 er
    14 of

After stripping it of the non-useful words:

    remy@vps8:~$ cat afwres.txt | head -n 10
    24 afweer
    14 cellen
    11 bacterin
    9 waar
    9 reactie
    9 antigeen
    8 specifieke
    7 milieu
    7 lymfocyten
    7 lichaam

#### Fabian Scherschel's NaNoWriMo 2011 Book: Nightwatch

[GIT tree of the book][2] & [NaNoWriMo page][3]

Book is `Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License`

    1020 the
    454 he
    421 and
    418 of
    357 to
    347 had
    297 a
    267 was
    257 his
    241 that
    216 in
    132 it
    130 marc
    112 him
    108 as
    105 this
    105 they
    93 with
    90 but
    82 were
    82 from
    82 been
    82 at
    74 on
    70 would
    68 for
    68 could
    56 their
    56 be
    53 out
    51 into
    50 man
    49 all
    48 there
    48 so
    48 by
    47 looked
    46 not
    44 up
    44 them
    44 like

#### Analyzing IP and log files

Today I found another useful application for this command: analyzing IP addresses. First I grepped my entire lighttpd log file:

    cat access.log | egrep -o '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}' | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr

(`egrep -o` prints only the IP address, not the whole line the IP address is on. The dots are escaped so that they match a literal dot and not any character.)

That gives this nice list (this list is made up, not real IP addresses):

    2
    2
    2
    2
    2
    3
    3
    5
    348
    467

[Thanks to the wonderful community at stackexchange][4]

[1]:
[2]:
[3]:
[4]:
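
The IP address count from the log file section above can also be wrapped in a small function. This is a minimal sketch, not the command the list above was made with: the name `top_ips` and its optional second argument are hypothetical. Because `grep -o` prints every match on its own line, the `tr` and `grep -v` steps are left out. The pattern only checks the shape of an IPv4 address (four groups of one to three digits), it does not check that each group is between 0 and 255.

```shell
# Hypothetical helper, not part of the original commands.
# $1: log file to scan
# $2: how many of the busiest addresses to show (defaults to 10)
top_ips() {
    # -E: extended regular expressions, -o: print only the matched part
    grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" \
        | sort | uniq -c | sort -nr \
        | head -n "${2:-10}"
}
```

For the lighttpd log used above this would be `top_ips access.log`.
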