# English 360 (from angles)

1. The thread in which I used to post has been closed for some reason.

It would be nice for everyone who has a view or question to have a free way to express their mind.

As a continuation of my last post #1894 (http://www.allthelyrics.com/forum/le...l#post876796):

What a stupid blunder on my side: 'I have a friend of mine'.
Where is the (real-time) detector for wrong n-grams to warn me that the proper phrase is: 'A friend of mine'?

The point of that was to show the quatrain unfolded/n-grammed. And that was only a small part of the full n-gramming.
You understand that one unique 27-gram (here the unfolded quatrain) is built from many more n-grams, as follows:

1 2
sub-grams: 1,2,1 2
All: 2+(1)=3

1 2 3
sub-grams: 1,2,1 2,3,2 3,1 2 3
All: 3+(2+1)=6

1 2 3 4
sub-grams: 1,2,1 2,3,2 3,1 2 3,4,3 4,2 3 4,1 2 3 4
All: 4+(3+2+1)=10

1 2 3 4 5
sub-grams: 1,2,1 2,3,2 3,1 2 3,4,3 4,2 3 4,1 2 3 4,5,4 5,3 4 5,2 3 4 5,1 2 3 4 5
All: 5+(4+3+2+1)=15

...

That is, there are 27+...+1 = 27*(27+1)/2 = 378 in total. Only after realizing that each of them must be ranked individually can you see a GLIMPSE of the evaluation process.
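
A minimal sketch of the counting above (each length-n sequence yields n(n+1)/2 contiguous sub-grams), in Python for illustration:

```python
def contiguous_subgrams(tokens):
    """Enumerate every contiguous sub-n-gram of a token sequence."""
    n = len(tokens)
    return [tokens[i:j] for j in range(1, n + 1) for i in range(j)]

quatrain = [str(k) for k in range(1, 28)]  # stand-in for the 27 words
print(len(contiguous_subgrams(quatrain)))  # 378 == 27 * 28 // 2
```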

Had I posted all of them, I guess the moderator would have gone bananas, ha-ha.
My idea was (actually) to provide visual food. Written down they are more moving; I saw that only talking about them was a burden.

And I sense some static view of yours regarding whether the quatrain is good English or even commonly used English, while mine is neither static nor dynamic; that is, the lyrics' author can coin his/her own definitions (I mean free-for-all, i.e. slang/vulgar/archaic/obsolete usage being ranked, not "judged").

Your attention is on things that hardly have anything in common with my passion: ripping and indexing the whole of written English in order to create a brute-force phrase-checker (or n-gram-checker, to be more precise).
John, I want to salute you with a lyric (I intentionally "forgot" to mention the song info):

[
I look at the river
But I'm thinking of the sea
The past overwritten
By what I hope to be

I search for the essence
It's my command
That is the centre
It's what I demand

In my mind
I'm standing in a glass house
Looking below
Don't wanna look without seeing
Don't wanna touch without feeling

I dream in colour,
yeah I dream in colour
When the world's in black and sepia tone
Or in sleepy monochrome
I dream in colour,
yeah I dream in colour
I see much further than this
I see much further than this, yeah

In a room with no windows
And painted doors
What once was my ceiling
Now is the floor
My landscape is changing
From out of the dust
It's some kind of healing
Like the sun's coming up

In my mind
I'm standing in a glass house
Looking below
Don't wanna hear without listening
Don't wanna talk without speaking

I can see through the clouds of grey
Got a window on the world
I can sweep them all away
Got a window on the world
Don't wanna look without seeing
Don't wanna touch without feeling

I dream in colour,
yeah I dream in colour
When the world's in black and sepia tone
Or in sleepy monochrome
I dream in colour,
yeah I dream in colour
I see much further than this
There is gotta be more than this, yeah ...
]

Just wanted you to feel what my place in this world is (similar to the classic 'The man from the silver mountain'; here, a man from the glass house).

P.S.
My hurry mode is always buggy: the name of the thread should be 'from all angles'.
Last edited by Sanmayce; 06-19-2011 at 10:24 AM.

2. This interests me from the angle of (a) free and unbridled expression of thought, but also of course because (b) it's been said countless times that mathematics can express every known concept in the universe, like the idea that even the seeming randomness of weather can be expressed in a Markov chain and thus predicted, although almost no one seems to be able to adequately interpret the computational results. This is to say, more or less, that you really shouldn't feel bad about that "blunder".

May I ask, what is your particular core of study? And is it your goal to develop an institutional teaching aid?

3. Hi LycaNightmareLuc,
glad I am that at last someone dared to enter this 2,200,000,000 (yes, 2+billion people using English) deep water as you did.
I see you feel the theme, and believe me it take NOT a scientist to get how important no-no better said fundamental are these n-grams for all kind of analyses.

My English is broken and simple but this (amazingly it is my advantage) helps me a lot when I must figure out some approaches as how to do transitions from 1-gram all the way to the full-meaning n-gram let say 8-gram.
And I follow the Diamond (not golden) rule: 'keep it simple'. I am saying this in order to emphasize that Markov's and other similar (AI) conceptions are too complex (at least for me) I stick to the basics, yet. As you know there is a "classic song" named 'Walk this way' meaning that transition to more complex things must be paved by traversing the basics stuff first, don't you think? I wonder how it is possible big/rich organizations powered by some data-center cluster having not implemented yet a free on-line PHRASE-CHECKER (one of my dreams).

My core is x-grams, as to your next (wow, you are playing on my finest strings) I would like a lot - but the problems are too many though - after all I am only an amateur with no solid background neither in mathematics/programming nor linguistics, grumble.
You are (along with other people interested in exploring the basics) welcome to my free-and-open-sub-project: Leprechaun at http://www.sanmayce.com/Downloads/index.html#Leprechaun
Feel free to ask whatever interests you.

4. Giving definitions and basic goals is essential for any further explanations, but for now I will skip them.
Only to state the difference between n-grams and (my) x-grams: since there are no definitive definitions, here come mine:
1) An n-gram is a sequence of words. For example, for a given sequence of files (i.e. texts/contexts) with a total length (measured in words) of 30 trillion, we can say it is a single 30,000,000,000,000-gram. That is not as scary as it seems, because the chunks which constitute this big sequence (i.e. the files) can be mixed randomly with no significant impact on the current approach (targeted at smaller orders, i.e. books/chapters/paragraphs/sentences).
More clearly said: let some 30 billion files form this 30-trillion-word n-gram; whether the former are in one order/sequence or another doesn't matter, because we assume that each one of these files is a context by itself.
In my current view the usable electronic English can be roughly 4-grammed down to 10x800,000,000, or 8 billion phrases 4 words in length.
2) An x-gram is an n-gram derived by applying some rules to the latter.

I have so much to say... But here comes only my previous post, 2-grammed (my wish is to show how insufficient it is to stop somewhere in between, i.e. not to make the full mix by going up and down through the rest of the x-grams).
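
As an illustration, 2-gramming in the underscore-joined style used here might look like this; the tokenization rules (lowercasing, splitting on non-letter characters) are my assumptions, not necessarily Leprechaun's:

```python
import re

def x_grams(text, order):
    """Lowercase, split on non-letter runs, and return the distinct
    underscore-joined n-grams of the given order, sorted alphabetically."""
    words = re.findall(r"[a-z]+", text.lower())
    grams = {"_".join(words[i:i + order]) for i in range(len(words) - order + 1)}
    return sorted(grams)

print(x_grams("Keep it simple, keep it moving.", 2))
# ['it_moving', 'it_simple', 'keep_it', 'simple_keep']
```

Splitting on non-letters also reproduces entries such as 'don_t' and 'markov_s', where an apostrophe breaks the word in two.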

The following dump (2-grams_checked_with_Gamera_corpus.txt) was generated after 2-gramming the post (resulting in 245 distinct 2-grams) and checking it against the 124,669,942 Gamera corpus 2-grams (resulting in 231 found, i.e. familiar, 2-grams).
The first 14 2-grams are unfamiliar to this corpus; it means either the corpus is not rich enough or the phrases are wrong or suspicious.
The numbers at the right show the occurrences:

ai_conceptions
background_neither
basics_stuff
hi_lycanightmareluc
html_leprechaun
leprechaun_at
leprechaun_feel
named_walk
next_wow
nor_linguistics
organizations_powered
phrase_checker
say_gram
similar_ai
a_classic .................. 0,012,381
a_free ..................... 0,069,761
a_lot ...................... 0,258,571
a_scientist ................ 0,007,458
after_all .................. 0,223,436
all_i ...................... 0,099,114
all_kind ................... 0,002,310
all_the .................... 2,562,629
along_with ................. 0,211,834
am_only .................... 0,013,582
am_saying .................. 0,012,284
am_that .................... 0,007,216
amateur_with ............... 0,000,054
amazingly_it ............... 0,000,021
an_amateur ................. 0,004,950
and_believe ................ 0,015,673
and_i ...................... 2,039,312
and_open ................... 0,040,190
and_other .................. 0,624,922
and_simple ................. 0,028,709
approaches_as .............. 0,000,672
are_along .................. 0,000,433
are_playing ................ 0,005,494
are_these .................. 0,033,237
are_too .................... 0,059,746
as_how ..................... 0,010,652
as_to ...................... 0,905,845
as_you ..................... 0,573,837
at_http .................... 0,056,920
at_last .................... 0,474,990
at_least ................... 0,813,060
basics_welcome ............. 0,000,005
be_paved ................... 0,000,286
believe_me ................. 0,045,296
better_said ................ 0,000,328
big_rich ................... 0,000,091
billion_people ............. 0,001,440
broken_and ................. 0,012,883
but_the .................... 1,651,981
but_this ................... 0,300,958
by_some .................... 0,145,557
by_traversing .............. 0,000,459
center_cluster ............. 0,000,030
checker_one ................ 0,000,003
classic_song ............... 0,000,027
cluster_having ............. 0,000,016
complex_at ................. 0,000,808
complex_things ............. 0,000,292
conceptions_are ............ 0,000,946
core_is .................... 0,002,888
dared_to ................... 0,036,908
data_center ................ 0,010,922
deep_water ................. 0,007,289
diamond_not ................ 0,000,017
do_transitions ............. 0,000,004
don_t ...................... 2,521,353
emphasize_that ............. 0,004,050
english_deep ............... 0,000,001
english_is ................. 0,005,444
enter_this ................. 0,005,529
exploring_the .............. 0,011,097
feel_free .................. 0,015,551
feel_the ................... 0,069,388
figure_out ................. 0,028,661
finest_strings ............. 0,000,004
follow_the ................. 0,109,596
for_all .................... 0,435,338
for_me ..................... 0,409,489
free_and ................... 0,039,828
free_on .................... 0,006,985
free_to .................... 0,065,917
from_gram .................. 0,000,219
full_meaning ............... 0,002,540
fundamental_are ............ 0,000,032
get_how .................... 0,000,096
golden_rule ................ 0,003,688
gram_all ................... 0,000,004
gram_let ................... 0,000,001
grams_for .................. 0,000,167
having_not ................. 0,002,642
helps_me ................... 0,001,961
how_important .............. 0,006,141
how_it ..................... 0,135,087
how_to ..................... 0,675,794
i_must ..................... 0,349,202
i_see ...................... 0,245,025
i_stick .................... 0,001,332
i_wonder ................... 0,076,102
i_would .................... 0,524,124
implemented_yet ............ 0,000,219
important_no ............... 0,000,085
in_exploring ............... 0,002,022
in_mathematics ............. 0,007,647
in_order ................... 0,667,182
interested_in .............. 0,148,777
interests_you .............. 0,001,517
is_broken .................. 0,020,062
is_my ...................... 0,173,353
is_possible ................ 0,160,823
it_is ...................... 6,850,417
it_simple .................. 0,002,551
it_take .................... 0,009,024
keep_it .................... 0,064,214
kind_of .................... 0,563,256
know_there ................. 0,013,646
last_someone ............... 0,000,081
least_for .................. 0,014,429
let_say .................... 0,000,029
like_a ..................... 0,981,190
line_phrase ................ 0,000,006
lot_but .................... 0,000,356
lot_when ................... 0,000,644
many_though ................ 0,000,178
markov_s ................... 0,000,104
mathematics_programming .... 0,000,002
me_it ...................... 0,022,484
meaning_n .................. 0,000,127
meaning_that ............... 0,023,299
more_complex ............... 0,043,985
must_be .................... 1,258,275
must_figure ................ 0,000,384
my_core .................... 0,000,144
my_dreams .................. 0,007,886
my_english ................. 0,001,838
my_finest .................. 0,000,240
my_free .................... 0,001,459
n_gram ..................... 0,000,477
n_grams .................... 0,000,337
neither_in ................. 0,009,841
no_better .................. 0,042,076
no_no ...................... 0,023,610
no_solid ................... 0,001,315
not_a ...................... 0,771,246
not_golden ................. 0,000,150
not_implemented ............ 0,003,183
of_analyses ................ 0,000,738
of_my ...................... 0,965,421
on_line .................... 0,025,976
on_my ...................... 0,256,709
one_of ..................... 2,356,901
only_an .................... 0,039,438
open_sub ................... 0,000,018
order_to ................... 0,527,624
other_people ............... 0,086,478
other_similar .............. 0,007,512
out_some ................... 0,018,895
paved_by ................... 0,000,122
people_interested .......... 0,000,823
people_using ............... 0,002,410
playing_on ................. 0,007,589
possible_big ............... 0,000,010
powered_by ................. 0,006,378
problems_are ............... 0,014,630
programming_nor ............ 0,000,011
rich_organizations ......... 0,000,002
s_and ...................... 0,131,390
said_fundamental ........... 0,000,004
saying_this ................ 0,010,057
scientist_to ............... 0,000,777
see_you .................... 0,144,477
simple_but ................. 0,004,667
solid_background ........... 0,000,316
some_approaches ............ 0,000,384
some_data .................. 0,006,241
someone_dared .............. 0,000,012
song_named ................. 0,000,007
stick_to ................... 0,023,675
strings_i .................. 0,000,201
stuff_first ................ 0,000,081
sub_project ................ 0,000,053
t_you ...................... 0,340,406
take_not ................... 0,002,317
that_at .................... 0,103,975
that_markov ................ 0,000,011
that_transition ............ 0,000,514
the_basics ................. 0,016,494
the_diamond ................ 0,014,836
the_full ................... 0,160,394
the_problems ............... 0,040,573
the_theme .................. 0,018,891
the_way .................... 0,859,514
there_is ................... 2,465,476
these_n .................... 0,000,352
things_must ................ 0,006,472
this_amazingly ............. 0,000,066
this_in .................... 0,076,097
this_way ................... 0,218,235
though_after ............... 0,001,198
to_do ...................... 1,507,886
to_emphasize ............... 0,013,045
to_enter ................... 0,155,262
to_get ..................... 0,856,721
to_more .................... 0,046,096
to_my ...................... 0,511,518
to_the ..................... 9,999,999
too_complex ................ 0,002,959
too_many ................... 0,061,664
transition_to .............. 0,009,243
transitions_from ........... 0,001,349
traversing_the ............. 0,004,304
using_english .............. 0,000,306
walk_this .................. 0,000,930
water_as ................... 0,009,663
way_meaning ................ 0,000,051
way_to ..................... 0,551,881
welcome_to ................. 0,042,132
whatever_interests ......... 0,000,083
when_i ..................... 0,725,337
with_no .................... 0,179,336
with_other ................. 0,141,155
wonder_how ................. 0,013,046
would_like ................. 0,122,628
x_grams .................... 0,000,032
yet_a ...................... 0,033,118
you_are .................... 1,943,806
you_did .................... 0,100,879
you_feel ................... 0,090,641
you_know ................... 0,742,812
you_think .................. 0,334,433
your_next .................. 0,007,257
your_questions ............. 0,006,720

Note: to see aligned output, use a monospaced font such as Courier.

Back then in 2000 I had had a favorite song, 'Crush' by Jennifer Paige; from its lyrics I borrowed the phrase 'it doesn't take a scientist' and used an amplified variant, 'it takes not a scientist', stupidly omitting the 's'.
As you can see other dumb-dumbs like me have been polluted the corpus by using the wrong phrase 'it_take' 9,024 times; it is tricky to make an effective screening of such bad entries.
As far as I can see, such bad "collocations" are the result either of pure errors or of not using commas as delimiters in sentences such as: 'If you are really into it[,] take off.' Or, mostly, because of phrases like: 'why_did_it_take_us_so_long_to', 'how_long_would_it_take_you_to_isolate', 'it_and_make_it_take_care_of_scrolling', 'let_it_take_her_wherever_she_wished_to' ...
Here, however, comes the reinforcement of 4-gram checking (skipping order 3), where the phrases 'it_take_not_a' and 'me_it_take_not', along with 'believe_me_it_take', are already marked as unfamiliar, i.e. with zero occurrences. Thus I see the screening: no algorithms, no heuristics whatsoever - just a smooth transition between orders (1 to 8 is generally enough):
Again, the first 156 4-grams (out of 216 in total) are unfamiliar, whereas the bottom 60 are familiar (presumably correct):
Apparently the 879,557,846 distinct 4-grams in use are far fewer than needed to cover (i.e. to evaluate fully) a general text like my post (do the 8 billion x-grams mentioned before look like an exaggerated number now?):

a_classic_song_named
a_scientist_to_get
ai_conceptions_are_too
all_kind_of_analyses
am_that_at_last
amateur_with_no_solid
amazingly_it_is_my
and_open_sub_project
and_other_similar_ai
and_simple_but_this
approaches_as_how_to
are_playing_on_my
are_these_n_grams
are_too_complex_at
are_too_many_though
at_last_someone_dared
background_neither_in_mathematics
basics_welcome_to_my
be_paved_by_traversing
believe_me_it_take
better_said_fundamental_are
big_rich_organizations_powered
billion_people_using_english
broken_and_simple_but
but_this_amazingly_it
by_some_data_center
by_traversing_the_basics
center_cluster_having_not
checker_one_of_my
classic_song_named_walk
cluster_having_not_implemented
complex_things_must_be
conceptions_are_too_complex
core_is_x_grams
data_center_cluster_having
deep_water_as_you
diamond_not_golden_rule
do_transitions_from_gram
emphasize_that_markov_s
english_deep_water_as
english_is_broken_and
exploring_the_basics_welcome
figure_out_some_approaches
finest_strings_i_would
follow_the_diamond_not
for_me_i_stick
free_and_open_sub
free_on_line_phrase
from_gram_all_the
full_meaning_n_gram
fundamental_are_these_n
get_how_important_no
gram_all_the_way
gram_let_say_gram
grams_for_all_kind
having_not_implemented_yet
how_important_no_no
how_to_do_transitions
html_leprechaun_feel_free
i_follow_the_diamond
implemented_yet_a_free
important_no_no_better
in_exploring_the_basics
in_mathematics_programming_nor
is_a_classic_song
is_broken_and_simple
is_possible_big_rich
it_is_possible_big
it_take_not_a
last_someone_dared_to
least_for_me_i
leprechaun_feel_free_to
line_phrase_checker_one
lot_but_the_problems
lot_when_i_must
many_though_after_all
markov_s_and_other
mathematics_programming_nor_linguistics
me_i_stick_to
me_it_take_not
meaning_n_gram_let
meaning_that_transition_to
more_complex_things_must
must_be_paved_by
my_english_is_broken
my_finest_strings_i
my_free_and_open
n_gram_let_say
n_grams_for_all
named_walk_this_way
neither_in_mathematics_programming
no_better_said_fundamental
no_no_better_said
no_solid_background_neither
not_a_scientist_to
not_implemented_yet_a
on_line_phrase_checker
on_my_finest_strings
only_an_amateur_with
organizations_powered_by_some
other_similar_ai_conceptions
out_some_approaches_as
paved_by_traversing_the
people_interested_in_exploring
people_using_english_deep
phrase_checker_one_of
playing_on_my_finest
possible_big_rich_organizations
powered_by_some_data
problems_are_too_many
rich_organizations_powered_by
said_fundamental_are_these
scientist_to_get_how
see_you_feel_the
similar_ai_conceptions_are
simple_but_this_amazingly
solid_background_neither_in
some_approaches_as_how
some_data_center_cluster
someone_dared_to_enter
song_named_walk_this
strings_i_would_like
take_not_a_scientist
that_markov_s_and
that_transition_to_more
the_basics_stuff_first
the_diamond_not_golden
the_full_meaning_n
these_n_grams_for
things_must_be_paved
this_amazingly_it_is
this_way_meaning_that
to_do_transitions_from
to_emphasize_that_markov
to_get_how_important
to_my_free_and
to_your_next_wow
too_complex_at_least
too_many_though_after
transitions_from_gram_all
traversing_the_basics_stuff
using_english_deep_water
walk_this_way_meaning
way_meaning_that_transition
welcome_to_my_free
when_i_must_figure
with_no_solid_background
would_like_a_lot
yet_a_free_on
you_feel_the_theme
a_free_on_line ............... 0,000,005
a_lot_but_the ................ 0,000,007
after_all_i_am ............... 0,000,147
all_i_am_only ................ 0,000,023
all_the_way_to ............... 0,012,417
along_with_other_people ...... 0,000,050
am_only_an_amateur ........... 0,000,021
am_saying_this_in ............ 0,000,001
an_amateur_with_no ........... 0,000,002
and_believe_me_it ............ 0,000,026
and_i_follow_the ............. 0,000,036
are_along_with_other ......... 0,000,002
as_to_your_next .............. 0,000,009
as_you_know_there ............ 0,000,028
at_least_for_me .............. 0,000,218
but_the_problems_are ......... 0,000,025
complex_at_least_for ......... 0,000,002
dared_to_enter_this .......... 0,000,007
don_t_you_think .............. 0,024,484
for_all_kind_of .............. 0,000,059
helps_me_a_lot ............... 0,000,011
how_it_is_possible ........... 0,001,278
i_am_saying_this ............. 0,000,337
i_must_figure_out ............ 0,000,008
i_see_you_feel ............... 0,000,026
i_stick_to_the ............... 0,000,100
i_wonder_how_it .............. 0,000,467
i_would_like_a ............... 0,000,724
in_order_to_emphasize ........ 0,000,370
interested_in_exploring_the .. 0,000,112
know_there_is_a .............. 0,000,994
like_a_lot_but ............... 0,000,010
me_a_lot_when ................ 0,000,004
must_figure_out_some ......... 0,000,002
one_of_my_dreams ............. 0,000,136
order_to_emphasize_that ...... 0,000,033
other_people_interested_in ... 0,000,016
s_and_other_similar .......... 0,000,010
saying_this_in_order ......... 0,000,001
stick_to_the_basics .......... 0,000,021
that_at_last_someone ......... 0,000,006
the_basics_welcome_to ........ 0,000,002
the_problems_are_too ......... 0,000,016
the_way_to_the ............... 0,020,412
there_is_a_classic ........... 0,000,037
this_in_order_to ............. 0,000,879
though_after_all_i ........... 0,000,019
to_more_complex_things ....... 0,000,005
to_the_full_meaning .......... 0,000,046
transition_to_more_complex ... 0,000,002
water_as_you_did ............. 0,000,006
way_to_the_full .............. 0,000,035
with_other_people_interested . 0,000,003
wonder_how_it_is ............. 0,000,222
you_are_along_with ........... 0,000,003
you_are_playing_on ........... 0,000,095
you_know_there_is ............ 0,001,117
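
The screening by escalating the order can be sketched as follows; the 2-gram counts are taken from the dumps above, while the order-4 entry is invented for illustration:

```python
# Toy occurrence counts: the 2-grams carry their dump values,
# while the order-4 entry is a hypothetical count for the correct form.
corpus = {
    "believe_me": 45296,
    "me_it": 22484,
    "it_take": 9024,              # polluted, yet familiar at order 2
    "believe_me_it_takes": 3,     # invented for illustration
}

def familiar(words, order, corpus):
    """Return every window of the given order with its corpus count."""
    return [("_".join(words[i:i + order]),
             corpus.get("_".join(words[i:i + order]), 0))
            for i in range(len(words) - order + 1)]

phrase = "believe me it take".split()
print(familiar(phrase, 2, corpus))  # every 2-gram has occurrences
print(familiar(phrase, 4, corpus))  # [('believe_me_it_take', 0)] - suspicious
```

No heuristics are involved: the same lookup, run at a higher order, is what exposes the phrase.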

I am like a talking (not smiling) cat throwing a toy from paw to paw while thoughts run free. I mean, my mode of approaching things is not the go-for-it type but rather play-with-it. In other words I do not strive after my dreams here; I do not want to learn the English language but to give a powerful and simple brute-force sidekick tool (as a first stage of analysis) while playing with it.

There are so many facets (blinking/screaming to be examined) remaining...

5. Caramba, I am doomed to commit/track/find all errors in existence, but this of course comes in handy, AGAIN, just to explain another important facet, namely pseudo-syntax-checking, coming as a by-product of long x-grams.
The dumb 'been' mistake follows:
'As you can see other dumb-dumbs like me have BEEN polluted the corpus...'
Of course 'have_been_polluted_the' yields no matches, and neither does 'been_polluted_the'.

And to illustrate the same idea, but this time by replacing/mistaking the correct 'had' with 'have':
Incorrect one: 'Back then in 2000 I have had a favorite...'
One of the useful properties of x-grams is the auto-omission of '2000', it being not a literal sequence but a digital one.
The x-gram of order 6 'back_then_in_i_have_had', by not yielding a match, suggests the incorrectness of the 'have had' construction (due to the usage of 'back then').
And being able to handle other scenarios like "60's" or "19th century" used in place of "2000" dictates the need for WILDCARDS - but this hurts performance.
So let us expand the MISTAKEN sub-sentences with some in-between x-grams in addition to 's' and 'th_century' (they are: 'the_s', 'late_s', 'the_late_s'):
'Back then in 2000 I HAVE had a favorite...'
'Back then in 60's I HAVE had a favorite...'
'Back then in the 60's I HAVE had a favorite...'
'Back then in late 60's I HAVE had a favorite...'
'Back then in the late 60's I HAVE had a favorite...'

'Back then in 19th century I HAVE had a favorite...'
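
A sketch of the digit auto-omission: keeping only letter runs makes '2000' vanish and reduces "60's" to a bare 's', so the MISTAKEN variants above collapse onto x-grams a corpus lookup can refuse. The tokenization rule is my assumption:

```python
import re

def skeleton(sentence, order=6):
    """Lowercase, keep only letter runs (digits vanish automatically),
    and return the leading x-gram of the given order."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return "_".join(words[:order])

print(skeleton("Back then in 2000 I HAVE had a favorite..."))
# back_then_in_i_have_had
print(skeleton("Back then in the late 60's I HAVE had a favorite..."))
# back_then_in_the_late_s  ('60's' leaves only its trailing s)
```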

Here arises a very powerful feature/facet not examined yet: a branch of phrase-checking - phrase-suggesting.
The point is to write the sure part(s) and to get/receive some suggestions as feedback for in-between x-grams, not only for preceding and following x-grams (the latter is already partially implemented in big search sites).
Let the sure part(s) be:
1] Back then in
2] I HAVE had a favorite
No suggestions due to the wrong tense.
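
Phrase-suggesting for in-between x-grams might be sketched like this; the corpus counts are invented, standing in for a real index:

```python
# Invented x-gram counts; note '60's' has already been reduced to 's'.
corpus = {
    "back_then_in_the_s_i_had": 5,
    "back_then_in_late_s_i_had": 2,
}

def suggest_between(prefix, suffix, corpus, top=3):
    """Return in-between fillers X such that prefix + X + suffix is
    attested, ranked by occurrence count."""
    hits = []
    for gram, count in corpus.items():
        parts = gram.split("_")
        if parts[:len(prefix)] == prefix and parts[-len(suffix):] == suffix:
            hits.append(("_".join(parts[len(prefix):-len(suffix)]), count))
    return [mid for mid, _ in sorted(hits, key=lambda t: -t[1])][:top]

# Correct tense: the gap fillers come back, ranked.
print(suggest_between(["back", "then", "in"], ["i", "had"], corpus))
# ['the_s', 'late_s']
# Wrong tense ('i have'): nothing is attested, so no suggestions.
print(suggest_between(["back", "then", "in"], ["i", "have"], corpus))  # []
```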
Let the sure part(s) be:
1] Back then in