Semantic Map word embedding
By Johannes E. Mosig (Rasa).
- Semantic Map
  - Francisco Webber @ Cortical.io
- words are represented by semantic fingerprints
  - 128x128 2D array of 0/1 bits (~98% are 0s)
  - top and bottom edges are identified, as are left and right, so the map is
    a torus
- size of 128 works best so far
- each cell describes a class of contexts (e.g. sentences or paragraphs)
- paragraphs are kind of like bags of words
- called "context class"
- cells nearby are similar
- semantic map = fingerprints of all words
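Because opposite edges of the map are identified, the shortest path between two cells wraps around the edges. A minimal sketch of this wrap-around distance, assuming cells are addressed by (row, col) on the 128x128 grid:

```python
import math

SIZE = 128  # map is 128x128, per the notes above

def torus_distance(a, b, size=SIZE):
    """Euclidean distance between cells a and b on a wrap-around grid."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    dr = min(dr, size - dr)  # wrap vertically (top edge meets bottom)
    dc = min(dc, size - dc)  # wrap horizontally (left edge meets right)
    return math.hypot(dr, dc)
```

On a torus the corners (0, 0) and (127, 127) are neighbors, one step apart along each axis.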
- Training
- split wikipedia into paragraphs, append headings to each paragraph
  - stop words could be removed; currently they are not
  - each snippet (paragraph) is represented by the set of its words, one bit
    per vocabulary word, so it's a ~50k-bit vector
- [self-organizing map algorithm][1] to project the snippet vectors onto
points of a square
    - distance is Euclidean distance
    - there's an idea of a Bayesian distance that would give more weight to
      the coordinates that are rarely different
- it's easy to parallelize
- you can give each cell its own update radius: this gives better
topographic error convergence but worse maps in the end
  - each word is represented by the top 2% of cells in the combined images of
    the snippets that contain it
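The last step above could be sketched as follows, assuming each snippet maps to a (row, col) cell of the trained map and a word's fingerprint keeps the ~2% of cells most often hit by snippets containing that word (the function name and input shape are illustrative):

```python
import numpy as np

SIZE = 128
TOP_FRACTION = 0.02  # keep the top 2% of cells active

def word_fingerprint(snippet_positions, size=SIZE, top=TOP_FRACTION):
    """snippet_positions: (row, col) map positions of the snippets that
    contain the word. Returns a size x size binary fingerprint."""
    counts = np.zeros((size, size))
    for r, c in snippet_positions:
        counts[r, c] += 1
    k = int(top * size * size)  # number of candidate active bits (~327)
    fp = np.zeros_like(counts, dtype=np.uint8)
    if k > 0:
        # indices of the k largest counts; cells never hit stay 0
        idx = np.argpartition(counts.ravel(), -k)[-k:]
        fp.ravel()[idx] = counts.ravel()[idx] > 0
    return fp
```

Since ~2% of 128x128 is ~327 cells, a word seen in fewer snippets simply activates fewer bits.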
- What can you do with it
- Compare words by their fingerprints (more overlap = more similar/related)
- Merge words to get new words
- (my guess is minimal sum of distances)
- unitize(threshold(boost(A) + boost(B)))
- boost = amplify cells with active members
- threshold = take N strongest
      - unitize = map every nonzero cell to 1 (zeros stay 0)
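A minimal sketch of the compare and merge operations above. The overlap count follows directly from the notes; the form of "boost" is an assumption here (amplifying each cell by the number of active cells in its wrap-around 3x3 neighbourhood), since the notes only say it amplifies cells with active members:

```python
import numpy as np

def overlap(a, b):
    """Compare two binary fingerprints: more shared bits = more related."""
    return int(np.logical_and(a, b).sum())

def boost(fp):
    """Assumed form of boost: sum of the 3x3 wrap-around neighbourhood."""
    s = np.zeros(fp.shape, dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            s += np.roll(np.roll(fp, dr, axis=0), dc, axis=1)
    return s

def threshold(x, n):
    """Keep the n strongest cells, zero out the rest."""
    out = np.zeros_like(x)
    if n > 0:
        idx = np.argpartition(x.ravel(), -n)[-n:]
        out.ravel()[idx] = x.ravel()[idx]
    return out

def unitize(x):
    """Map every nonzero cell to 1; zeros stay 0."""
    return (x != 0).astype(np.uint8)

def merge(a, b, n_active):
    """unitize(threshold(boost(A) + boost(B))), as in the notes."""
    return unitize(threshold(boost(a) + boost(b), n_active))
```

The merged fingerprint favours regions where the two words' active cells lie close together on the map, which matches the idea that nearby cells describe similar context classes.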
- Questions?
- can we do it in 3d? yes, but not sure if it's better.
- could we pay attention to the order of words when building word vectors?
- maybe words that appear many times in the snippet should have more weight
in the snippet vector (it would not be a binary vector then)?
- maybe closer words in snippet should have more weight (here we would
modify snippet vectors for each word when we're mapping this word)?
- or maybe we can use some kind of attention mechanism to see which words
to amplify?
- when you are picking the top 2%, you can divide appearance frequencies by
frequencies of that pixel among all snippet vectors or among snippet
vectors of all words (in this second approach you would count each snippet
vector for each word in it).
- seems that you're already doing this
- when comparing words, near pixels could give part of a point?
  - stop words are probably represented by the most common clusters in the
    corpus
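The normalization idea above (divide a word's per-cell counts by how often that cell fires across all snippet vectors before picking the top cells) could be sketched like this; the function name and inputs are illustrative, and this is an idf-like reweighting rather than anything confirmed in the talk:

```python
import numpy as np

def normalized_fingerprint(counts, global_counts, top_k):
    """counts: per-cell hit counts for one word.
    global_counts: per-cell hit counts over all snippet vectors.
    Keeps the top_k cells by normalized score as active bits."""
    scores = counts / np.maximum(global_counts, 1)  # avoid division by zero
    fp = np.zeros(scores.shape, dtype=np.uint8)
    idx = np.argpartition(scores.ravel(), -top_k)[-top_k:]
    fp.ravel()[idx] = 1
    return fp
```

Cells that fire for nearly every snippet (the suspected stop-word clusters) would then be down-weighted and rarely make the top 2%.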
[1]: https://en.wikipedia.org/wiki/Self-organizing_map