Express the meaning of a linguistic entity (word, phrase, sentence, paragraph, or document) as a vector in a multi-dimensional space, such that proximity between vectors reflects similarity in meaning between the entities they represent.

All such methods rely on the maxim “You shall know a word by the company it keeps”: words with similar meanings will tend to occur in similar linguistic contexts. For instance, think of the things you might say about your cat: she’s so cute, she’s furry, she needs to be fed, I need to let her out, she bit me, etc. You might also say these things about your dog, because dogs and cats are similar kinds of things, but you wouldn’t say them about, say, your blender. So tabulating co-occurrences among words in sentences (or paragraphs, or documents, etc.) can provide clues about their similarity in meaning. All methods for computing vector-space representations of word meanings rely on this general intuition, but they differ in the specific ways these co-occurrence statistics are computed and analyzed. This page collects documentation for X different methods.
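To make the intuition concrete, here is a minimal Python sketch of tabulating co-occurrence counts from sentences. The toy corpus and the choice of "same sentence" as the co-occurrence window are illustrative assumptions, not part of any particular method.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each sentence is a list of tokens (illustrative example only).
sentences = [
    ["my", "cat", "is", "cute"],
    ["my", "cat", "bit", "me"],
    ["my", "dog", "is", "cute"],
    ["my", "dog", "bit", "me"],
    ["the", "blender", "is", "broken"],
]

# Count how often each pair of distinct words occurs in the same sentence.
cooccurrence = Counter()
for sentence in sentences:
    for w1, w2 in combinations(sorted(set(sentence)), 2):
        cooccurrence[(w1, w2)] += 1

# Words that share many contexts ("cat"/"dog") accumulate similar count
# profiles, while "blender" shares almost none of them.
for pair, count in cooccurrence.most_common(5):
    print(pair, count)
```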

Singular value decomposition or SVD (e.g. Latent Semantic Analysis, Hyperspace Analogue to Language, etc.)

These methods create a giant word-to-word matrix whose entries indicate how often two words appear together within some window of text. The co-occurrence counts are typically reweighted and the matrix is then decomposed with singular value decomposition, retaining only the top n dimensions. The analysis returns an n-dimensional vector for each word, with proximity in the n-dimensional space indicating how similar the words are in meaning.
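As a rough illustration of the decomposition step, here is a minimal Python sketch using scikit-learn's TruncatedSVD on a small, made-up word-by-word count matrix. The vocabulary, the counts, and the choice of two dimensions are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical word-by-word co-occurrence counts (rows and columns share one vocabulary).
vocab = ["cat", "dog", "blender", "cute", "furry", "broken"]
counts = np.array([
    [0, 2, 0, 5, 4, 0],   # cat
    [2, 0, 0, 4, 5, 0],   # dog
    [0, 0, 0, 0, 0, 3],   # blender
    [5, 4, 0, 0, 1, 0],   # cute
    [4, 5, 0, 1, 0, 0],   # furry
    [0, 0, 3, 0, 0, 0],   # broken
], dtype=float)

# Decompose the matrix with a truncated SVD, keeping n = 2 dimensions;
# each row of `vectors` is the semantic vector for the corresponding word.
svd = TruncatedSVD(n_components=2, random_state=0)
vectors = svd.fit_transform(counts)

# Proximity (here, cosine similarity) in the reduced space reflects similarity in meaning.
sims = cosine_similarity(vectors)
print("cat vs dog:    ", sims[vocab.index("cat"), vocab.index("dog")])
print("cat vs blender:", sims[vocab.index("cat"), vocab.index("blender")])
```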

  • Watch this high-level video
  • Read this online tutorial and carry out exercises in Python
  • Work through this Jupyter notebook example

Videos

  • Video 1
  • Video 2

Applied papers

  • Applied paper 1
  • Applied paper 2

Online tutorials

  • Online tutorial 1
  • Online tutorial 2

Theory papers

  • Theory paper 1
  • Theory paper 2

Latent Dirichlet Allocation or LDA (e.g. topic models)

These methods imagine that documents are generated by sampling words from a mixture of underlying topics. Each topic is represented as a probability distribution over words, and each document is viewed as having been generated by sampling from a probability distribution over topics. Given a large set of documents plus a parameter that specifies how many topics there are, it is possible to estimate the latent topics and the mixture probabilities over topics for each document in the training set, and then to infer those mixture probabilities for new documents. These mixture probabilities constitute a semantic vector: documents that sample the underlying topics in the same way probably contain similar content. So this is a good way of generating semantic vectors representing whole documents. By inspecting the most probable words associated with each topic, it is also possible to get a sense of what each topic “means,” which in turn produces human-interpretable information about documents.
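As a rough illustration, the sketch below fits an LDA model with scikit-learn on a tiny made-up corpus. The documents, the choice of two topics, and the preprocessing are illustrative assumptions; real applications use much larger collections.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus of "documents" (illustrative only).
documents = [
    "the cat chased the mouse and purred",
    "dogs and cats need food and water",
    "the blender mixes fruit into smoothies",
    "kitchen appliances like blenders and toasters",
]

# Represent each document as a vector of word counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit an LDA model; n_components is the assumed number of latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # rows are per-document topic mixtures

# The topic-mixture rows serve as semantic vectors for whole documents.
print(doc_topic)

# Inspect the most probable words for each topic to see what it "means".
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}:", top)
```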

  • Watch this high-level video
  • Read this online tutorial and carry out exercises in Python
  • Work through this Jupyter notebook example

Videos

  • Video 1
  • Video 2

Applied papers

  • Applied paper 1
  • Applied paper 2

Online tutorials

  • Online tutorial 1
  • Online tutorial 2

Theory papers

  • Theory paper 1
  • Theory paper 2