The Handbook of Speech Perception


An even greater expansion of these semantic regions can be found in more recent work (Huth et al., 2016). To see how the meaning of a word might be represented as a list of numbers, consider the following four words.

      1 airplane

      2 boat

      3 celery

      4 strawberry

      One way to encode each of these as a list of numbers is to simply assign one number to each word: ‘airplane’ = [1], ‘boat’ = [2], ‘celery’ = [3], and ‘strawberry’ = [4]. We have enclosed the numbers in square brackets to mean that these are lists. Note that it is possible to have only one item in a list. A good thing about this encoding of the words, as lists of numbers, is that the resulting lists are short and easy to decode: we only have to look them up in our memory or in a table. But this encoding does not do a very good job of capturing the differences in meanings between the words. For example, ‘airplane’ and ‘boat’ are both manufactured vehicles that you could ride inside, whereas ‘celery’ and ‘strawberry’ are both edible parts of plants. A more involved semantic coding might make use of all of these descriptive features to produce the following representations.

Word         Manufactured   Vehicle   Ride inside   Edible   Plant part
airplane          1            1           1           0          0
boat              1            1           1           0          0
celery            0            0           0           1          1
strawberry        0            0           0           1          1
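
      For readers who want to see this concretely, the short Python sketch below (our own illustration, not code from the chapter) builds both encodings: it uses NumPy and cosine similarity as an assumed way of comparing vectors, with the feature values copied from the table above.

import numpy as np

# Encoding 1: one arbitrary number per word -- easy to look up, but the
# numeric distances (e.g. |1 - 2| vs. |1 - 4|) say nothing about meaning.
index_code = {"airplane": [1], "boat": [2], "celery": [3], "strawberry": [4]}

# Encoding 2: one value per semantic feature
# [manufactured, vehicle, ride inside, edible, plant part]
feature_code = {
    "airplane":   np.array([1, 1, 1, 0, 0]),
    "boat":       np.array([1, 1, 1, 0, 0]),
    "celery":     np.array([0, 0, 0, 1, 1]),
    "strawberry": np.array([0, 0, 0, 1, 1]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(feature_code["airplane"], feature_code["boat"]))    # 1.0 (similar meanings)
print(cosine(feature_code["airplane"], feature_code["celery"]))  # 0.0 (unrelated meanings)

      With the feature vectors, words that share descriptive features end up close together, while the single-number encoding gives no such information.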

      So far, our example may seem tedious and somewhat arbitrary: we had to come up with attributes such as “manufactured” or “edible,” then consider their merit as semantic feature dimensions without any obvious objective criteria. However, there are many ways to automatically search for word embeddings without needing to dream up a large set of semantic fields. An incrementally more complex approach is to rely on the context words that co‐occur with each of our target words in a corpus of sentences. Consider a corpus that contains exactly four sentences.

      1 The boy rode on the airplane.

      2 The boy also rode on the boat.

      3 The celery tasted good.

      4 The strawberry tasted better.

      For each target word, we can simply count how many times every other word in the corpus appears in the same sentence as it; the resulting list of counts serves as that target word’s context‐word embedding. Unlike the previous semantic‐field embeddings, which were constructed using our “expert opinions,” these context‐word embeddings were learned from data (a corpus of four sentences). Learning a set of word embeddings from data can be very powerful. Indeed, we can automate the procedure, and even a modest computer can process very large corpora of text to produce embeddings for hundreds of thousands of words in seconds. Another strength of creating word embeddings like these is that the procedure is not limited to concrete nouns, since context words can be found for any target word – whether an abstract noun, verb, or even a function word. You may be wondering how context words are able to represent meaning, but notice that words with similar meanings are bound to co‐occur with similar context words. For example, an ‘airplane’ and a ‘boat’ are both vehicles that you ride in, so they will both occur quite frequently in sentences with the word ‘rode’; however, one will rarely find sentences that contain both ‘celery’ and ‘rode.’ Compared to ‘airplane’ and ‘boat,’ ‘celery’ is more likely to occur in sentences containing the word ‘tasted.’ As the English phonetician Firth (1957, p. 11) wrote: “You shall know a word by the company it keeps.”
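
      The sketch below (again our own Python illustration, not code from the chapter) carries out that counting for the four‐sentence corpus above; the whitespace tokenization and the same‐sentence counting window are simplifying assumptions chosen for brevity.

from collections import Counter

corpus = [
    "the boy rode on the airplane",
    "the boy also rode on the boat",
    "the celery tasted good",
    "the strawberry tasted better",
]
targets = ["airplane", "boat", "celery", "strawberry"]

# Context vocabulary: every word in the corpus except the targets themselves,
# listed alphabetically (also, better, boy, good, on, rode, tasted, the).
vocab = sorted({w for s in corpus for w in s.split()} - set(targets))

# For each target, count how often each context word shares a sentence with it.
embeddings = {}
for target in targets:
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target)
    embeddings[target] = [counts[w] for w in vocab]

print(vocab)
for target in targets:
    print(f"{target:>10}: {embeddings[target]}")
# 'airplane' and 'boat' end up with nearly identical count vectors (both
# co-occur with 'boy', 'rode', 'on', 'the'), while 'celery' and 'strawberry'
# pattern together around 'tasted' -- Firth's point in miniature.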

Word also better boy good