Daniel Kluver

Portmanteau Algorithms

I am someone who likes to play with words. For me, this normally means a good (or rather, bad) pun. But there is a special place in my heart for portmanteaus - words made by linguistically combining two other words. This has been especially prevalent in my Tuesday night role-playing game group, which is currently playing a campaign in space. We don't take the stairs, we take the space stairs, the spairs. We don't use money, we use space money, sponey.

Recently I wondered: do we know how to have a computer make portmanteaus? I expect that a computer program to generate portmanteaus could be quite fun. I was surprised to see that there hasn't been much serious work on the subject, so I thought I might give it a try myself. Assuming this project holds my attention, this blog post will be the first of several on the subject of automatic portmanteau generation. In this post I intend to introduce the problem of portmanteau generation, discuss algorithms that I found in the wild, and introduce a dataset that I intend to use for future algorithm development.

Portmanteaus

Wikipedia defines a portmanteau as "a linguistic blend of words in which parts of multiple words, or their phones (sounds), and their meanings are combined into a new word". Making portmanteaus is one of the ways in which a language expands: existing words are combined into a new word with a combined meaning. Looking at Wikipedia's list of portmanteaus, which it notes is only a selection of English-language portmanteaus, it seems clear to me that portmanteau creation is a relatively important tool for humans to coin new words and communicate novel ideas. Therefore, I believe that algorithms to generate and/or understand portmanteaus will be important for future human-facing computer programs, both to understand the novel ideas of humans and to explain the novel ideas of computers. A recent paper has already identified a place where portmanteau generation may benefit science: researchers tried to predict how people would blend hashtags in order to predict future trending hashtag blends. That work may well benefit from an algorithm that can predict hashtag blends more robustly than simply appending two hashtags.

An important part of the definition of a portmanteau is that it is a blend of words. While this excludes straight combinations of two words, contractions, and compound words, it still allows a large range of word-combination schemes. As I am not an expert in linguistics, I plan on being very open with my definition of portmanteaus. While I typically think of a portmanteau as taking the first part of word A and combining it with the second half of word B (spoon + fork = spork), I don't want to miss better blends such as those from Lewis Carroll, who came up with words like slithy (slimy + lithe). I would love it if I could find an algorithm that can reach this level of poetic word combination, but I will settle for much less.

To my knowledge, very little research has gone into how to algorithmically derive good portmanteaus. This is not terribly surprising, as it isn't critical to most natural language processing tasks. Very few programs are in a place where they would benefit from creating new words to express themselves. Likewise, for the purpose of understanding humans, it is entirely acceptable not to understand a novel portmanteau until it reaches common usage, at which point it can be learned like any other new word. Regardless, I have found three general solutions online.

The first solution is from github user aparrish. This solution first randomizes which word will be taken first, takes up to the first vowel of that word, then takes everything after the first vowel of the second word. This is a very straightforward solution, but it does tend to fail in some cases, especially those in which a word starts with a vowel.
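To make that concrete, here is a minimal Python sketch of my reading of that approach. This is not aparrish's actual code, and I am assuming "up to the first vowel" includes the vowel itself:

```python
import random

VOWELS = "aeiou"

def first_vowel_index(word):
    """Return the index of the first vowel in word, or None if there is none."""
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            return i
    return None

def blend(word_a, word_b):
    """Prefix of a randomly chosen word through its first vowel, plus the
    remainder of the other word after its first vowel."""
    first, second = random.sample([word_a, word_b], 2)
    i = first_vowel_index(first)
    j = first_vowel_index(second)
    if i is None or j is None:
        # Fallback for vowel-less words; the original may handle this differently.
        return first + second
    return first[:i + 1] + second[j + 1:]

print(blend("spoon", "fork"))  # "spork" if spoon goes first, "fon" if fork does
```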

The second solution is actually from a practice exam in a Udacity course. This is quite a nice solution. It takes a list of words and returns its best "portmanteau" from that list. It does this by finding all subwords pre, mid, and post such that pre+mid is a word in the list and mid+post is a word in the list. In this way it considers every matching where the beginning of one word overlaps the end of another. It then picks one based on length, with the goal that pre be about one fourth of the total length, mid one half, and post the final fourth.
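The gist of that overlap search, re-implemented in Python from the description above (the scoring below is my own rough stand-in for the exam's length heuristic, not the original code):

```python
def best_portmanteau(words):
    """Return the overlap blend whose pre/mid/post lengths best fit a
    1/4 : 1/2 : 1/4 split of the result."""
    best, best_score = None, None
    for a in words:
        for b in words:
            if a == b:
                continue
            # Look for a chunk `mid` that ends `a` and begins `b`,
            # so that a = pre + mid and b = mid + post.
            for k in range(1, min(len(a), len(b))):
                if a[-k:] == b[:k]:
                    pre, mid, post = a[:-k], a[-k:], b[k:]
                    result = pre + mid + post
                    n = len(result)
                    # Penalize deviation from the target proportions.
                    score = (abs(len(pre) - n / 4)
                             + abs(len(mid) - n / 2)
                             + abs(len(post) - n / 4))
                    if best_score is None or score < best_score:
                        best, best_score = result, score
    return best

print(best_portmanteau(["spoon", "fork", "onset"]))  # "spoonset"
```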

Finally, there are several websites that will take a word and help a human make a portmanteau by presenting all words that overlap the beginning or end of that word. In essence these use the same algorithm as the last one, but one word is pulled out as special and the word list is all English words. To play with these, use the following links: http://www.lexiconcept.com/, http://www.portmanteaur.com/, http://www.brands.so/ideas/portmanteau/blend-words-to-invent-new-words.php.

The first approach is the only one capable of generating words given two specific seed words; the other algorithms require a larger list of seed words, as they only return words where the beginning of one word overlaps the end of another. The first approach, unfortunately, is also the most primitive. Therefore I feel it is safe to say that there is more work to be done on this problem.

Plan of action

My plan of action is to use this as a way to brush up on some of my machine learning skills. This means that I do not intend to aim for a heuristic approach like the previous algorithms; instead I want to, as much as possible, follow a traditional machine learning workflow.

First, I will collect a sample representing the problem I want to solve. I will then manually inspect the dataset in an effort to understand the structure of the problem: what kinds of combinations tend to be used, how frequently the new word is re-spelled to represent an expected pronunciation, and so on. Based on this I will try to restructure the problem into a better understood one. My plan is to re-cast the problem as evaluating whether a given new word is a "good" combination of two other words. In this way I can use standard classification / regression algorithms, possibly with custom features, as sketched below. Finally, I will build, test, and compare algorithms using a hold-out set of data. Of course, after this point I will iterate and seek improvements.
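For example, the re-cast problem might look something like this rough sketch. The candidate generation, the toy features, and the `model` object are all placeholder assumptions on my part, not a finished design:

```python
def candidate_blends(word_a, word_b):
    """Yield every prefix-of-A + suffix-of-B combination as a candidate."""
    for i in range(1, len(word_a)):
        for j in range(1, len(word_b)):
            yield word_a[:i] + word_b[j:]

def features(blend, word_a, word_b):
    """Toy feature vector; real features would be chosen after inspecting the dataset."""
    return [len(blend),
            len(blend) / (len(word_a) + len(word_b)),
            int(blend[0] == word_a[0]),
            int(blend[-1] == word_b[-1])]

def best_blend(word_a, word_b, model):
    """Score every candidate with a trained classifier/regressor and keep the best."""
    candidates = list(candidate_blends(word_a, word_b))
    scores = model.predict([features(c, word_a, word_b) for c in candidates])
    return max(zip(scores, candidates))[1]

class DummyModel:
    """Stand-in for a trained model; here it just prefers longer blends."""
    def predict(self, feature_rows):
        return [row[0] for row in feature_rows]

print(best_blend("spoon", "fork", DummyModel()))  # with this toy scorer, the longest candidate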

To do all of this I first need a dataset of portmanteaus. To build one, I have collected the portmanteaus from Wikipedia's list of portmanteaus. I have removed "explanation text" (such as text explaining why words might have been chosen, or references to similar portmanteaus). I have otherwise sought to make minimal changes; for example, if a portmanteau is made from two proper names I have left both the first and last name (where they were listed) even if only the first or last name was used in the blend. This will be the raw portmanteau dataset.

The raw portmanteau dataset is portmanteau.csv. This is a nonstandard csv file, by which I mean that some rows have more or fewer columns than others. This is because some portmanteaus listed had more than two source words, and in an effort not to bias the dataset I have left these in. Each row contains the portmanteau, then each source word in the order listed on Wikipedia. Words that are used first in the portmanteau were frequently listed first.
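Because the rows are ragged, a fixed-column reader will complain; here is a minimal sketch of how the file could be loaded, assuming only the layout described above (portmanteau first, then a variable number of source words). The count at the end is just an illustrative use of the data:

```python
import csv

def load_portmanteaus(path="portmanteau.csv"):
    """Load (portmanteau, [source words]) pairs from the ragged csv file."""
    entries = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            portmanteau, *sources = row
            entries.append((portmanteau, sources))
    return entries

# Example: count how many entries blend more than two source words.
entries = load_portmanteaus()
print(sum(1 for _, sources in entries if len(sources) > 2))
```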

I will probably create a more cleaned-up (and therefore more subjective) dataset for later blog posts, when I explain exactly how I will use the datasets. If anybody knows of a bigger dataset, or an important portmanteau algorithm I didn't list, please let me know by twitter or email. Otherwise, I look forward to playing with this dataset, and hearing what the community does with it.