Dirk Siepmann: Universität-GH Siegen, Fachbereich 3, Adolf-Reichwein-Straße,
D-57068 Siegen, Germany ([email protected])
This article attempts to synthesise recent advances in collocational theory into a
coherent framework for lexicological theory and lexicographic practice. By posing a
number of fundamental questions related to the definition of collocation, it critically
reviews frequency-based, semantic and pragmatic approaches to collocation. It is found,
among other things, that two types of collocation, namely ‘long-distance’ collocation
and collocation between semantic features, have suffered almost total neglect. This leads
to suggestions for a new division of the collocational spectrum and for a revised
definition of ‘collocation’ based on the notions of ‘usage norm’ (Steyer 2000) and
‘holisticity’ (Siepmann 2003). It is argued that this new view of collocation considerably
widens the dictionary maker’s brief, since future lexicography will have to provide a full
account of both structurally simple and structurally complex units, including fixed
expressions of regular syntactic-semantic composition (see Part II of this article, to be
published in the March issue of this journal).
1. Introduction
According to modern science, there is no such thing as ‘independent existence’;
at least since the advent of chaos theory, there has been full recognition that
all forms of life and material phenomena, whether at the micro-level or at the
macro-level, are interdependent. In linguistics, this realization has found its
fittest expression in the idea of linguistic rather than literary ‘intertextuality’,
whereby the meaning of one text and its constituent elements depends on
millions of other texts using similar or identical elements. Textual meaning is
thus created by the interplay of two types of repetition, viz. (a) collocation
(in the largest possible sense, including colligation1 and phraseology) and
(b) cohesion. It turns out that one instance of collocation and the entire
language are mutually illuminating, since the instance is understood in terms of
the whole, and the whole in terms of the instance (cf. Hunston 2001: 31); taking
this a bit further, we might say that not only is each pattern necessary for
comprehending the sum total of similar patterns, but each pattern is
also a miniature version of that sum total, as shown by the fact that the
meaning of individual patterns (e.g. German ‘sonniges Gemüt’ [‘sunny
disposition’ = irrepressible high spirits] vs ‘sonnige Lage’ [sunny location]),
even if shorn of any context, is evident to the native speaker.
This relatively recent view of meaning creation (Hoey 1991, 1998, 2000,
Feilke 1994, 1996) seems much more in keeping with speakers’ intuitive
knowledge about language than was the case in earlier structuralist theories.
The latter tended to assume that expressions such as ‘sonnige Lage’ have
both a compositional, literal meaning and a non-compositional, figurative
meaning (Feilke 1996: 128). In an intertextual or socially-based view of
meaning creation, the compositional meaning is exposed for what it is, namely
an abstraction of the linguist which has no base in the native speaker’s
mental lexicon; the expression ‘sonnige Lage’ is then considered to be a
‘holistic’ sign that is irreducible to the sum of its parts. In a related
development, computational and cognitive linguists have used corpus-linguistic
insights to work out models of language grounded in actual usage rather
than abstract general rules (Chandler 1993, Croft and Cruse 2003, Skousen
1989). In these models word or clause formation is by analogy with existing
exemplars, and it will be seen that such models can also be applied to
This article reviews, one by one, the various defining criteria that have in
the last half century been called upon to define the notion of collocation,
pursuing a dual objective: (a) to show that none of these criteria apply in
all cases, so that we can at best give a prototypical definition of collocation,
and (b) to demonstrate that the problems associated with the definition
of collocation stem from the mechanistic, old-paradigm view of language
embodied in structuralist theories which try to impose theoretical abstractions
on an infinitely complex reality arising from communicative interaction and
the institutional practices such interaction puts in place. This will then allow
us to provide a more secure and more broadly based underpinning for the
treatment of colligation and collocation in lexicography. With the exception
of Steyer (2000), no such model has as yet been proposed.
The subject of collocation has been approached from two main angles:
on one side are the semantically-based approaches (e.g. Benson 1986, Mel’čuk
1998, González-Rey 2002, Hausmann 2003, Grossmann and Tutin 2003) which
assume a particular meaning relationship between the constituents of a
collocation; on the other is the frequency-oriented approach (e.g. Jones and
Sinclair 1974, Sinclair 1991, Sinclair 2004, Kjellmer 1994) which looks at
statistically significant cooccurrences of two or more words. This theoretical
distinction is paralleled by a geographical divide: the semantic approach has its
Collocation, Colligation and Encoding Dictionaries
origins in continental European research into phraseology, while the frequency
approach is firmly rooted in British contextualism. There has until now been
surprisingly little exchange between the two groups, and when the semanticist
Hausmann (2003) claims to have won the war over collocation, one wonders if
that war has ever been fought.
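As a rough illustration of what the frequency-oriented approach measures, the following sketch scores co-occurring word pairs by pointwise mutual information (PMI) within a fixed window. PMI is only one of several common association measures and is used here for brevity; it is not claimed to be the statistic employed by Jones and Sinclair or by the present article. The function name, the window size and the frequency threshold are illustrative assumptions.

```python
import math
from collections import Counter


def cooccurrence_pmi(tokens, window=4, min_count=2):
    """Score unordered word pairs by PMI within a right-hand window.

    Hypothetical helper for illustration; `window` and `min_count`
    are arbitrary choices, not values taken from the article.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, node in enumerate(tokens):
        # Look only to the right so that each co-occurrence is counted once.
        for collocate in tokens[i + 1 : i + 1 + window]:
            if collocate != node:
                pair_freq[tuple(sorted((node, collocate)))] += 1

    n = len(tokens)
    scores = {}
    for (a, b), f_ab in pair_freq.items():
        if f_ab < min_count:
            continue  # ignore pairs that are too rare to be informative
        # PMI = log2( P(a, b) / (P(a) * P(b)) ) with relative frequencies.
        scores[(a, b)] = math.log2((f_ab * n) / (word_freq[a] * word_freq[b]))
    return scores


sample = ("he took a decision and she took a decision "
          "but they made a mistake").split()
pmi_scores = cooccurrence_pmi(sample)
```

A score above zero indicates that the pair co-occurs more often than chance would predict; as thesis (1) below points out, such a list is raw material only and says nothing about how the pairs are to be structured.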
A third, more recent approach to phrasemes and collocations (Feilke 1996,
2003) might be termed ‘pragmatic’, since it claims that the structural
irregularities and non-compositionality underlying such expressions are
diachronically and functionally subordinate to pragmatic regularities determining the relationship between the situational context and linguistic forms.
In this view, collocation can best be explained via recourse to contextualisation
theory (Fillmore 1976).
In what follows, I shall argue that there is no reason to resort to the military
metaphor, let alone go to war on matters of collocation. It is much wiser to
unify the three approaches. Tersely stated, I shall argue the following theses:
(1) Only the frequency-based approach can provide a heuristic for discovering
the entire class of co-occurrences; in a way, it is safe from refutation, but
empty – it gives us all the raw material, but tells us nothing about how this
material came to be or how it is to be structured; it has also resulted in
lexicographic products of doubtful value, such as Kjellmer (1994) and
Sinclair (1995) (cf. Hausmann 2003: 319–320, Siepmann 1998).
(2) By contrast, the semantically-based approach is fragmentary – it cannot
account for all possible cases. It would nevertheless seem absurd to
abandon such an intuitively appealing approach at the first appearance of a
counterexample, since it has given rise to reliable collocational dictionaries
such as Langenscheidts Kontextwörterbuch Französisch-Deutsch.
(3) Likewise, as I shall explain below, a purely pragmatic approach relying on
the extralinguistic context cannot explain a large number of co-occurrences
operating at the level of semantic features.
(4) It follows from this that the debate between the various approaches is a
more/less rather than a yes/no issue. What is needed is an extension of the
semantically-based approach that will take account of strings of regular
syntactic composition which form a sense unit with a relatively stable
meaning. ‘Lexical bundles’ (Biber et al. 1999) such as je sais que c’est or it’s
been will not be included among the class of collocations (cf. Siepmann
2003). Although such sequences may perform similar or identical functions
across a range of texts, they have no meaning ‘by themselves’. In sharp
contrast, there are good theoretical and practical reasons for subsuming
under the notion of collocation such colligational patterns as regarde où tu
vas, dans les colonnes de (+ name of newspaper or magazine) or si elle est
prise à temps (referring to an illness), which have so far been regarded as
free sequences of words subject only to general rules of syntax and semantics.
For greater expository convenience, the various questions raised by the
discussion of the above theses will be broken down under five separate heads:
(1) How many elements make a collocation?
(2) What elements make a collocation?
(3) Are collocations arbitrary?
(4) Can we distinguish between collocations and phraseology on the one hand,
and collocations and free combinations on the other?
(5) Are collocations monosemous and monoreferential? Are there synonymic
This will lead to a division of the collocational spectrum into four major
categories, all of which have their role to play in the making of dictionaries,
especially those aimed at the non-native speaker.
My theoretical arguments will be leavened with a large number of concrete
examples encountered during the ongoing compilation of three unabridged
bilingual thesauri intended mainly for non-native speakers of English, French
and German (the ‘Bilexicon’ project). All of these examples have been drawn
from the following authentic sources (for a detailed account of corpus
construction, see Siepmann 2005):
(a) electronic editions of wide-circulation quality newspapers and news
magazines (The Times, The Guardian, The Economist, Le Monde,
Le Monde Diplomatique, Süddeutsche Zeitung, Frankfurter Rundschau,
Der Spiegel );
(b) a large corpus of academic texts produced from reviews, journal articles,
doctoral theses and portions of books;
(c) 50-million-word corpora of fiction and fan fiction freely available on
the Internet;
(d) a 100-million word corpus of the language of motoring based mainly on
Internet sources.
Table 1 gives a breakdown of the sources used by corpus type, content, size,
baseline year and analysis software.

Table 1: Corpora used in this study

Corpus (Abbreviation)                        Word Count    Content
Corpus of Academic English (CAE)                           reviews, journal articles, doctoral theses and portions of books
Corpus of Academic French (CAF)                            reviews, journal articles, doctoral theses and portions of books
Corpus of Academic German (CAG)                            reviews, journal articles, doctoral theses and portions of books
Corpus of English Fiction (FE)               50 million
Corpus of French Fiction (FF)                50 million
Corpus of German Fiction (FG)                50 million
Corpus of English Motoring (CME)             100 million   Internet forums and chatrooms, electronic magazines, transport
                                                           sites, encyclopaedia and dictionary articles
British Newspapers and News Magazines (NE)                 Issues of The Times, The Guardian and The Economist,
                                                           published in London
French Newspapers and News Magazines         100 million   Issues of Le Monde and Le Monde diplomatique, published in Paris
German Newspapers and News Magazines (NG)    100 million   Issues of Süddeutsche Zeitung, Frankfurter Rundschau and Der
                                                           Spiegel, published respectively in Stuttgart, Frankfurt and

Further cell entries (row assignment unclear): content ‘reviews, journal articles and portions of books from CAE’, ‘from CAF’ and ‘from CAG’; word counts of 30 million (three times) and 100 million (once); corpus type ‘full-text and samples’ (four times) and ‘full-text’ (NE); baseline year 1980 (less than 5% of texts predate 1980) (three times).

2. How many elements make a collocation?
It is accepted wisdom among European researchers that collocations are
binary units, and this is probably true for the majority of the class. Thus, the
most common type of collocation is the combination of a noun with a verb,
and there are hundreds of thousands of examples which confirm this point
of view (e.g. take a step, launch an appeal ). Mel’čuk (most recently 2003)
argues that the constituents of such collocations tend to be linked by a
standard lexical function, such as Magn (rely on [Magn] = heavily, beautiful
[Magn] = drop-dead2). Furthermore, as Hausmann (2003), Siepmann (2003,
2004) and Schafroth (2003) have argued, many three-element collocations can
be shown to be reducible to a binary structure:
(1) allgemeine Gültigkeit haben → (allgemein + Gültigkeit) + haben
hohes Ansehen genießen → (hohes + Ansehen) + genießen
ulcère gastrique bénin → (ulcère + gastrique) + bénin
prendre une bouffée d’air → (air + bouffée) + prendre
joli petit cul → (cul + petit) + joli
not wildly original → (original + wildly) + not
The same goes for combinations of multi-word idioms and other items;
consider for example his plan came to fruition or their disagreement brought
them to blows.
Some of these three-element collocations have a higher frequency of
occurrence than their constituents, which suggests that they are learned and
reproduced as wholes rather than recombined each time, but this presents no
serious challenge to the view of collocation as a binary phenomenon. More
threatening to this view are irreducible three-element collocations such as the following:
(2) the car holds the road well (?holds the road [may be used of tyres]) →
la voiture tient bien la route/tient la route (meaning either ‘holds the
road well’ or ‘stays on course’) → der Wagen hat eine gute Straßenlage
(*hat eine Straßenlage)
the car has too wide a turning circle → la voiture braque mal →
der Wagen hat einen zu großen Wendekreis
In two of the languages under consideration the three-element collocation
cannot be broken down into what seem to be its two major constituents. Thus,
while it is perfectly possible to single out gute Straßenlage as one constituent
of the German collocation, the word combination *eine Straßenlage haben
is inadmissible in German.3 It is also pertinent to note that the English
collocation does not appear to have a negative counterpart (a search for
hold the road poorly/badly on Google yields no results), whereas the opposite
is true of the French collocation, where the adverb is optional and a negative
wording appears admissible (la voiture tient mal la route). Other examples of
this type include:
(3) avoir un geste déplacé (FF) → (?)avoir un geste
recevoir un accueil chaleureux → (?)recevoir un accueil
take a harder line (against) (NE) → (?)take a line (against)
shall I break this note into something smaller (NE)
den Kasten sauber halten (NG) (football)
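The contrast drawn above between reducible and irreducible three-element collocations can be sketched as a small data structure. This is a minimal illustration only: representing a collocation as nested tuples, and the helper names elements and is_binary, are assumptions made for the example and not part of any of the theories discussed.

```python
def elements(unit):
    """Return the lexical elements of a (possibly nested) collocation."""
    if isinstance(unit, str):
        return [unit]
    return [word for part in unit for word in elements(part)]


def is_binary(unit):
    """True if every level of the structure combines exactly two constituents."""
    if isinstance(unit, str):
        return True
    return len(unit) == 2 and all(is_binary(part) for part in unit)


# Reducible: (allgemein + Gültigkeit) + haben, as in example (1).
reducible = (("allgemein", "Gültigkeit"), "haben")

# Irreducible: 'eine gute Straßenlage haben' cannot be bracketed into two
# admissible pairs, so it is modelled here as a flat triple (an assumption).
irreducible = ("gut", "Straßenlage", "haben")
```

Both structures contain three lexical elements, but only the first passes the binary test, which is the distinction a dictionary entry would need to record.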
Once we have grasped this concept of the three-element collocation, it is easy
to see that many binary word combinations which have traditionally been
regarded as free (such as accepter des pièces) are in fact embedded in larger
structures of a collocational nature, such as the following three-element
collocations with a non-human object:
(4) the pay and display machine (parking meter, etc.) only takes twenty cent
coins → l’horodateur (le parcmètre, etc.) n’accepte que des pièces de
20 centimes → der Parkscheinautomat (die Parkuhr usw.) nimmt nur
20 Cent-Stücke
the battle (war, etc.) claimed many casualties → la bataille (la guerre, etc.)
a fait beaucoup de victimes → die Schlacht (der Krieg, etc.) hat viele
Opfer gefordert
cette expérience a marqué ma vie → dieses Erlebnis hat mein Leben
The list of such examples could be lengthened. With collocations such as
hold the road (subject: tyre), tomber à gros flocons (subject: neige), emporter la
conviction (subject: argument) or eine Kurve machen (subject: Straße), it would
clearly seem difficult to identify a standard lexical function (in the sense of
Mel’čuk) that can provide a systematic link between the verb and the noun;
this is because the entire collocation is semantically dependent on a specific
The English translation of the German collocation eine Kurve machen,
where the prepositional phrase in the road is a standard postmodifier of bend
(Kurve = bend in the road or bend), shows how closely the two concepts4 are
(5) die Straße macht hier eine Kurve (NG) → there’s a bend in the road
here (CME)
Likewise, current theorizing on collocation does not make allowances for
the relationship between collocation and verb complementation, or ‘valency’.
Thus, (auto)route + filer (literally: ‘road’ + ‘rush’) may well be considered a
collocation in Hausmann’s and Mel’čuk’s theories, but this disregards the
fact that the collocation itself requires a particular verb pattern including a
locative element (l’autoroute file vers la vallée, à gauche, etc.); in other words,
it cannot be used with all the verb patterns entered by filer (cf. *l’autoroute
file, *l’autoroute file à toute allure). A related case is that of the German
collocations ein Kind schenken and ein Kind machen, where the former can only
be used with a female subject and the latter only with a male subject. In other
words, collocation and verb complementation are intimately related, since
many noun-verb collocations require a specific distribution of semantic roles.
Clearly, then, the two-word combination (auto)route + filer cannot possibly
be viewed as a fully-fledged collocation.
Evidence is also gathering of three-word collocations one of whose
constituents is delexicalised, and hence redundant. Kenny (2003: 343) cites
the German phrase die Augen weit aufreißen, where the semantic feature
‘wide open’ (= weit) is included in the meaning of the verb aufreißen. Such
delexicalisation has long been observed in so-called ‘support verb constructions’ (take a decision), but it seems to be just as common in other types
of collocation.
The evidence of such examples points to the conclusion that multi-word
collocations cannot always be split up into two basic constituents, and that
collocations consisting of three items or syntactic ‘slots’ are in fact quite
common. This is particularly true of collocations involving neither a human
subject nor a human object, such as expérience + marquer + vie. Strictly
speaking, then, we would not be entitled to define collocations as binary
units, as do Hausmann and Mel’čuk, unless we are willing to adopt a very
broad prototypical definition.
From the vantage point of practical lexicography, then, it is preferable, and
to some extent already established practice, to record tripartite lexical units
even where binary units could be justified (e.g. schmal geschnitten → die Hose
ist schmal geschnitten [NG]), since dictionary users could be led astray if such
information were missing.
3. What elements make a collocation?
This section starts by discussing the distinction made by European collocation
scholars between semantically dependent ‘collocates’ and semantically
autonomous ‘bases’, or nodes. It goes on to show that this distinction, as
well as the related assumption of directionality in collocational attraction,
is not applicable in a great many cases, and that a large number of word
combinations, notably long-distance collocations, operate at the level of
semantic features rather than lexemes. This leads to suggestions for a new
typology of collocations.
3.1 The autonomous/dependent distinction
At the heart of collocational theory is the assumption that the constituents of
the collocation differ in what Hausmann (1999: 122ff.) calls their ‘semiotactic’
status: an ‘Autosemantikon’, or semantically autonomous lexeme such as
decision or disaster functions as the base, which co-occurs with an arbitrarily
selected, semantically dependent ‘collocate’ (‘Synsemantikon’) such as take
or unmitigated. Intimately connected with this is the idea that the collocate
takes on a meaning peculiar to the collocation. Diagrammatically this can
be represented as in Table 2.

Table 2: Hausmann’s distinction between free word combinations and collocations

semantically autonomous +           semantically autonomous +
semantically autonomous =           semantically dependent =
free word combination               collocation

he likes money                      money + withdraw
look at the sea!                    decision + take
he prefers fish to meat             clouds + scudding

This definition is echoed in meaning-text theory (cf. Mel’čuk 1998), albeit
in slightly different terms:
‘A collocation AB of language L is a semantic phraseme of L such that
its signified ‘‘X’’ is constructed out of the signified of one of its two
constituent lexemes – say, of A – and a signified ‘‘C’’ [‘‘X’’ = ‘‘A+C’’] such
that the lexeme B expresses ‘‘C’’ only contingent on A.’ (Mel’čuk 1998: 30)
The dependency relation between B and C covers four types of collocations
(cf. Mel’čuk 1998: 30–31):
(a) ‘C’ ≠ ‘B’, that is, B does not have the corresponding meaning in the
lexicon, and
a) ‘C’ is empty, that is, the lexeme B is a delexical support verb
selected by A [e.g. give ( a vacuum, take a decision, porter un jugement)
or
b) ‘C’ is not empty but the lexeme B expresses ‘C’ only in combination
with A [e.g. black coffee, bière bien frappée]
(b) ‘C’ = ‘B’, that is, B has the corresponding meaning in the lexicon, and
a) ‘B’ cannot be replaced by any synonym when it appears in conjunction
with A [e.g. strong coffee rather than *powerful coffee, heavy smoker] or
b) ‘B’ includes the meaning ‘A’ (e.g. rancid butter, artesian well )
3.2 Criticism of the autonomous/dependent distinction
There are four main problems inherent in the autonomous/dependent
distinction; let us expound these in greater detail.
3.2.1 Semantic autonomy vs semantic dependency. The dividing line between
semantically autonomous and semantically dependent words is hazy and not
clearly defined (cf. Brauße 1992). For some linguists, it runs parallel to the
boundary between word classes, with items able to function as sentence
constituents (nouns, verbs and adjectives) on one side, and words with a
morphological or syntactic function (articles, prepositions, etc.) on the other.
Other scholars assume that the boundary cuts across different parts of speech;
according to them, a noun such as scholar is semantically autonomous, whilst
a noun like member is semantically dependent on its linguistic environment
(e.g. party member, family member). Yet others (e.g. Lutzeier 1981) go so far as
to claim that there are no criteria at all allowing us to differentiate between
words that have lexical content and those that do not. Indeed, words that
have been intuited as semantically dependent by collocation scholars may, on
inspection, turn out to be semantically autonomous (see 3.2.3 below).
3.2.2 Collocations of regular syntactic-semantic composition. As seen in Section 2, the
collocational character of seemingly free combinations such as accepter des
pièces (‘take coins’) only comes to light if the wider context is taken into
consideration. Similar considerations hold true for other types of combinations
involving items with the same or a similar semiotactic status; here are a few
typical examples:
(6) I’ve got grease all over my shirt. (FE)
regarde où tu vas! (FF) (= pass auf, wo du hintrittst; watch where you are
going/stepping!, watch where you put your feet!)
I didn’t bring the car (FE)
look at the time! (FE)
From the perspective of structuralist linguistics, such sentences would be
considered composite units whose meaning is the sum total of the literal
meaning of their constituents; in other words, they would be viewed as falling
within the scope of the open-choice principle. On inspection, however, they
turn out to be semi-phrasemes (i.e. collocations). Three main reasons can be
advanced for this: firstly, they are clearly not idioms, since they are
immediately comprehensible to anyone who is familiar with their basic
constituents; thus, the first example can be analysed as follows: [subject] + have
got + [object] + [locative]. Secondly, it is evident that the ‘literal’ meaning of
the first sentence could only be construed as referring to a shirt every square
inch of which was entirely smeared with grease, but, of course, this is not what
the sentence means to a native speaker, who will take it to mean that only
part of the shirt’s surface has been stained.5 Thirdly, the same meaning could
be expressed quite differently in another language such as German: ich habe
mein Hemd mit Fett beschmiert/mein Hemd ist voller Fett/mein Hemd ist ganz
fettig. What we are dealing with, then, is an instance of a collocational
framework (Renouf and Sinclair 1991) or, more precisely, a type of colligation,
that is, a recurrent grammatical pattern that is lexically restrained: have
got + [liquid, crumbs, etc.] + on/all over [item of clothing, body, body part].
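To make the notion of a lexically restrained grammatical pattern more concrete, the colligation just described can be approximated by a regular expression whose open slots are checked against small word lists. This is a minimal sketch under stated assumptions: the two word lists, the set of determiners and the function name match_colligation are invented for illustration, and a real lexicon would have to be derived from corpus evidence.

```python
import re

# Hypothetical, deliberately tiny slot fillers (assumptions for illustration).
SUBSTANCES = {"grease", "mud", "crumbs", "paint", "blood"}
SURFACES = {"shirt", "hands", "face", "trousers", "jacket"}

# have got + [substance] + on/all over + determiner + [clothing or body part]
PATTERN = re.compile(
    r"\b(?:have|has|had|[’']ve|[’']s)\s+got\s+(?P<substance>\w+)\s+"
    r"(?P<prep>all over|on)\s+(?:my|your|his|her|our|their|the)\s+"
    r"(?P<surface>\w+)",
    re.IGNORECASE,
)


def match_colligation(sentence):
    """Return (substance, preposition, surface) if the sentence fits the pattern."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    substance = m["substance"].lower()
    surface = m["surface"].lower()
    # The grammatical frame alone is not enough: both slots are lexically restricted.
    if substance in SUBSTANCES and surface in SURFACES:
        return substance, m["prep"].lower(), surface
    return None
```

A sentence such as I’ve got grease all over my shirt fits both the frame and the lexical restrictions, whereas I’ve got time on my hands fits the frame only and is rejected.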

Table 3: Translational equivalences at different levels

Seemingly free combination
1. den Rock enger machen                               take in the skirt
2. on a clear motorway / sur (une)                     auf freier Strecke (alongside: auf
   autoroute dégagée                                   einer freien Autobahn)
3. einen Unfall nach dem anderen bauen                 collectionner les accidents
   [‘have one accident after another’]
                                                       colligation 1 collocation
4. his attempt on the (NP: mountain / record)          sein Versuch, [Berg] zu bezwingen /
                                                       [Rekord] zu brechen
free combination of morphemes
5. Freizeit-(N), Hobby-(N)                             N à ses heures [poète, peintre,
   [Freizeitmaler, Hobbykoch]                          cuisinier à ses heures]

Similar observations can be made for the second example, where the
interlingual equivalents clearly show that the phrase is idiomatically
constrained. The standard German translation uses two entirely different
and more specific verbs (regarder → aufpassen (= ‘pay attention’), aller →
hintreten (= ‘step [somewhere]’)).
This kind of finding links up with Hausmann’s (1997) claim that ‘everything
in language is idiomatic’ and with Hunston’s (2001) investigation into
colligation, which shows that even grammatical strings of a fairly random
nature may carry a particular semantic prosody. Thus, the sequence NP
may not be a(n) NP is used as a signal of concession commonly followed by
a contrasting clause introduced by but (Hunston 2001: 24).
This is also obvious from such interlingual correspondences as those given
in Table 3.
These examples show that translational equivalence can usually be achieved
at the level of ‘constructions’ (in the sense of Fillmore). Probably the most
frequent case is the rendition of one construction type by the same type in
another language (e.g. espionner, c’est attendre; to spy is to wait; spionieren
heißt warten); it is by no means uncommon, however, to find one construction
type translated by another. Thus, equivalences 1–3 of Table 3 can be accounted
for in terms of a shift from an English complex and schematic construction,
whose rules of semantic composition are fairly general, to a German complex
and substantive construction, whose rules of semantic composition are more
specialized (for a listing of construction types, see Table 4). The French phrase
sur (une) autoroute dégagée (example 2), where the indefinite article is
optional, shows how increased use may result in greater fixity and brevity,
in other words, in ‘phraseologicization’ (cf. German Porsche fahren alongside
einen Porsche fahren, or French sur chaussée mouillée alongside sur une
chaussée mouillée). Equivalence 4 is remarkable as demonstrating that mainly
schematic constructions in one language may correspond to combinations of
schematic and substantive constructions in another. Even stronger support
for the notion of different construction types comes from such equivalences
as 5, where a complex but bound construction in German corresponds to a
complex and schematic construction in French.

Table 4: The syntax-lexicon continuum (Croft/Cruse 2003: 255)

Construction type            Traditional name           Examples
Complex and schematic        syntax                     [SBJ be-TNS VERB-en by OBL]
Complex, substantive verb    subcategorization frame    [SBJ consume OBJ]
Complex and substantive      idiom                      [kick-TNS the bucket]
Complex but bound            morphology                 [NOUN-s], [VERB-TNS]
Atomic and schematic         syntactic category         [DEM], [ADJ]
Atomic and substantive       word/lexicon               [this], [green]

3.2.3 Contingent meaning. The autonomous/dependent distinction presupposes
that, in the words of Mel’čuk (1998: 31), ‘the problem of the lexicographic
description of lexical units is an independent problem that has to be
solved . . . prior to any discussion of phraseology’. Thus, Mel’čuk seems to
assume that the meaning of the adjective rancid, which occurs in the adjective-noun
collocation rancid butter, can only be defined with reference to butter. This
assumption is, however, belied by even the briefest corpus enquiry; it is found
that the adjective itself has a wide combinatorial range, which divides into
two separate meaning groups, viz. (a) food, butter, bacon, milk, meat, cream,
fat, grease, flour, wheat, oil, chocolate; smell, odour, aroma; socks, sweat; water
and (b) atmosphere, sentiment, academics, affair, show, humour, prune. This
shows that the adjective has at least two metonymically related meanings of its
own which might be glossed respectively as ‘(of food) having a rank smell or
taste as the result of decomposition or chemical change’ and ‘(of people or
things) having vile, revolting, obnoxious qualities’; these two meanings would
have to be recorded in the dictionary. Similar analyses have been proposed for
other seemingly ‘unique’ collocations of Mel’čuk’s type 2(b), such as schütteres
Haar (‘thin hair’; Steyer 2003: 107), with the same results. Another reason why
lexical entries cannot simply be presupposed as given is that some nouns simply
do not have any meaning in isolation. One example cited by Feilke is German
Lage (‘situation’), and the same goes for its standard English and French
equivalents. The French collocation situation + faire (‘la situation faite aux
protestants’) could therefore be said to consist of two semantically empty
items, and yet the combination of the two yields a meaningful collocation.
3.2.4 Collocation of semantically autonomous items. Even if we assume that a
sharp line can be drawn between content words and ‘delexical’ words, there
remain numerous examples of collocations made up of two semantically
autonomous items (printed in bold below), some of which have interlingual
(7) an empty parking space (or: a vacant parking space) → un emplacement
libre → ein freier Parkplatz (cf. ein leerer Parkplatz = an empty/deserted
car park)
a quiet drink (hypallage) → (cf. prendre un verre en toute tranquillité) →
(cf. the idiom: in Ruhe einen trinken)
(have) cold feet (in the non-figurative sense) → (avoir) les pieds gelés /
glacés (cf. also: avoir froid aux pieds) → kalte Füße (haben)
to stop for petrol (for coffee, for a pee) → (free combination: s’arrêter pour
faire le plein) → (free combination: anhalten um zu tanken)
to tell a joke → faire une blague → einen Witz erzählen
The first example shows that English distinguishes between ‘free’ (= free
of charge) and ‘empty’ (= unoccupied) parking spaces. The second example
illustrates a case of ‘frozen’ hypallage: the semantic features of the adjective
quiet are incompatible with the noun drink; it is the situational context in which
the drink is taken that would normally be described as ‘quiet’.6 The third
example demonstrates that French cannot use the adjective froid attributively
when reference is made to parts of the body. The fourth example illustrates
equivalences between seemingly free combinations in German and fixed
expressions in English. Although there is a small number of variants in
evidence, we cannot assume compositionality here. The fifth example is
interesting in that there are synonymic collocations where the verb would be
regarded as semantically contingent on the noun: crack/make a joke.
3.3 Collocations of verbs with locative prepositional phrases
Once we have realised that there are too many exceptions to the definition of
collocation as a combination of items with a distinct semiotactic status, it
becomes evident that a large number of other lexical units should be classified
as collocations. A clear example is afforded by combinations of a locative
prepositional phrase with a verb:
(8) to hide behind the curtain (cf. also ‘to be a curtain twitcher’) → guetter
derrière le rideau → hinter dem Vorhang stehen (sich hinter dem Vorhang
to wipe out on the bend/to go out of control on the bend/to be unable
to stay on its own side of the road → se déporter dans le virage → aus der
Kurve getragen werden
There are two main reasons for including such items among the class of
collocations. One is that they are both cognitively and semantically similar
to noun + verb collocations of the type trim + hedge, serrer + vis or Hörer +
abnehmen. In the case of the latter, the verbs (trim, serrer, abnehmen) describe
an action that is typically performed with the object in question; similarly,
verbs such as guetter and se déporter designate actions that typically occur in
particular places: nosy neighbours make a habit of hiding behind the curtain,
and speeding drivers run the risk of losing control of their vehicles on a bend.
The second reason is that these word combinations tend to be interlingually
unpredictable (cf. the above examples), making them prime sources of
difficulty for second-language learners.
3.4 Directionality
A related problem is the assumption of directionality (Hausmann 1979)
or of a hierarchical relationship between the constituents of the collocation
(González-Rey 2002), whereby the selection of the collocate is contingent on
the prior selection of the base. While this is more or less obvious with items
such as table + lay/set or money + withdraw, we have already seen above that
examples such as road + hold cast serious doubt upon the validity of the theory.
Hartenstein (1996: 95) cites counterexamples of the type hère + pauvre (‘poor
wretch’) where the noun cannot be viewed as semantically autonomous since
it has no referent in present-day French. In similar vein, Scherfer (2002) notes
that even such textbook examples of collocational theory as célibataire +
endurci (‘confirmed bachelor’) may be viewed as bidirectional, since the
adjective endurci combines with any noun carrying the semantic feature [+ figé
dans son comportement]: criminel, catholique, Parisien, etc.; it is monosemous,
semantically autonomous and just as clearly defined as the noun célibataire.
Similar considerations hold for adjectives such as crowded or busy in
combination with nouns like street, road or square. Another example of this
is the French adjective sauf, as witness the concordance given in Table 5
(cf. Siepmann 2003).
Table 5: sauf/sauve + honneur, morale, etc.
être assuré, c’est que son univers est  sauf, et qu’en dépit de bien des zig
mique du pays. Le consensus social est  sauf, les exportations allemandes ne
 euve. L’honneur des Bafana Bafana est  sauf, leurs séances d’entraînement a
    int. Mine de rien, l’honneur a été  sauf, on a terminé septième sur neuf
 silence. Le conservatisme ambiant est  sauf. Dominique Lecourt évoque en
 ité fait sa force. Mais la morale est  sauve ; le bon sens aussi : ’’ Vivre
   le conseil de guerre. La morale est  sauve : les innocents seront punis e
udrier incarne le péché. La morale est  sauve puisque l’auteur châtie son hé
  (XO de préférence). La tradition est  sauve ! Pour l’amateur de Oolong . . . P
    CERTES, toutes les apparences sont  sauves, et on peut mettre au crédit
sensibles. Certes, les apparences sont  sauves, et le parti sorti vainqueur
i les apparences de la démocratie sont  sauves, la guérilla est quand même
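A keyword-in-context display of the kind shown in Table 5 can be generated mechanically from tokenised text. The Python sketch below is purely illustrative: the function name kwic, the punctuation-stripping rule and the context width are assumptions made for the example, not the procedure by which the table was compiled.

```python
def kwic(tokens, node_forms, width=4):
    """Return (left context, node, right context) for each hit of a node form."""
    hits = []
    for i, token in enumerate(tokens):
        # Compare on a lower-cased form with surrounding punctuation removed.
        if token.strip(".,;:!?").lower() in node_forms:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# Fragment taken from the fourth concordance line of Table 5.
sample = "Mine de rien, l’honneur a été sauf, on a terminé septième sur neuf"
lines = kwic(sample.split(), {"sauf", "sauve", "saufs", "sauves"})
for left, node, right in lines:
    # Right-align the left context so that the node forms line up.
    print(f"{left:>30}  {node}  {right}")
```

Passing all inflected forms of the adjective as node forms keeps the adjective constant while the nouns to its left vary, which is the perspective the table is meant to bring out.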
The syntagm we are dealing with here can be formalised as NP
(abstract) + être + sauf. From a directional point of view, the adjectival
collocate sauf would only take on its full meaning through the presence of
the semantically independent noun phrase (e.g. l’honneur des Bafana Bafana)
(cf. Hausmann 1979: 191–192). In the present case, however, this argument
does not hold water. The adjective sauf is as sharply defined as honneur,
apparence, tradition and morale, and it is the adjective that is the invariable
factor in the equation. This becomes even more apparent if we look at English
or German translations of the phrase, which use the clearly delimited verbs
keep up/save and wahren/retten respectively:
(9) les apparences sont sauves → appearances are kept up → der Schein ist
la république était sauve → the republic had been saved → die Republik
war gerettet
3.5 Collocation between semantic features
Taking this one step further, I would like to suggest that dependencies exist
not merely between lexical units, but also between semantic features. Consider
the examples in Table 6.
Table 6: Typical linguistic environments of the French verb mordre (sur)
mord (sur) (1)
subject (semantic field: vehicle): un car; un bus; une voiture
prepositional object (semantic field: part of the road): le côté; la ligne
blanche; la voie opposée
mord (sur) (2)
subject (semantic field: region): trois villages dont le territoire; la région
urbaine de Lyon; sa bordure méridionale
prepositional object (semantic field: region): les départements de la
Loire, de l’Ain et de; le continent africain
As can be seen from these examples, the French lexical units (in the sense
defined by Cruse (1986)) mordre sur (1) (‘veer off course onto/into’) and mordre
sur (2) (‘cut into’, ‘overlap with’) impose severe lexical constraints on the
choice of subject and (prepositional) object: mordre sur (1) takes a subject
designating a vehicle and an object designating a part of the road, while mordre
sur (2) requires both the subject and object slots to be filled by items denoting
areas (mainly geographical areas or parts of the body). The question
then arises whether the relationship between subject and object can be
best captured in terms of selectional restrictions inherent in the verb or in
terms of collocational restrictions operating across the entire phrase
(verb + two nouns).
To resolve this question, we may turn to Cruse’s (1986: 278–279) distinction
between selectional and collocational restrictions. Cruse defines selectional
restrictions as being logically necessary: according to him, it is logically
necessary for the subject of the verb die to carry the semantic traits ‘organic’,
‘alive’ and ‘mortal’. It is different with kick the bucket, which, although
identical in meaning to die, arbitrarily requires a human rather than an animal
subject (*the horse kicked the bucket vs the horse died). Following Cruse, we
would be entitled to consider the above example as an instance of collocational
rather than selectional restriction. Firstly, there are no logical constraints on
the subjects of mordre sur (1) and (2), whose meaning is simply glossed as
‘empiéter sur’ (‘overlap into’, ‘eat into’) in the Trésor de la Langue Française;
indeed, mordre sur occurs with a wide range of subjects and objects in a more
general sense:
(10) les luttes politiques, religieuses et morales, les activités de parti, l’agitation
électorale, le fait que les associations croissent de façon excessive, tout
ceci . . . mord sur le temps de détente (‘all this . . . takes up a lot of our
spare time’)
je ne voudrais pas mordre sur le temps des questions (heard in a lecture)
(‘I don’t want to take up any of the time reserved for questions’)
plus nous vivons dans les signes, et moins les choses mordent sur nous
(‘less things will affect us’)
sans jamais leur (aux lois, D.S.) permettre de mordre sur son esprit (‘never
allowing them to affect one’s mental state’)
le nazisme a mordu sur une large tranche du prolétariat (‘many working-class
people were drawn to Nazi ideology’)
une abstraction qui mord sur le réel (‘an abstraction which is close to
(all examples except the second from NF)
Secondly, there is a mutual dependency between the subject noun phrase
and the object noun phrase in that (e.g.) a subject noun phrase denoting
a vehicle will entail an object noun phrase designating a part of a road,
and vice versa. We are thus dealing with collocation between certain semantic
properties rather than between specific lexical items. Again, as with the
example of autoroute + filer + locative discussed above, we have a three-slot
collocation mixing collocational attraction and valency: vehicle + mordre
(sur) + locative (part of a road).7 Valency theory does not make allowances
for collocational constraints of such a specific nature, as it posits only three
levels of semantic restrictions, the ‘highest’ of which is selectional restrictions
of the type [+ human] (cf. Blank 2001: 238). Collocation thus turns out
to have a paradigmatic as well as a syntagmatic dimension, with an entire
semantic set (body part, region) – rather than a clearly delimited lexical set
(tousled + 1. hair 2. mane) – sharing the same syntagmatic environment.9
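The idea that the dependency holds between semantic features rather than between individual lexemes can be made concrete with a toy lookup. The following Python sketch is only an illustration under stated assumptions: the feature labels, the miniature lexicon (built from nouns in Table 6) and the function name reading_for are all hypothetical, and no claim is made that a lexicon should actually be organised this way.

```python
# Toy lexicon: nouns from Table 6 tagged with an assumed semantic feature.
FEATURES = {
    "car": "vehicle",
    "bus": "vehicle",
    "voiture": "vehicle",
    "côté": "part_of_road",
    "ligne blanche": "part_of_road",
    "voie opposée": "part_of_road",
    "région urbaine de Lyon": "region",
    "continent africain": "region",
}

# Feature pairs (subject, object) licensed by each reading of mordre sur.
READINGS = {
    ("vehicle", "part_of_road"): "mordre sur (1)",
    ("region", "region"): "mordre sur (2)",
}

def reading_for(subject, obj):
    """Return the reading whose feature pair matches, or None if none does."""
    pair = (FEATURES.get(subject), FEATURES.get(obj))
    return READINGS.get(pair)

print(reading_for("voiture", "ligne blanche"))       # mordre sur (1)
print(reading_for("voiture", "continent africain"))  # None
```

Because the lookup is keyed on the pair of features, neither slot can be filled independently of the other, which mirrors the mutual dependency between subject and object noun phrase described above.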
The case for collocation between semantic features is strengthened further
when we look at adjectival collocations. A fine example is provided by
cooccurrences of the adverb beautifully with participial adjectives such as
carved, draped, drawn, restored, etc. The verbs on which these participial
adjectives are based share a common semantic feature in describing artwork
or craftwork. Thus, there is a lexical dependency between a specific semantic
feature and a lexeme.10
The list of such examples could be lengthened. To take but one more case,
the adjective bad and the adverb badly co-occur significantly with a semantic
feature which can be glossed as ‘physical imperfection’; thus, we have:
(11) I never had a bad chest
he’s had a bad concussion
Never had a bad cough, not even a sniffle.
He had a bad heart. Hole in the left ventricle.
He stuttered badly. (all examples from FE)
Note that a distinction could be made between two types of collocation here,
viz. (a) words which share the semantic feature ‘bad’ (concussion, cough, stutter,
limp) and (b) words which require the adjective to add the notion of ‘badness’
(chest, heart).
It is important to reiterate that many such collocations between semantic
features and lexemes are bidirectional. With a collocation such as beautifully
carved it is perfectly conceivable that speakers begin by encoding the type of
craftwork involved, but it is equally likely that they are awe-struck by the sheer
beauty of a painting or other work of art, and the first thing that comes to their
minds is an adverbial expression of the concept of beauty. This latter
hypothesis is also borne out by the high frequency of the unspecific collocation
beautifully done, which does not specify the type of work involved. The notion
of beauty would seem to be just as semantically or cognitively autonomous as
that of craftwork, so that the collocation should be regarded as bidirectional
or even as one conceptual unit.
Similar but less regular collocational dependencies have been observed by
Grossmann and Tutin (forthcoming), Mel’čuk and Wanner (1996) and
L’Homme (2003). These authors prefer to analyse such regularities in terms
of ‘semantic classes’. In weighing the two analyses, my judgement is that the
assumption of semantic features is more consistent, especially if long-distance
collocations (Siepmann 2003; 2005) are taken into account.
By long-distance collocations are meant lexical dependencies which
manifest themselves over considerable stretches of text. A convenient
illustration is provided by the topic initiator turning to, which is commonly
followed at some distance by informers such as I/we + find/see/note or it
appears that:
(12) Turning to the use of semi-auxiliary is to/are to in if-clauses, we find that a
fifth of the instances in the sample (and 1340 in the corpus as a whole)
appear in this syntactic environment.
In this respect the speech of younger British speakers appears to be following
the lead of American English. Turning to the speech of older speakers,
we note some words which are suggestive of hesitation, uncertainty or turn
manipulation: well, mm, er.
The corresponding Middle High German forms are fuoss, füesse; mus, müse.
Modern German Fuss: Füsse, Maus: Mäuse are the regular developments
of these medieval forms. Turning to Anglo-Saxon, we find that our modern
English forms correspond to fot, fet; mus, mys.
Turning to requirements involving both age plus service, it appears there has
been an increase in the propensity of participants to have normal retirement
available at age 62 with a combination of years of service. (all examples
from CAE)
A similar phenomenon can be observed with the marker of
comparison any more than. This marker, which introduces a
subordinate clause, is always preceded by the negative particle not in the
main clause:
(13) Not all women are ‘carers’ any more than all women are ‘victims’ or
‘contractors’. (CAE)
Such examples could be multiplied; they force us to recognise that, in
order to account for at least some collocational links, it is necessary to
abandon the four-word span on either side of the node which Sinclair (1991)
postulates as the cut-off point for collocational significance because
95 per cent of collocational attraction occurs within this span (Jones and
Sinclair 1974: 21f.). Sinclair’s idiom principle should therefore perhaps
be revised to accommodate ‘long-distance’ collocations entered into by multi-word markers; I propose the following restatement of the idiom principle
for written text:
One of the main principles of the organisation of text is that the choice
of one semantic feature, word or phraseological unit affects the choice of
other words or phraseological units, usually within a maximum span of
several paragraphs. (based on Sinclair 1991: 173)
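The practical effect of widening the span can be shown with a simple counting routine. The Python sketch below is an illustration only: collocates_within and its parameters are hypothetical names, the node is treated as a token sequence so that multi-word markers can be handled, and the token string is adapted from example (12) with punctuation removed.

```python
from collections import Counter

def collocates_within(tokens, node, span):
    """Count tokens found within `span` positions of each occurrence of `node`.

    `node` is a list of tokens, so a multi-word marker can serve as the node.
    """
    n = len(node)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + n:i + n + span]
            counts.update(left + right)
    return counts

text = ("turning to the speech of older speakers we note some words "
        "which are suggestive of hesitation")
tokens = text.split()

narrow = collocates_within(tokens, ["turning", "to"], span=4)
wide = collocates_within(tokens, ["turning", "to"], span=8)
print("note" in narrow)  # False: the informer lies outside the four-word span
print(wide["note"])      # 1: a wider span captures it
```

With a span of four the informer note is missed, even though it is linked to the topic initiator; with a span of eight it is counted.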
This reformulation of the idiom principle also takes account of cases
where there is a great deal of variation among the node and the collocate(s).
One typical case is the collocation of the contrast marker not so with lexical
items such as surely, seem, appear, you/one might think that, it was hoped
that or one hears that, all of which contain a semantic trait implying
‘uncertainty’ or ‘error’:
(14) Some might think Volkswagen, which now owns 70 per cent of the
Czech company, would have thought the Skoda’s identity problematic.
Not so. VW sees Skoda as one of the most recognised brand names in
advertising. (NE)
After recriminations last summer, when a number of big trading houses
were accused - nothing was ever proved - of forcing the FTSE 100 higher
ahead of options expiry dates, it was hoped the Stock Exchange had
nipped things in the bud. Not so. Yesterday afternoon, after a solid if
unspectacular morning’s business, shares in some of the biggest Footsie
companies - the ones heavily weighted in the premier index - motored
sharply upwards. (NE)
Regulators and providers ought surely to be kept apart. Not so, according
to the NRA’s board - and to Lord Crickhowell, who insists that water
management and regulation are inextricably linked. (NE)
So when some 100,000 demonstrators clogged the streets of the capital,
Minsk, on April 10th to support striking industrial workers and to protest
against price rises, it seemed as if discontent had come out of the blue. Not so:
beneath the surface the republic had been stirring for months. (NE)
Here one might make a case for the collocation of underlying rhetorical
strategies rather than strings of words or semantic features. This would be
correct to the extent that the discourse preceding not so sets up an expectation
which is not fulfilled in the subsequent discourse. In actual fact, however,
rhetorical strategy and occurrence of semantic features are two sides of the
same linguistic coin.
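One way of checking such a pattern against text is to look for items carrying the ‘uncertainty’ or ‘error’ trait in the stretch of discourse preceding not so. The Python sketch below is a rough illustration, not a validated procedure: the cue list merely restates the items mentioned above, and the function name cues_before_not_so and the twelve-word window are assumptions.

```python
import re

# Cue expressions taken from the items listed in the text; the list is not exhaustive.
CUES = re.compile(
    r"\b(?:surely|seem(?:s|ed)?|appear(?:s|ed)?|might think|it was hoped|one hears)\b",
    re.IGNORECASE,
)

def cues_before_not_so(text, window=12):
    """For each 'Not so', list the cue expressions in the preceding `window` words."""
    results = []
    for match in re.finditer(r"\bNot so\b", text):
        preceding = " ".join(text[:match.start()].split()[-window:])
        results.append([cue.group(0).lower() for cue in CUES.finditer(preceding)])
    return results

first = "Regulators and providers ought surely to be kept apart. Not so, according to the NRA’s board."
second = "it seemed as if discontent had come out of the blue. Not so: beneath the surface the republic had been stirring for months."
print(cues_before_not_so(first))   # [['surely']]
print(cues_before_not_so(second))  # [['seemed']]
```

A surface cue list of this kind captures only lexicalised realisations of the trait; nominal or more indirect realisations would require a richer semantic annotation.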
Not surprisingly, the ‘error’ part of the above pattern may also be
found in nominal form; in the following example from an academic
text, you might think has been converted to the more formal noun
misconception:
(15) Another misconception about meditation is that the meditator should
fall into a trance. Not so. As a famous Chinese Buddhist put it: There is a
class of foolish people who sit quietly and try to keep their minds blank (. . .)
A more complex realisation of a long-distance collocational pattern is
seen in the following extract:
(16) But if one considers that in college dictionaries the average number of
column-lines allotted to each entry (not each definition) is a bit less than
two, one will see why space is at a premium. (CAE)
In the present case the collocational relationship holds between two types
of SLDM which occur in, respectively, the main clause and the sub-clause of
a complex sentence: the topic shifter (if) one considers (+ wh-clause / NP) and
the suggestor one will see (+ wh-clause / NP). Again, it is not so much the
lexical items themselves which enter into collocation; rather, we are dealing
with a recurrent type of semantic-functional relationship, where both the
second and the first part of the collocation may be replaced by other lexical
items. A few more examples follow:
(17) If one considers that the various paths do not exist except as perceived
by some mind, then one immediately arrives at the conclusion that the
probability of a path should be chosen proportionally to its algorithmic
information. (CAE)
If we consider the nature of Christian persecution as it is
currently understood, we can easily see how the personal attitudes of the
presiding official could have been a significant factor in any particular
trial. (CAE)
If, however, one reads the early dramas of Augustus Thomas and Clyde
Fitch, it will be realized how dexterously the American playwright
profited by the French technician in whom the commercial manager had
faith. (CAE)
French concession markers, too, are evidence of lexical dependencies
operating across considerable spans of text. Thus, the concessive en admettant
que tends to pre-empt the choice of an adversative marker such as pourtant,
encore faut-il que or le fait demeure que:
(18) R.-L. Wagner (1968), qui note que le « terme de “mot” en est venu assez
tard en français à traduire la notion d’une unité lexicale autonome », tout en
admettant le bien-fondé de l’analyse qu’A. Martinet fait de la notion de
« mot », refuse pourtant d’abandonner ce terme parce que la lexicologie
porte sur l’étude des signes en situation. (CAF)
The uncovering of such patterns is of great value for language teaching.
Just as lexico-grammars (Francis, Hunston and Manning 1996, 1998) have
illustrated the close links between word complementation and meaning, so
future text grammars and dictionaries may reveal the collocational nature of
specific rhetorical moves.
Again, such examples could easily be multiplied. They all illustrate the
density and conformity of lexical patterning in text, and suggest that a
‘semantic feature’ approach to collocation holds greater explanatory power
than one based on the assumption of semantic classes, since it would be
difficult to group such items as it is hoped, misapprehension and seem in
one class.
To sum up our discussion so far, we can say that the case for distinguishing
semantically autonomous and semantically dependent constituents of collocations is extremely weak.
3.6 A typology of collocation
The inescapable conclusion to be drawn from this section is that collocational
phenomena span the entire range of morpho-syntactic constructions. The
terms ‘collocation’ and ‘construction’ turn out to be almost synonymous,
a clear indication of the fact that phraseology is at the centre of language
rather than at the periphery. The only category of collocation that
cannot be captured by the notion of construction is collocation of
semantic features. We might therefore posit four main types of collocational
(a) Colligation (t’avais qu’à + INF, ignorer tout de + N, il n’y a qu’à + INF,
ce/cette N [tradition, etc.] est resté(e), NP dans l’âme, typisch + N, far be it
from me to + INF, etc.); note that this definition of colligation goes further
than Hoey’s (see endnote 1). Colligation concerns not only the
grammatical preferences of individual words, but also those of longer
syntagms. Thus, the phrase t’avais qu’à can be said to be in colligation with
an infinitive clause.
(b) Collocation between lexemes or phrasemes (de même . . . de même que,
briser ses chaussures, c’est-à-dire en l’occurrence, regarde où tu vas, bon ben,
à la fin, etc.).
(c) Collocation between lexemes and semantic-pragmatic (contextual)
features (beautifully + [result of creative activity], [uncertainty] + not so,
[question] + eh bien, [expectation] + duly, [negative contextual aspect] +
(not) detract from s.o.’s enjoyment, [vehicle] + mordre sur + [part of
the road], help! (on such one-word collocations, cf. González-Rey 2002:
95, 101)).
(d) Collocation between semantic-pragmatic features (extended lexical
units, cf. Sinclair 1996/2004, 1998/2004; long-distance collocations,
cf. Siepmann 2005).
We are now in a position to reconsider the question we started out from
in this section: what elements make a collocation? The answer now appears
almost disarmingly simple: any colligational pattern may provide the basis for
collocation. Some patterns are particularly common and therefore account
for the majority of collocations (cf. Siepmann 2003):
X + Y (grand maigre, gros mal, réaction à chaud, bon ben, où là, de même
que + de même)
X + Y + Z (+ n) (vilain petit canard, petit coin tranquille)
X + et + Y (sain et sauf, sel et poivre, sick and tired)
X + Prep (wedded to his profession, averse to risk, à la fin)
X + Prep + Y (grand chasseur devant l’éternel)
X + Verb + Y (to say . . . is to say . . . ., la voiture a mordu sur la ligne
We have also seen that some collocations, especially long-distance
collocations, are not merely, or not at all, based on colligational, that is,
syntactic relations, but on semantic relations. Diagrammatically, this gives us:
semantic feature of X + (semantic feature of) Y
4. Are collocations arbitrary?
It seems likely that collocational knowledge is prototypical: to return to one
of the aforementioned examples, children acquiring French as their first
language come across several prototypical utterances containing the lexical
unit mordre sur (1) and then intuitively proceed to build up paradigmatic
classes. These prototypical utterances are made against a specific situational
background, namely motoring. It is the entire figure-ground-relation (moving
object/person – mordre sur – a part of the road [background: account of a car
ride, a car race, etc.]) that is acquired, not just the verb. This creates numerous
associations in the speaker’s mind, so that there are several pathways to
accessing the prototype: seeing a car, using the word ‘car’ at the start of a
sentence, thinking of a car race, etc. Once such associations have been
acquired, it becomes possible for the native speaker to initiate language change
by modifying existing collocations syntactically or semantically via the same
processes (e.g. metaphor, metonymy) as those underlying change in individual
lexical units.
It is not surprising therefore that some authors (Grossmann and Tutin,
forthcoming) have entertained the bold hypothesis of an underlying semantic
systematicity of collocational networks, only to find it disproved by a detailed
study of intensifiers accompanying nouns denoting emotions (parfait bonheur,
amour fou, etc.); Grossmann and Tutin (forthcoming) conclude that the
positioning and generativity of such adjectives is ‘hard to predict’. Further
confirmation for this is provided by the aforementioned investigation into
road transport vocabulary, where it became clear that, while collocational
synonymy makes for economy of learning (e.g. la route/l’autoroute/le chemin/
la rue passe/arrive/conduit/mène quelque part), there are also divergent
tendencies (e.g. a little alley vs *a little boulevard, l’autoroute file vs
*le chemin file; desservi par une autoroute vs *desservi par un chemin de terre).
This is even clearer with collocations such as fashionably late or flou
artistique, where there does not seem to be any previous semantic model on
which the collocation could have been based. Thus, although a post hoc
explanation is sometimes possible, collocation remains an arbitrary phenomenon based on ‘language games’ where semantics clearly play an unpredictable
role. However, although semantic relationships can only be discerned post
hoc, we should not forget that they may lighten the language learner’s task.
5. Can we distinguish between collocations and phraseology on the one hand,
and collocations and free combinations on the other?
This section is concerned with the various strands of argument that have been
deployed in favour of a clear distinction between collocation and phraseology
on the one hand, and collocations and free combinations on the other. These
arguments can be broadly classified into two variants, viz. the argument from
syntax and the argument from semantics.
First, let us look at the argument from syntax. It has been repeatedly claimed
by theoretical linguists that a sharp boundary can be drawn between
collocations and fixed expressions by resorting to standardised tests such as
passivization or pronominalisation (Gross 1996, Scherfer 2002). Thus, a fixed
expression such as prendre la tangente can indeed be neither passivized nor
pronominalised (or rather, it is not normally passivized or pronominalised):
*la tangente a été prise par lui
*la tangente, il l’a prise
Detailed observation of real language use, however, leaves the theoreticians
without a leg to stand on. As Moon (1998), Partington (1998), Burger (1998)
and Siepmann (2003) have shown, modification of ‘standard’ citation forms
of phrasemes is almost the rule rather than the exception, and we find
numerous instances of passivization or relativization where we might not have
expected it. A few examples will suffice:
(19) jeter un pavé dans la mare → ce pavé dans la mare était lancé par
quelqu’un qui . . .
découvrir le pot aux roses → le pot aux roses a été découvert
cracher dans la soupe → la soupe dans laquelle peu osent cracher
avaler des couleuvres → en compensation des couleuvres qu’elle a dû
avaler (all examples from NF)
Our linguistic competence invariably allows us to modify previous
utterances, and this seems to occur quite commonly with phrasemes.
The argument from syntax is spurious for another reason, namely that, just
like phrasemes, collocations (in the traditional sense defined by Hausmann and
Mel’čuk) may also be syntactically or otherwise restricted. One such restricted
collocation is the French noun + verb combination situation [‘ensemble des
circonstances dans lesquelles une personne (un pays, une collectivité) se
trouve’] + faire (cf. Siepmann 2003: 244–245). In this construction faire
invariably introduces a participial relative clause:
(20) la situation faite aux protestants
la situation faite aux immigrants
la situation faite aux prisonniers guinéens (all examples from NF)
A construction of the type ‘on a fait une situation (ADJ) aux protestants’
appears to run counter to the norms of French prose. Such examples could
be multiplied (e.g. la confiance qui l’habite; see Siepmann 2003); they show that
grammatical preferences must not be left out of consideration when dealing
with collocation, despite claims – still to be found even in recent scholarship –
that collocations can be represented as quasi-mechanical associations of the
type Sonne + sitzen (Steyer 2000: 110).
Table 7: Exocentric vs endocentric items
Exocentric (phrasemes)                            Endocentric (collocations)
Quand le chat n’est pas là, les souris dansent.   faire l’autopsie d’un corps
poivre et sel (= gris)                            avoir intérêt à
un panier percé                                   un panier à salade
Turning now to the argument from semantics, we find that this argument is
far more difficult to get to grips with, since it raises fundamental questions
about a theory of collocation and language, some of which we dealt with in
Section 2 above. There we found that the assumption of a differing semiotactic
status for the constituents of a collocation, though intuitively appealing, runs
into severe difficulties.
Another semantically-based suggestion for drawing the dividing line between
collocations and phrasemes has been put forward by González-Rey (2002:
120ff.); it is based on the endocentric/exocentric distinction which is quite well
known from morphology, where it serves to differentiate different types of
compounds (e.g. credit card [endocentric] vs blackhead [exocentric]). Consider
the examples in Table 7.
The left-hand items are said to be exocentric because none of their
components can be deleted, their meaning is not derivable from their
constituents, and they can only be understood in a specific situational context.
Endocentric items, on the other hand, are said to be characterized by the
following features:
(a) the constituents are deletable (e.g. un ton aigre, un ton doux)
(b) the meaning of the whole is compositional
(c) the expression has a referential meaning
Unfortunately for González-Rey’s theory, there is no basic difference
between the kind of context-dependence posited for exocentric items and
that which applies to purportedly endocentric items such as ‘quiet drink’,
‘sudden bend’, ‘le paysage défile’, ‘lu et approuvé’ or ‘pour valoir ce que de
raison’ (the last two being cited by González-Rey). The meaning of such items
can hardly be referred to as compositional, since there is no compatibility
between their institutionalised senses. A landscape cannot ‘rush’, any more
than a bend in the road can be ‘sudden’. Hausmann (2003) cites a number of
similar borderline cases, such as krummer Hund, where it must be assumed that
Hund has the langue-meaning ‘person’ if it is to be considered the base of
the collocation.
It is also doubtful whether deletability can serve as a valid defining
criterion. Counterexamples are not far to seek; thus, it is quite common to find
the second part of an idiom, especially a proverb, deleted, as in ‘speak of the
devil, . . .’ or ‘quand le chat n’est pas là, . . .’.
Feilke (1994, 1996, 2003) was the first to discern the root cause of such
classificatory problems with full conceptual clarity. Recognizing that linguistic
expressions can be ‘idiomatic’ while at the same time being syntactically
and semantically well-formed, he advocates the theoretical decoupling of
idiomaticity and syntactic-semantic compositionality (Feilke 2003: 60).
According to him, it is the context and the participants placed in that context
which, via a figure-ground relationship, bestow meaning on such collocations
as the landscape rushes past or lu et approuvé. This is all the more convincing
since some words (e.g. ‘Lage’ [‘situation’]) have no distinctive meaning
components, so that it is impossible to attribute a summative meaning to
such expressions as sonnige Lage (‘sunny location’).
6. Are collocations monosemous and monoreferential? Are there synonymic
According to González-Rey (2002: 117), collocations are monoreferential
and do not allow synonymic variation:
‘L’unité ne peut se constituer comme variante, exprimée sous la forme de
périphrase, d’un mot déjà établi, ni admettre d’autres variations pour
le même référent, à moins d’en créer des sous-catégories.’ (González-Rey
2002: 117, my emphasis)
Although this statement is generally correct, here too it is relatively easy
to find a number of counterexamples, such as to stick to/keep to the speed limit;
Verbrechen begehen / verüben11; parvenir/arriver à un compromis; la pluie baisse /
baisse d’intensité / diminue / se calme, etc. It is often claimed that such synonyms
differ in some aspects of their meaning, especially according to style level, but
this line of argument clearly does not apply to the first two examples just cited.
It is also interesting to note that one collocation may take on several
meanings, a factor that has been neglected both in lexicological theory and in
dictionary making. A simple example of a polysemous collocation is English
‘avoid an accident’:
(21) s.o. avoids an accident (1) → qqn évite un accident → j-m vermeidet
einen Unfall
s.o. avoids an accident (2) → qqn échappe à un accident → j-m entgeht
einem Unfall
To take a more complex example, the French collocation donner + exemple,
normally translated by give + example and geben + Beispiel, can occur in two
different types of linguistic environment (cf. Siepmann 2003). Compare the
following groups of examples:
(22) Les grammaires disent encore que les adjectifs verbaux issus d’un participe
présent ou passé ou d’une de leurs formes préfixées sont presque toujours
placés après le nom. Mon corpus donne de nombreux exemples d’infractions
à cet usage (. . .)
D’autres exemples ont été donnés à la réunion de la Société française de
microbiologie à l’Institut Pasteur en décembre 1997.
Les économies régionales autarciques ont existé jusqu’au moment où se
sont développés les moyens de communication. G. Kuhnholtz-Lordat
en donne un remarquable exemple dans le « pays de Costière » (département
du Gard).

L’Arabie Saoudite donne un exemple d’Etat islamique moderne.
R. T. T. Forman et M. Godron (1986) définissent un paysage comme un
espace de plusieurs kilomètres carrés, où un assemblage particulier
d’écosystèmes interactifs se répète à peu près à l’identique. La
mosaïque des champs, des prés, des haies et des bois d’un bocage en donne
un exemple.
De sorte que les villes ont crû, se sont transformées et fragmentées,
d’une manière qui dépasse tout ce que l’on avait pu imaginer. Le meilleur
exemple est donné par Mexico, la ville du monde la plus peuplée, dont il est
désormais impossible de fixer les limites et de dresser le plan. (all examples
from CAF)
In the first group of sentences donner has retained one of its dictionary
meanings (‘communiquer, exposer’). In functional grammar terms, the
subject of donner would be labelled an ‘actor’; the construction belongs to
the material process type. It is somewhat different with the second group of
sentences, where donner has an equative meaning characteristic of the
relational process type. Its subject is a ‘token’ that has a ‘value’ ascribed to
it in the form of an object. Since the English collocation give + example and
the German collocation geben + Beispiel can only be used with material
processes, a literal translation of the second group of examples is out of
the question. We thus have to resort to equivalents based around copular be.
The first sentence of the second group, for example, could be translated as
Saudi-Arabia is an example of a modern Islamic state.
Saudi-Arabien stellt ein Beispiel für einen modernen islamischen Staat dar.
The above considerations also hold true for noun-adjective combinations
such as heures creuses (literally ‘hollow hours’). Heures creuses is a semi-technical term which occurs in at least four different fields: power generation,
rail transport, road transport and telecommunications:
(22) Les radiateurs à accumulation nécessitent la mise en oeuvre d’un
asservissement aux heures creuses EDF.
la SNCF renforce les trains aux heures creuses entre Paris et Combs-la-Ville
0,075 ou 0,105 (Bouygues) aux heures creuses (all examples from NF)
Such collocational polysemy is also apparent from the paradigmatic
relations entered by heures creuses. Thus, whereas in telephony the antonym
of heures creuses is heures pleines, in road transport it is heures de pointe.
Somewhat counterintuitively, collocational polysemy is particularly
common in special-purpose language. Thus, some French noun-(relational)
adjective combinations of the type roue intérieure can usually be disambiguated
in context only, since at least one of their meanings arises from the deletion
of an intermediate element: roue (à denture) intérieure (Forner 2000: 180ff.).
7. Conclusion: A redefinition of collocation for lexicographic purposes
It should have become clear that previous definitions of collocations have
relied too heavily on introspection rather than corpus evidence. This has
prevented linguists from realizing that what has traditionally been known as
‘collocation’ or ‘phraseology’ is only one aspect of idiomatic language use,
and that the boundaries between the two are hazy and uncertain. The only way
out of this dilemma is a rigorously corpus-driven approach to the study
of lexis and grammar, and this is the approach that has been taken in the
present study.
Our discussion suggests that even the most sophisticated structuralist
definitions cannot adequately capture the phenomenon of habitual
co-occurrences, and that the frequency-based approach to collocation cannot
account for the collocation of semantic features. We would therefore be
justified in loosening the definition of collocation to a considerable extent;
collocation could be defined pragmatically with reference to the notions of
‘Gebrauchsnorm’, or ‘usage norm’ (Steyer 2000: 108), reflected in concepts
such as ‘minimal recurrence’ (Kocourek 1991, Siepmann 2003) or ‘statistical
significance’ (Sinclair 1991), on the one hand, and the notion of ‘inhaltliche
Geschlossenheit’, or ‘holisticity’, on the other hand (Siepmann 2003), the latter
referring to the facts (a) that native speakers can ascribe meaning to general-language collocations even if these are divorced from context and (b) that
such units are intuitively considered as self-contained ‘wholes’:
a collocation is any holistic lexical, lexico-grammatical or semantic unit
normally composed of two or more words which exhibits minimal recurrence
within a particular discourse community
‘Holisticity’ should here be taken to include colligation with a particular
grammatical category, such as a noun phrase. Thus, the collocations the future
belongs to (die Zukunft gehört, l’avenir appartient à) or l’autoroute file would be
felt to be incomplete by most speakers, requiring as they do a prepositional
object. This variable complement is conceived of as part of the collocation.
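The ‘minimal recurrence’ half of this definition lends itself to a rough operational sketch; holisticity, by contrast, rests on speaker intuition and is not modelled here. The Python fragment below is an illustration only, not the procedure used in this study. It assumes, purely for demonstration, that ‘minimal recurrence’ means attestation in at least two different texts, and it looks only at adjacent word pairs. The function name recurrent_pairs, the parameter min_texts and the three sample strings (two of them taken from the heures creuses examples above, the third invented) are all hypothetical.

```python
import re
from collections import defaultdict

def recurrent_pairs(texts, min_texts=2):
    """Return adjacent word pairs attested in at least `min_texts` different texts.

    Assumption: 'minimal recurrence' is approximated by the number of
    distinct texts in which a pair occurs; the threshold is arbitrary.
    """
    seen_in = defaultdict(set)  # pair -> indices of the texts containing it
    for idx, text in enumerate(texts):
        words = re.findall(r"\w+", text.lower())
        for pair in zip(words, words[1:]):
            seen_in[pair].add(idx)
    return {pair: len(ids) for pair, ids in seen_in.items() if len(ids) >= min_texts}

# Hypothetical mini-corpus; the third string is invented for contrast.
sample = [
    "la SNCF renforce les trains aux heures creuses",
    "un asservissement aux heures creuses EDF",
    "le tarif des heures pleines",
]
pairs = recurrent_pairs(sample)
```

On this toy input only aux heures and heures creuses recur across texts. A count of this kind can at best flag candidates: it captures neither long-distance collocation nor collocation between semantic features, and it says nothing about whether a sequence is felt to be a self-contained whole.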
There is some evidence from a psycholinguistic study by Schmitt et al. (2004)
to suggest that the above definition, first proposed in sketchier form in
Siepmann 2003, is psychologically valid. Schmitt and his colleagues administered an oral dictation task to a number of native and non-native English
speakers, who were asked to reproduce dictation ‘bursts’ of considerable length
which contained different types of recurrent clusters. It was found that not all
statistically significant clusters retrievable from corpora are stored as holistic
units in the mental lexicon; there was a discernible tendency for semantically
transparent clusters (e.g. to make a long story short) rather than sentence
fragments (e.g. in a variety of) to be reproduced intact. This finding seems all
the more plausible since even participants’ failure to reproduce original
sequences does not mean that they are not stored in the mind, for the simple
linguistic reason that many so-called ‘fixed’ expressions admit of variants
(e.g. to cut a long story short; see above) that participants may prefer.
From this section there emerge two important conclusions for linguistic and
lexicological theory. Firstly, collocation, as defined above, dominates language
use (at least from a statistical perspective). That is, Sinclair’s open-choice
principle only has a marginal role to play compared to his idiom principle,
which, as seen in our discussion of long-distance collocations, needs to be
considerably widened. Secondly, collocations should be considered as fully-fledged linguistic signs in their own right, so that Saussure’s word-based
linguistics will have to be complemented by a collocation (or ‘expression’)-based linguistics (cf. Feilke 1996, 2003).
This redefinition of collocation enables us to account for the ways in
which language users operate with wholes (words, collocations) and at the
same time with parts (words, semantic features) which they have extracted
from contextual wholes – a key demand to be placed on any semantic theory
(cf. Bolinger 1965: 570–571). Both operations have been shown to be governed
by collocation, thus providing further evidence for Hoey’s claim that
collocation is indeed one of the central mechanisms involved in meaning
creation (see introduction).
It thus appears that both structurally simple (i.e. [bound] morphemes,
lexemes) and structurally complex units (i.e. collocations/colligational
patterns) are linguistic signs. If the dictionary is meant to be a record of
such signs, the task of the lexicographer is to gather together evidence of both
types of sign. So far it has been lexemes, non-compositional idioms and
morphemes that have received the bulk of lexicographic attention, but the
future clearly belongs to collocation and colligation in the widest possible
sense.
In the second part of this article, I shall discuss some of the implications
such a change in perspective – not to say paradigm shift – has for the making
of encoding dictionaries.
Notes
1. For those readers who are not yet familiar with the relatively recent notion of
colligation (a term originally coined by Firth), here is how Hoey (1998) defines it:
- the grammatical company a word keeps (or avoids keeping) either within its own
group or at a higher rank.
- the grammatical functions that the word’s group prefers (or avoids).
- the place in a sequence that a word prefers (or avoids).
2. Even a superficial glance at lexical functions shows that they disregard contextual
relationships. Thus, the adverb drop-dead may intensify the adjective beautiful with
reference to women, but not with reference to buildings.
3. On an alternative construal, the German sequence might be viewed as a
colligational pattern or schematic construction (Croft and Cruse 2003): eine ADJ
Straßenlage haben, but this seems problematic to the extent that very few adjectives can
fill the slot.
4. I use the term ‘concept’ more or less in its standard terminological sense to
refer to a ‘unit of thought constituted through abstraction on the basis of properties
common to a set of objects or phenomena’.
5. Clearly, then, the notion of literal meaning turns out to be a linguistic abstraction
(see also the introduction to this article).
6. A point of criticism that might be raised is that we are here dealing with an instance
of regular polysemy. The meaning of ‘drink’ could be glossed as referring to an occasion
where people have a drink, and the same reasoning would apply to cases such as quiet
dinner/breakfast/lunch/tea. I would argue that such apparent regularities are in fact
more or less accidental; as Blank (2001) and Grossmann and Tutin (forthcoming) have
shown, nouns belonging to the same semantic class may share some of their collocations
or colligations, but not all of them (e.g. nach der Schule gingen die Schüler nach Hause vs
*nach dem Parlament gingen die Abgeordneten nach Hause).
7. Or a three-item construction in the sense of Croft and Cruse (2003).
8. Blank is not unaware of the fact that verbs may also be associated with particular
circumstantial complements (‘Zirkumstanten’) which may themselves carry selectional
restrictions, but he considers these two levels to be of lesser importance. As our analysis
has shown, however, it is often the particular collocation that determines the verb
pattern (l’autoroute file quelque part). Put another way, valency and collocation appear
to shade off into each other; speakers have semantically and syntactically prepatterned
collocations or ‘constructions’ (Fillmore) at their disposal.
9. Interestingly, the distinction we have just established between selectional and
collocational restrictions has a parallel in theories of formal grammar, such as Head-driven Phrase Structure Grammar, where selection refers to the process whereby
a head selects its complements and an adjunct selects its head. Using the example of the
German verb fackeln, whose linguistic environment invariably comprises a durational
modifier (most commonly nicht lange), Sailer and Richter (2002) show that the
durational modifier cannot be interpreted as a complement of the head verb, but rather
as an adjunct. Therefore, they argue, the relationship between the head verb and the
relational modifier is one of collocation rather than selection.
10. An alternative, cognitive-linguistic explanation might take the conceptual background as its starting point. Since paintings, carvings, etc. are often perceived
as aesthetically pleasing, the adjective beautiful readily springs to mind to describe
them. Collocations incorporating the adverb beautifully could then be regarded as
being derived from the original collocation (beautiful carving → beautifully carved).
The problem with this explanation is that such derivation is not always possible.
11. Dieter Wirth, personal communication.
