Journal of Pragmatics 25, 1996, 503-535.
The ‘Pragmatics’ of Doing Language Science:
The ‘Warrant’ for Large-Corpus Linguistics
Robert de Beaugrande
[abstract]
The development of ‘mainstream linguistics’ in this century is briefly
retraced to suggest that the original decision to describe ‘language by itself’
as opposed to ‘language in use’ favoured formalism over functionalism and
eventually led to a severe impasse for three tests a valid science of language
ought to meet: coverage of language
data, convergence among the data
being described, and consensus among
linguists about how to proceed. The impact of large-corpus linguistics might
resolve this impasse and accordingly raises the prospect of a fundamental
reorientation of linguistic theory and of the ‘pragmatics’ of ‘doing language
science’.
A. Testing the progress of
‘mainstream linguistics’
1. The term ‘mainstream linguistics’ is sometimes used
to designate a language science pursuing a programme based on several generally
agreed principles (survey in Beaugrande 1991), notably:
(a) Language is a phenomenon
distinct from other domains of human knowledge or activity.
(b) A language constitutes a system defined completely by
internal, language-based constraints.
(c) A language should be described apart from the conditions under which
speakers use it.
(d) The description of a language should be couched in
statements at a high degree of generality,
if possible about the ‘rules’ for the language as a whole or even about the
‘universals’ for all languages.
Within the programme, the tenets interlock in projecting a free-standing
and self-sufficient conception of language as a uniform, stable, and abstract
system holding still while we are describing it, and separated from the
‘rich’ and ‘messy’ human contexts where it is encountered in ordinary life.
Linguistics has thus undertaken to describe a theoretical construct of
‘language by itself’ (‘langue’, ‘competence’, etc.) situated at a safe distance
from the empirical realities of ‘language in use’ (‘parole’, ‘performance’,
etc.) during communication. Saussure’s (1966 [1916]: 9) defensive mistrust of actual
‘speech’ has proven highly influential (cf. § 17):
(1) speech is many-sided and heterogeneous [...] it
belongs both to the individual and to society; we cannot put it into any
category of human facts, for we cannot discover its unity.
2. If we imagine a ‘layer-cake’ with language as a
system mediating between a culture’s knowledge of the ‘world’ on one side and
the ‘society’ of speakers on the other, the programme of ‘mainstream’
linguistics implies detaching language and rolling away the other ‘layers’
(Fig. 1). Doing so discounts the constraints upon what people say due to what
they are talking about and to whom.

The working hypothesis is that
once detached, the language system will stand firm: complete and fully
organised by its own internal constraints (cf. § 6, 19).
3. Yet what we encounter is always ‘language in
use’; even the formal analysis of isolated sentence structures is a mode of
use, albeit a peculiar and untypical one (§ 15, 53, 57). To get from ‘use’
over to ‘language by itself’, linguistics
has foregrounded data-handling
strategies, such as:
(a) collating:
a large set of data samples are compared and contrasted to distil out what they
have in common, e.g., which word types frequently occur with other types;
(b) generalising:
certain aspects of the observed data are construed to be general ones, e.g.,
that the ‘Subject-Verb-Object’ order of a sample set of English sentences is a
typical pattern for the language as a whole;
(c) rarefying:
the ‘rich’ data as they were observed in spontaneous interaction are made
‘sparse’, e.g., by disregarding the personal authority of speakers;
(d) decontextualising:
the data are taken out of the observed context and text type and treated as if
they had occurred in isolation or could occur in a wide range of contexts,
e.g., irrespective of the social status of individual speakers;
(e) introspecting:
the linguists make estimations based on their own intuitions about the
language, e.g., which sentences do or do not violate the ‘rules’;
(f) consulting
informants: native speakers are given data samples of their language and
asked to judge or rate them, e.g., to decide which of two versions of a
sentence is ‘grammatical’ or ‘ungrammatical’.
These strategies project a second working hypothesis, namely that applying
them will lead to a complete and valid description of a given natural language.
4. How might these two main working hypotheses be tested? If it is true that a language
system is fully organised by its own internal constraints, and that these strategies
can describe it as such, then the three key tests for progress would be steady
cumulative rises in (a) the coverage
of the language; (b) the convergence
among language data discovered and described; and (c) the consensus among linguists about how to formulate the description.
Yet if we apply these three ‘C-tests’
to mainstream linguistics, we find a rise only in some domains and a sharp
fluctuation in others. What ‘pragmatic’ factors for ‘doing language science’
can have led to this outcome?
5. Probably the most influential factor was
that the programme for describing ‘language by itself’ has naturally favoured formalism, the stance construing form to be the basis and framework of
language — how entities are shaped or arranged. As we strive to ‘abstract’
language out of everyday contexts, the most stable and reliable substrate
naturally appears to be the forms, the patterns of word-stems and suffixes or
the patterns of phrases and sentences. This factor discourages functionalism, the stance construing function to be the basis and framework
of language — what means are used toward which ends; functional aspects tend to
be associated with language in use. So the ‘majority position’ in ‘mainstream
linguistics’ has usually been that formalism confers high ‘scientific’ status
and that its legitimacy can be taken for granted, whereas functionalism is
‘unscientific’ or ‘pre-theoretical’ and its legitimacy must be expressly
justified. In this academic power structure, formalist research is not required
to specify its relevance and its ecological validity — whether and how
its findings contribute to a general and productive understanding of the human
situation — whereas functionalist research is expected either to struggle toward
the a priori criteria of rigour, abstractness, generality, and so on, set down
by formalism, or else to defend itself for not doing so. So functionalism has
been severely held back or has worked at cross-purposes by not following up on
its insights and by compromising its own ecological validity in order to
compete with formalism on the latter’s terrain.
6. In the long term, the development of mainstream
‘formalist’ linguistics reveals an ominous trade-off:
the more formalised a theory of language and its apparatus of terms and formal
notations, the less we can expect a steady rise on the three ‘C-tests’ of
coverage, convergence, and consensus (cf. § 13, 18, 51f). This trade-off is not
unduly surprising, given the robust fact that people, including linguists, are
naturally much less skilled at ‘formalising’ natural language data than they
are at speaking, hearing, reading, and writing them. After detaching language
from the constraints of ‘world’ and ‘society’ and discounting functions in
favour of forms, we are left with the huge task of reconstructing or inventing
the formal and ‘purely linguistic’ constraints for every sort of regularity we
may encounter (cf. § 18). The enterprise of linguistic formalism hinges on the
assumption that there exists, for each natural language, at least one set of
such constraints strictly separating what belongs to the language from what
does not. Decades of formalist research have failed to identify any such set
for any language; and the lack of progress on the three ‘C-tests’ strongly
argues that no such set exists.
Converted into a static and closed formal system, language does not stand firm, complete, and fully
organised by its own internal constraints (§ 2); instead, it tends to skid out
of control.
7. The ‘formalist’ trade-off has been somewhat obscured by the fact that
it holds in differing degrees for the various domains of language. The three
‘C-tests’ are best met in phonology and morphology, which both offer concise
methods for segmenting language data
so as to isolate and classify minimal units. Phonology has
the most clearly defined criteria in
the articulatory events and locations that characterise the sound-units called
‘phonemes’, e.g., a ‘voiced dental stop’ such as /d/ produced when the vocal
cords vibrate and the air flow is blocked by the teeth. The visual
correspondence between many phonemes and written letters of the Roman alphabet
also supports the ‘C-tests’, though it has not been made into a theoretical
principle, since the description is strictly addressed to spoken language.
Thanks to these clear criteria, linguistics soon provided descriptions of the
repertory of sound-units in language after language, covering all the phonemes
with impressively high convergence and consensus.
8. The success of phonology did much to entrench the
concept of ‘language by itself’ being a uniform, stable, and abstract system,
or rather a set of subsystems, usually called ‘levels’, each consisting of a repertory of minimal combinable elements.
A complete description of a language would be the sum of the complete
descriptions for each subsystem, supplied by linguists working within the tidy,
‘pragmatic’ division of labour reflected in academic specialisations, journals,
conference proceedings, and course offerings.
9. Yet in morphology, the criteria are already less
tidy. Convergence and consensus are fairly high for identifying and isolating
the ‘morphemes’, aided again by the visual clarity of the data written down.
The analyst segments the written data until no further meaningful subdivisions
appear feasible — a method once introduced as ‘immediate constituent analysis’,
which, if applied ‘in all observation of word-structure’, Bloomfield (1933:
209, 221) promised, would eliminate any ‘inconsistency of procedure’. His promise
rested on a staunchly formalist mandate:
(2) Any utterance can be described in terms of lexical
and grammatical forms; [and] any complex form can be fully described apart from
its meaning in terms of the immediate constituent forms and the grammatical
features [whereby these] are arranged (1933: 167).
Here, consensus is to be established by sheer stringency of method. Yet
recalcitrant problems can arise that would not trouble phonology. We can easily
reach a consensus about the human vocal apparatus; and after some simple
demonstrations, most speakers will agree that they ‘know’ the repertory of
phonemes in their native language, e.g., when they distinguish between voiced
and unvoiced consonants. But it’s harder to agree in what sense speakers ‘know’
their native language as a repertory of its minimal meaningful forms. So we are
on weaker grounds in claiming that our own consensus as linguists corresponds
directly to the consensus of speakers of the language we are describing (cf. §
74). Consider the English morphemes borrowed from French, Latin, and Greek but
no longer recognised by all contemporary monolingual speakers. Should a
morphological description include not just the more obvious ones like ‘in-’ and
‘im-’ for negation alongside ‘un-’, ‘non-’, or ‘a-’ but also the more erudite
ones like ‘pter’ (‘wing’) in ‘helicopter’, where speakers might instead
identify the final ‘-er’ as an agentive suffix (compared, say, to ‘lawnmower’)?
Doing so would oblige us to turn to language history and thus deviate from the
‘mainstream’ programme to describe language ‘synchronically’ in a single stage
of its evolution.
10. Again in contrast to phonology, morphology is
quite problematic in respect to coverage. In theory, the entire vocabulary of
the language consists of ‘morphemes’ or clusters of these; how could we list
them all? The ‘mainstream’ strategy has been to focus on the ones that form
stable, compact classes, e.g. the set of all verb inflections, versus the ones
that form unstable, open classes, e.g. the set of all verbs or verb stems. Only
for the first type could full coverage be attained, whereas the second type
could be consigned to the category of ‘lexemes’
to be described in the domain of ‘lexicology’.
11. Still, morphology has fared rather well on the
‘C-tests’ through its close engagement with the real data recorded in fieldwork
on previously undescribed languages,
which in my estimation has contributed by far the finest achievements in modern
linguistics. The fieldworker’s overall task is to progress from being an
‘outsider’ in the community of speakers over to being an ‘insider’ who can
speak the language at least well enough to interact with the community and
eventually to describe the language. The fieldworker must reach a working
consensus with the community or else expect to be misunderstood, ridiculed or
ignored. The task is richly supported by ordinary constraints from ‘world’ and
society (§ 2, 6), which always apply to real data but which may well not appear
in a stringent formal description. To maintain a consensus with the community,
the fieldworker can rely on an intuitive, perhaps unconscious grasp of such
constraints to produce ‘proper’ utterances, whether their ‘propriety’ is
‘purely linguistic’ and can be ‘formalised’ within a ‘linguistic theory’ or is
more cognitive or social.
12. The three ‘C-tests’ get shifted far more radically
in the move from morphology to syntax. Units can no longer be isolated by using
the criterion of ‘minimalness’; nor does it seem at all feasible to make a
complete repertory of syntactic patterns. So phonology and morphology were
replaced by syntax at the centre of linguistic theory, and the concept of
‘language’ itself got shifted from a ‘descriptive’
notion of a repertory of units (§ 8)
over to a ‘generative’ notion of a repertory of rules for constructing and
arranging units. Since the ‘rules’ plainly do not appear in the data, syntactic research relaxed the close engagement
with real data as established in morphology fieldwork. Instead of segmenting
the language sequences themselves, the task was to devise rules that would ‘generate’ the underlying structure of the sequences. Such a shift did not just
leave formalism intact, but actually endowed language data with an enhanced but
hypothetical formality. The effect was most striking when semantics was added
onto syntax: to maintain the detachment of language from knowledge of the
‘world’ (§ 2, 6), meanings were described as arrays of underlying forms, often
called ‘semantic features’.
13. At this point, the ‘formalist trade-off’ described
in § 6 began to grow virulent. Disengaging from real data encouraged some
influential generative linguists to turn to invented
data, whose status as part of the language was certified not by its occurrence in the actual speech of
native speakers but by the intuitive
approval through introspection.
The official rationalisation was that corpuses of real data are inadequate
because they are ‘finite’ and ‘accidental’ collections of utterances, whereas
speakers of a language can produce or understand many more utterances —
presumably an ‘infinite’ set of them (Chomsky 1957: 15). This rationalisation
had the labour-saving corollary that fieldwork is not very necessary or
helpful: linguists need merely elicit invented data or even — a unique
privilege among scientists — invent their own data when they are native
speakers of the language, e.g.:
(3) The man hit the ball.
(4) John is easy to please.
(5) The cat sat on the mat.
A bit paradoxically, the ‘normalness’ of such sentences can make them
seem a bit odd in comparison to what people actually say (§ 26).
14. Saving labour this way has some severe hidden
costs. When linguists are no longer in the concrete fieldwork situation of
confronting real data in an unknown language, the task of describing the
language is no longer firmly correlated with the task of reaching a working
consensus by going from an outsider to an insider in the culture (cf. § 11).
This change removes both the most tangible means of testing one’s assumptions
and the richest source of constraints from world and society. And when the task
of describing freely rides upon the describer’s prior facility and unstated
intuitions regarding the language, the linguists are already insiders before
they start their work.
15. The
impending problems were forestalled by the central formalist assumption that
there exists, for each natural language, at least one set of formal constraints
strictly delimiting what belongs to the language from what does not (§ 6). When
found, such a set would provide total coverage and lead to a formal account
both for the convergence among the structures of ‘grammatical’ or ‘well-formed’
sentences and for the consensus among the intuitions of native speakers. Since
phonology and morphology had been relegated to the sidelines (§ 12), the
constraints could be grouped under the respective headings of syntax,
semantics, and pragmatics. Each group of constraints could be identified by selectively violating it, e.g.:
(4) John is easy to please. (‘well-formed’)
(4a) *To is please John easy. (‘syntactic violation’)
(4b) ?John is easy to sneeze. (‘semantic violation’)
(4c) ?John, be eased and pleased! (‘pragmatic violation’)
(4d) ?A john is sleazy to fleece. (which violation?)
But such demonstrations entail several problems:
(a) Insofar as the examples were invented on the spot
to demonstrate the rules, they cannot be an independent validation of the rules;
and the constraints applying to the act of invention and to its peculiar and
untypical purpose — to produce a selective violation — hardly match the
constraints that apply to ordinary acts of discourse and to their practical
purposes, e.g. to justify what you are doing.
(b) The assessment of a violation depends
heavily on the ingenuity of the linguists, e.g., whether they can imagine a
situation where it would be appropriate to talk about ‘sneezing John’ (he might
be a flu microbe in a children’s story); or where John’s imperious mother
might command her son to be pleased about a Christmas gift and to have an easy
conscience about not getting her one. To argue that a sentence is disqualified if it was ingeniously devised (as
Bierwisch has) is not helpful when we cannot define the threshold where naivety
leaves off and ingenuity starts. Even (4) is ingenious in the sense that
it was expressly invented to make a point about underlying structure, and is
unlikely to be uttered (§ 53, 57).
(c) It is also easy enough to invent examples where it
is not clear which group of constraints is violated. (4d) would be such a case,
and might yet be contextualised by applying the American meanings of ‘john’ as
a ‘toilet’ or a ‘prostitute’s customer’ (Random
House Webster’s College Dictionary,
1991, p. 729).
16. At all events linguists were dismayed to find a
wholly unexpected lack of agreement, both among themselves and among native
speakers, about sample sentences. This outcome gave rise to a series of complex
rhetorical manoeuvres on two sides. On the side of data, samples were carefully
restricted in order to highlight the contrast between clearly proper sentences
with seemingly obvious meanings versus clearly improper ones with no sensible
meanings, as in (4-4c). On the side of theory, the central formalist assumption
was shielded from the implications of observed disagreements. The set of formal
constraints was declared to correspond only to ‘competence, the speaker-hearer’s knowledge of his language’ and not
to ‘performance, the actual use of
language in concrete situations’ (Chomsky 1965: 4). The ‘speaker-hearer’ was in
turn declared ‘ideal’: ‘living in a completely homogeneous
speech-community’, ‘knowing its language perfectly’, and being ‘unaffected’ by
‘memory limitations, distractions, shifts of attention and interest, and
errors’ (1965: 3). In effect, these two declarations instated consensus by
decree and converted it into an ‘ideal’ that need not, indeed cannot, be tested
against the agreement among speakers.
17. The same rhetorical pressure accounts for the
evasive complication opposing ‘surface structure’ to ‘deep structure’ and
declaring that ‘the grammar does not, in itself, provide any sensible procedure
for finding a deep structure of a given sentence, or for producing a given
sentence’ (1965: 141). Moreover, ‘much of the actual speech observed’ was
declared to ‘consist of fragments and deviant expressions’ (1965: 201), echoing
Saussure’s influential mistrust of ‘actual speech’ (cf. § 1). These further
declarations can serve to explain away any discovered lack of convergence among
language data.
18. These rhetorical manoeuvres suggest that
generative linguists were aware of and disquieted by the ‘formalist trade-off’
(described in § 6) but were determined to rescue the central formalist
assumption (also cited § 6) by designing the theory precisely so as to prevent
the lack of progress in coverage, convergence, and consensus in respect to real
data from counting as a refutation. They correctly speculated that their
manoeuvres, even if thinly or speciously argued, would not be critically
assessed by colleagues who were (a) firmly committed to the ‘mainstream’
linguistic programme of describing ‘language by itself’, (b) not anxious
to undertake painstaking fieldwork in remote places, and (c) mistrustful of
actual speech in all its ‘messy richness’. So we can readily understand the
success of generative linguistics and its continuation through a long and
sometimes arcane series of ‘extensions’, ‘revisions’, or changes of notation
without any willingness to change its basic claims about what a ‘language’ is
and what a ‘linguistic theory’ should do. Its adherents cannot admit that
isolating language from the functional
constraints that apply to real data
incurs the impossible job of inventing all
the formal constraints for all conceivable data, irrespective of
whether native speakers would ever utter them. The ‘generative grammar’ would
have to reconstruct the formal possibility that speakers could utter or understand them; and no
evidence has been brought forward so far that this can ever be done.
19. The conclusion would have to be: if language is
detached from the constraints of ‘world’ and society, its own internal
constraints are not sufficient to support its organisation (cf. § 2, 6, 11f,
14). Hence, any linguistic description which postulates such a detachment will
only be able to cover a part of that organisation and will encounter frequent
obstacles to convergence and consensus. This conclusion is borne out by empirical evidence not about the formal
structure of sentences but about the ‘pragmatic’ activities of doing language
science over the past century. Because ‘language by itself’ was a technical fiction to begin with, theories
about it have been obliged to create a proliferating series of further
technical fictions to prop each other up (‘grammaticality’, ‘well-formedness’,
‘competence’, ‘ideal speaker-hearer’, ‘homogeneous speech-community’, ‘deep
structure’, and all the rest) that are not merely unconfirmed by real data but programmatically
opposed to real data. The prospect today is not merely that no formal
description of ‘language by itself’ has yet attained adequate coverage,
convergence, and consensus for any natural language, but that no such theory ever will. In the long
run, the apparent advantages of linguistic formalism (stability, determinism,
rigour, visual clarity, impressive notations) and the privileges it confers
(to invent and judge your own data, to do science without leaving your desk, and
to escape the rich and messy contexts of human interaction) all turn out to be
liabilities for achieving even its own carefully circumscribed tasks. Such a
formalism relegates us to a shadowy world of formulas and arrays whose
determinacy is financed by their indeterminate relation to the language data
they purport to represent.
C. The impact of ‘large-corpus linguistics’
20. I have briefly retraced the theoretical evolution
of ‘mainstream’ linguistics in section A in order to indicate how the early
programme of describing ‘language by itself’, detached from world and society,
has favoured a linguistic formalism that turned away from real data and
eventually blocked further progress in coverage, convergence, and consensus,
without which we cannot attain a complete and valid description of any natural
language. The growing awareness of this impasse has led to a diversification
within linguistics that has edged formalism gradually out of its ‘mainstream’
and majority position. The brands of linguistics going under such designations
as ‘functional’, ‘systemic’, ‘applied’, ‘cognitive’, ‘computational’, and
‘critical’, along with some adjunct domains such as ‘discourse analysis’ and
‘discourse processing’ (which seldom aspired to be part of linguistics), all
share the enterprise of resituating language in its cognitive and social
contexts, reassembling, as it were, the ‘layer cake’ of language interfaced
with world and society (§ 2).
21. As the
conventional division between ‘language by itself’ versus ‘language in use’ has
been progressively narrowing, we have found that real data are not plagued by
the lack of ‘discoverable unity’ that, Saussure vowed, would prevent us from
‘putting speech into any category of human facts’ (§ 1); nor do they ‘consist
of the fragments and deviant expressions’ that justified Chomsky’s retreat from
‘the actual use of language in concrete situations’ (§ 16). Instead, real data
reveal an unexpectedly high degree of precision and clarity, though not
necessarily in the modalities that mainstream linguistic theories would easily
recognise.
22. This finding has been most profoundly assisted by
the advance of technology, placing within our reach a new source of data that
dramatically enhances the prospects for coverage, convergence, and consensus.
The key technical innovation is the large computerised corpus of data from
actual texts and discourses, such as the ‘Bank of English’ (hereafter ‘BoE’ for
short) developed at Birmingham University by John Sinclair and his team. I took
the data described below from the BoE in July 1994, at the stage when it had
reached the size of some 200 million words of running text from contemporary
spoken and written sources, including: British and American books; newspapers (Times, Independent, Guardian, Today, Wall
Street Journal, New Scientist, Economist); magazines (e.g., Esquire, Good Housekeeping); ‘ephemera’
such as letter-box mailings (e.g., YMCA appeal for homeless people, Friends of
the Earth Tropical Rainforest Campaign), radio broadcasts (British Broadcasting
Corporation in the UK and National Public Radio in the US); and recordings of
conversations.[1] The coverage by so large a corpus
might validly claim to be representative,
though it is certainly not complete
and is very far from ‘infinite’. Yet paradoxically, it has itself made us aware
of the ways in which it is yet too small (§ 66ff).
23. Still, as a sample of contemporary English usage,
the coverage exceeds previous sample sizes by various orders of magnitude, such
as: the previous 20-million word corpus used for the 1987 Collins COBUILD English Language Dictionary (by 1 order of
magnitude); the 1-million word Survey of English Usage at University College
London (by 2 orders of magnitude plus doubling); the 2000-word fragments in the
Brown University corpus (by 5 orders of magnitude); and the 24 invented
sentences analysed or ‘transformed’ in Chomsky’s Aspects (by 7 orders of magnitude).[2]
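The size comparisons above can be checked with a few lines of arithmetic. The following is only a sketch: the figures are the ones quoted in this paragraph, the dictionary labels and the function name `orders_of_magnitude` are illustrative, and the last comparison sets a count of sentences against a count of words, so it is approximate at best.

```python
import math

BOE_WORDS = 200_000_000  # size of the Bank of English as reported for July 1994

# Comparison sizes as quoted in the text. The last entry is a number of
# sentences rather than words, so its ratio is only a rough indication.
samples = {
    "COBUILD 1987 corpus": 20_000_000,
    "Survey of English Usage": 1_000_000,
    "Brown corpus fragment": 2_000,
    "Invented sentences in Aspects": 24,
}

def orders_of_magnitude(larger, smaller):
    """Return the base-10 logarithm of the size ratio."""
    return math.log10(larger / smaller)

for name, size in samples.items():
    ratio = BOE_WORDS / size
    orders = orders_of_magnitude(BOE_WORDS, size)
    print(f"{name}: ratio {ratio:,.0f}, about {orders:.1f} orders of magnitude")
```

A ratio of 200 gives about 2.3 orders, which is the ‘2 orders of magnitude plus doubling’ of the text; the ratio for the 24 sentences falls just short of 7 and rounds to it.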
24. Contrary to what is widely believed, the increase
in orders of magnitude does not entail a direct proportionality whereby we just
get the same data multiplied by 10, 100, 1,000, and so on, so that if an item
appears once in a 1 million word corpus, it appears 20 times in a 20 million
word corpus and 200 times in a 200 million word corpus. If that were true,
building steadily bigger corpuses would only give the results we could
accurately predict from the proportions in a small corpus. But in fact the
large corpus offers not just more
data but different kinds of data:
(a) We find numerous items that did not appear at all
in smaller ones.
(b) We can make more informed judgements about
relative frequency. Of two items appearing only once in a small corpus, the one
might still appear only once in a larger corpus and the other fifteen or twenty
times.
(c) The larger corpus will display the data in
steadily finer degrees and differentiations of detail. An item which appeared
only once in a small corpus may appear in several distinctive variants in a
large one.
In these ways, each increase in magnitude can reveal hosts of fresh and
more detailed regularities that were simply not noticeable before, nor are they
readily open to unaided intuition and introspection (§ 27, 52f, 55, 63). They
still have to be interpreted, but — in marked contrast to non-corpus linguistic
methods — the outcome is quite amenable to convergence and consensus (§ 4, 6,
15, 17ff, 20, 22, 27f, 39, 43, 46, 48, 50, 52-55, 62, 64f, 72-75).
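Points (a) and (b) can be illustrated with a toy simulation. This is not BoE data: the sketch assumes, purely for illustration, that word probabilities fall off as 1/rank (a Zipf-like model), and the function name `zipf_corpus`, the vocabulary size, and the corpus sizes are all hypothetical choices. The small corpus is taken as the first part of the large one, so the two can be compared directly.

```python
import random
from collections import Counter

def zipf_corpus(n_tokens, vocab_size=50_000, seed=0):
    """Draw n_tokens word ids with probability proportional to 1/rank.

    The Zipf-like weighting is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    ranks = list(range(1, vocab_size + 1))
    weights = [1.0 / r for r in ranks]
    return rng.choices(ranks, weights=weights, k=n_tokens)

large = zipf_corpus(200_000)
small = large[:10_000]  # the small corpus is a subset of the large one

small_counts = Counter(small)
large_counts = Counter(large)

# (a) items that did not appear at all in the small corpus
new_types = len(large_counts) - len(small_counts)

# (b) items seen exactly once in the small corpus: their counts in the
# large corpus are not simply "once times twenty"
singletons = [w for w, c in small_counts.items() if c == 1]
singleton_counts = sorted(large_counts[w] for w in singletons)

print("types in small corpus:", len(small_counts))
print("types in large corpus:", len(large_counts))
print("types new to the large corpus:", new_types)
print("large-corpus counts of small-corpus singletons range from",
      singleton_counts[0], "to", singleton_counts[-1])
```

Even under this crude model, items that look identical in the small sample (one occurrence each) separate into rare and fairly common ones as the sample grows. Point (c), the emergence of distinctive variants, depends on real contexts and is not captured by a simulation of this kind.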
25. Conversely, the corpus shows that examples we
might intuitively accept at face value are not typical of actual usage. Our
beloved evergreens like those cited in § 13:
(3) The man hit the ball.
(4) John is easy to please/eager to please.
(5) The cat sat on the mat.
do not appear in the BoE, not because they aren’t
properly ‘grammatical’ or ‘well-formed’ English but because they aren’t ‘natural’: typical contexts of real
discourse require less simple-minded and peremptory utterances. In the BoE,
nobody at all is said to be ‘easy to please’. For ‘eager to please’, three
instances appear (6-8), each with a direct object for ‘please’ that was missing
in (4) and with more interesting agent-subjects than our insipid friend ‘John’.
Even allowing for intervening items, the only combination of ‘man + hit + ball’
was (9); ‘man + hit’ alone returned only (10), where the sense of ‘hit’ adapts
to ‘jackpot’. For ‘cats sitting on mats’, the only attestations were
derivations from the use of this trite example in schoolbooks or logicians’
debates, e.g. (11-13), rather than being assertions about any real cat.[3]
(6) < a government official who is eager to please
the wealth goddess >
(7) < the Sandinistas. The government is eager to
please the Church >
(8) < show a sociable child who is eager to please
or charm those around him >
(9) Yes. Doesn’t that man hit the ball hard?
(10) Where can a con-man hit the biggest jackpot? In
politics
(11) On the first page was a drawing of a brindled cat
seated on a recognisable mat, the original ‘cat on the mat’ now quoted in
derision of an antiquated method of teaching
(12) so if you have <ZF1> a <ZF0> a man on
the roof [pause] er erm erm a cat on a mat er a tree on a mountain top a boy
sitting on a tree branch these all involve
(13) material-objects statements, ‘There is a cat on
the mat’, statements about people in novels, statements of mathematics
We shouldn’t regard the grainy details of the real data as a mere
obstruction to be filtered out by rarefying and decontextualising (in the sense
of § 3). Instead, we should respect the ‘naturalness’ of real data because,
unlike the ‘grammaticalness’ or ‘well-formedness’ of the formalists, it has been decided for us by real users of
the language (cf. § 63f). We want to account for the ‘competence’ real
users not just possess but display
when doing this; and there, ‘well-formedness’ has no overriding priority (§ 46,
48, 53, 55).
26. Now, corpus displays are in some sense frankly
‘surface data’, but, exactly because the data are not severed from their
contexts, it is easier to assess what sorts of ‘shallower’ or ‘deeper’
constraints might apply. Even on the surface, a corpus displays to the
investigator not just words but collocations,
to adopt Firth’s (1968 [1952-1957]: 106ff, 113, 182) well-known term: ‘words’
considered in ‘the company they usually keep’, i.e., typical word combinations
that would not usually qualify as idioms or standing phrases (cf. § 31, 33, 52,
55, 60, 66, 69, 77ff). Also, the data can be accessed in somewhat ‘deeper’ ways
by means of the search software, so that, for example:
(a) The collocation need not be invariant or
continuous but may contain varying interposed words (up to 4 in the BoE), e.g.
‘on the mat’ in (5) versus ‘on a recognisable mat’ in (11).
(b) We can sort out words that could belong to more
than one word-class, e.g. ‘warrant’ as either a noun or a verb.
(c) We can use uncommitted characters to search for a
stem with all its endings, e.g. to compare ‘logic’ with ‘logical’, which turn
out to collocate rather differently.
(d) We can make nested sub-displays to zero in on
possibly significant combinations in the general display, e.g., to go from
‘warrant’ to ‘warrant + investigation’.
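Search options (a), (c), and (d) can be approximated in a few lines of code. What follows is only a minimal sketch in Python, not the BoE software itself: the function names `find_collocation` and `find_forms`, the crude tokeniser, and the small sample (adapted from examples (9) and (11-13), plus two made-up phrases for the stem search) are all assumptions introduced for illustration. Option (b) would need part-of-speech tagging and is not attempted here.

```python
import re

# Rough tokeniser: lower-case words, keeping internal hyphens and apostrophes.
TOKEN = re.compile(r"[a-z]+(?:[-'’][a-z]+)*")

def tokenize(line):
    """Lower-case a line and split it into rough word tokens."""
    return TOKEN.findall(line.lower())

def find_collocation(lines, first, second, max_gap=4):
    """Keep lines where `second` follows `first` with at most `max_gap` words between."""
    hits = []
    for line in lines:
        tokens = tokenize(line)
        for i, tok in enumerate(tokens):
            # The slice covers the interposed words plus the slot for `second`.
            if tok == first and second in tokens[i + 1 : i + 2 + max_gap]:
                hits.append(line)
                break
    return hits

def find_forms(lines, stem):
    """Collect every word form beginning with `stem` (a stand-in for wildcard search)."""
    forms = set()
    for line in lines:
        forms.update(tok for tok in tokenize(line) if tok.startswith(stem))
    return forms

# Sample lines adapted from examples (11), (12), (13), and (9) in the text.
sample = [
    "a drawing of a brindled cat seated on a recognisable mat",
    "a cat on a mat er a tree on a mountain top",
    "There is a cat on the mat",
    "Doesn't that man hit the ball hard?",
]

tight = find_collocation(sample, "on", "mat", max_gap=1)     # at most one interposed word
flexible = find_collocation(sample, "on", "mat", max_gap=4)  # up to four, as in option (a)

# A nested sub-display (option d) is simply a search within a search.
nested = find_collocation(flexible, "cat", "mat")

# Hypothetical phrases, made up only to show the stem search of option (c).
forms = find_forms(["the logic of the case", "a logical step"], "logic")
```

With `max_gap=1` the line containing ‘on a recognisable mat’ drops out, whereas the wider window retains it; this is the sense in which a collocation ‘need not be invariant or continuous’.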
27. The most ‘surface’ use of the large corpus is to
enable accurate judgements about the frequencies
of words or word-combinations — a familiar tactic in ‘computational
linguistics’. A far ‘deeper’ and more revealing use of the corpus is to detect tendencies rather than just frequencies,
so that we can assess why certain
combinations occur and not just how often.
Paradoxically, sorting vast quantities of real data allows unexpected
convergences to emerge within the regularities underlying this huge variety
(cf. § 43f, 72). Among all of the possible combinations of English words and
phrases that might be intuitively judged ‘grammatical’, we can finally see
which ones are more likely to be realised and at least some of the reasons why.
28. The main challenge now is how to identify and describe the constraints
whose effect the corpus-displays allow us to inspect (cf. § 73). The
constraints are all functional in the
broadest sense, i.e., related to what people do with their language (§ 5); any formality we may distil out is
derivative upon that functionality and cannot be consensually accounted for
without it (cf. § 54f; Beaugrande, in press). Moreover, functional constraints
need not fit neatly into the formal linguistic schemes devised for ‘language by
itself’ — not a surprising finding, perhaps, but an immensely significant one
(§ 34-46).
29. My demonstration here will be the Bank of English
corpus data on the English verb ‘warrant’. The BoE returned a total of 392
lines centring on that key-word as a verb. To get a more manageable and
productive sample, I made a hand-sorted selection of 228 lines by eliminating
repetitions, e.g. when a statement by a politician got reported in several
media, and false alarms where the key word was actually a noun.[4] Selecting the verb allowed me to disregard the
numerous noun occurrences in stock phrases like ‘search warrant’, ‘death
warrant’, or ‘warrant for arrest’.
30. The word has a venerable history related,
according to Walter Skeat’s (1970 [1879-1882]: 702) Etymological Dictionary of the English Language, to the word
‘guarantee’. As a verb, we find such usages attested in the Oxford English Dictionary (pp. 930ff)
as: ‘to keep safe from danger’ (14); ‘to guarantee goods to be of the quality,
quantity, etc. specified’ (15); ‘to give a personal assurance of a fact’ (16),
‘chiefly in “I (I’ll) warrant you”’ (17); and ‘to authorise, sanction a course
of action’ (18).
(14) What good Man was he that from deth warawnted
thee? (Henry Lovelich, Merlin, 1450)
(15) This Ryche man thenne sold his oylle to the
marchaunts and waraunted eche tonne al ful (William Caxton, The subtyl historyes and fables of Esope,
Auyan, Alfonce, and Poge, 1484)
(16) Bot for to lere him I warand, Als mekil als he
mai vnderstand (The proces of the seuyn
sages, 14th century)
(17) There be many such I warrant you yt neuer
cum to light (Thomas More, A dyaloge
wherin he treatyd dyvers maters as of the veneration and worshyp of ymagys etc.,
1528)
(18) The Lord warrants us to suspect the inconstant
(Daniel Rogers, Naaman the Syrian, his
disease and cure, 1642)
These samples from the 14th to the 17th centuries suggest a gradual
widening away from official discourse, and a drift toward the modern usage
displayed by the BoE corpus, as we shall see.
31. A first heuristic for identifying the more
interesting collocations in the BoE is to list in the order of frequency the
most common words within the set of lines returned. Many of those near the top
of the list, such as ‘of’ or ‘to’, will seem unenlightening in the early
stages, but at least some of the more suggestive words can turn up:[5] among the nouns, ‘evidence’ (21 occurrences),
‘investigation’ (12), ‘trial’ (7), ‘attention’ (9), ‘circumstances’ (8),
‘concern’ (6), ‘mention’ (5), ‘consideration’ (5), ‘punishment’ (5),
‘intervention’ (4), and ‘conditions’ (3); among the modifiers, ‘enough’ (58),
‘sufficient’ (27), ‘serious’ (14), ‘really’ (7), ‘certainly’ (6), ‘important’
(5), ‘severe’ (5), and ‘trivial’ (4).
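This first heuristic amounts to a plain word count over the returned lines. As a minimal sketch, assuming a crude tokeniser and using only four lines quoted elsewhere in this article rather than the full BoE display, it might look as follows in Python. The function name `frequency_list` and the stoplist of function words are hypothetical choices made for this illustration.

```python
import re
from collections import Counter

# Rough tokeniser: lower-case words, keeping internal hyphens and apostrophes.
TOKEN = re.compile(r"[a-z]+(?:[-'’][a-z]+)*")

def frequency_list(lines, key, stoplist=()):
    """Count every word in the lines except the key word and any stoplisted words."""
    skip = {key, *stoplist}
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in TOKEN.findall(line.lower()) if tok not in skip)
    return counts

# Four lines quoted in this article; far too few to be representative.
lines = [
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
]

# Raw list: function words such as 'a', 'to', or 'of' crowd the top.
freq = frequency_list(lines, "warrant")

# Hypothetical stoplist, so that the more suggestive words can surface.
function_words = {"a", "as", "do", "in", "not", "of", "own", "that", "the", "their", "to"}
content = frequency_list(lines, "warrant", stoplist=function_words)

print(freq.most_common(5))
print(content.most_common(5))
```

Whether to filter function words at all is a judgement call: the negation ‘not’ would be lost under the stoplist used here, although it turns out to matter a good deal for ‘warrant’.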
32. A second heuristic is to create a positional
frequency table in which the words in the several slots to the left and right
of the key word are displayed in descending order of frequency. The table below
shows the data for ‘warrant’.
3 to the left   2 to the left   1 to the left   word      1 to the right  2 to the right  3 to the right
sufficient      enough          to              warrant   a               the             of
enough          evidence        not             warrant   the             investigation   the
serious         did             't              warrant   an              a               in
too             do              would           warrant   it              <t>             a
the             does            might           warrant   such            attention       <t>
and             not             really          warrant   any             of              but
that            as              that            warrant   further         action          action
not             didn            yet             warrant   this            trial           <LTH>
sufficient      may             should          warrant   that            with            and
in              doesn           search          warrant   his             and             to
is              nothing         and             warrant   to              more            trial
was             the             will            warrant   some            special         that
it              and             circumstances   warrant   their           even            for
of              seem            arrest          warrant   no              mention         into
which           to              could           warrant   another         intervention    by
but             trivial         can             warrant   my              's              it
good            that            may             warrant   its             it              is
done            will            soon            warrant   for             because         than
<h>             so              'll             warrant   more            than            some
a               small           conditions      warrant   concern         new             as
's              seemed          germane         warrant   officer         an              an
be              they            death           warrant   </h>            further         here
important       appear          certainly       warrant   one             sort            then
These data too are at best suggestive, and for much the same reason that
purely formal syntax readily becomes convoluted or opaque: many words or
word-classes are fuzzy in respect to their mutual positions; and functional
relations need not show up as formal ones. The frequent negations (‘not’, ‘-t’,
‘didn’, ‘no’, and, by implication, ‘too’) are scattered over four positions
(cf. § 36, 42). And some of the most revealing data don’t appear at all, either
because their position isn’t consistent enough, e.g. ‘situation’; or because a
shared semantic concept is lexicalised in various ways, e.g. ‘disability -
distress levels - ill health - medical problems’.
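The construction of such a positional frequency table can be sketched briefly. The Python fragment below is an illustration under stated assumptions, not a description of the BoE software: the function name `positional_table`, the tokeniser, and the choice of four sample lines quoted elsewhere in this article are all introduced here for the purpose.

```python
import re
from collections import Counter

# Rough tokeniser: lower-case words, keeping internal hyphens and apostrophes.
TOKEN = re.compile(r"[a-z]+(?:[-'’][a-z]+)*")

def positional_table(lines, key, span=3):
    """Map each offset (-span..-1 and 1..span) to a Counter of the words found there."""
    table = {off: Counter() for off in range(-span, span + 1) if off != 0}
    for line in lines:
        tokens = TOKEN.findall(line.lower())
        for i, tok in enumerate(tokens):
            if tok != key:
                continue
            for off in table:
                j = i + off
                if 0 <= j < len(tokens):  # skip slots that fall outside the line
                    table[off][tokens[j]] += 1
    return table

# Four lines quoted in this article; a real table would draw on hundreds.
lines = [
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
]

table = positional_table(lines, "warrant")

# Print each slot with its words in descending order of frequency.
for off in sorted(table):
    print(off, table[off].most_common(3))
```

Even in this tiny sample, ‘circumstances’ sits three slots to the left in one line and four in the other, so that only one of its two occurrences is counted; this is the fuzziness of position that keeps ‘situation’ out of the table above.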
33. A third heuristic offered in the BoE software
sorts the lines by the alphabetical order for a given position to the left or
right of the key word. This tool works best in bringing out data about items
whose position is relatively fixed, e.g. the extreme frequency of ‘to’ in the
infinitive (top item before ‘warrant’ in the positional frequency table). But
user-performed hand-sorting is needed for groupings wherein the essential items
and collocations occupy more flexible positions, e.g. ‘serious’; or where
groupings are to be made by semantic criteria, e.g. ‘investigation’ with
‘inquiry’. I worked out three hand-sorted displays and added bold italics to
highlight the items that I chiefly relied on while doing the sorting and
alphabetising:[6] one for what does or doesn’t do the
‘warranting’, one for what is or is not ‘warranted’, and one for the relevant
criteria. Samplings from these three displays are given in Appendices A, B, and
C.
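The software-side sorting, as distinct from the hand-sorting, is easy to sketch. The following Python fragment is a minimal illustration, assuming a crude tokeniser and four lines quoted elsewhere in this article; the function name `sort_by_position` and the decision to place lines with an empty slot at the end are hypothetical choices, not features documented for the BoE software.

```python
import re

# Rough tokeniser: lower-case words, keeping internal hyphens and apostrophes.
TOKEN = re.compile(r"[a-z]+(?:[-'’][a-z]+)*")

def sort_by_position(lines, key, offset):
    """Sort lines alphabetically by the word `offset` slots from the first `key`."""
    def slot_word(line):
        tokens = TOKEN.findall(line.lower())
        if key not in tokens:
            return (1, "")            # lines without the key word go last
        j = tokens.index(key) + offset
        if 0 <= j < len(tokens):
            return (0, tokens[j])
        return (1, "")                # so do lines where the slot is empty
    # sorted() is stable, so ties keep their original corpus order.
    return sorted(lines, key=slot_word)

lines = [
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
]

# Sort on the second slot to the left of the key word.
by_left2 = sort_by_position(lines, "warrant", -2)
for line in by_left2:
    print(line)
```

Sorting on a fixed slot groups ‘do not warrant’ neatly, but it separates ‘sufficient to warrant’ from any line where ‘sufficient’ stands further off, which is why groupings by flexible position or by meaning still call for hand-sorting.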
34. These displays begin to reveal the various types
of constraints. Some constraints might be provisionally stated according to the
familiar schemes of different ‘levels’ or ‘components’ of ‘mainstream
linguistics’. For phonology, the intonation would be distinctive for the
performative ‘warrant’ in relatively rare locutions like ‘I’ll warrant’ used
when you want to indicate you feel sure about something though you can’t point
to actual facts (cf. § 64):
(19) If I had ten thousand men like him tomorrow then I
warrant we’d see Napoleon beat by midday [quoting the Duke of Wellington.]
(20) The soil may look innocuous enough when you’ve
dug it over but I’ll warrant it’s teeming with root-eating wireworms.
(21) I’ll warrant I even heard Honey Bane shuffling by
somewhere in the background of a song that will provide the perfect soundtrack
for when your mum won’t let you out of your room until you’ve done your
homework.
A sample like (21) looks quite complex (with quadruple ‘embedding’) in
comparison to the usual invented sentences like (3-5) in § 13 and 25, but in
actual discourse it should present no difficulties for comprehension, even for
the young and not very intellectual readers it addresses.
35. For morphology, we might note the overwhelming frequency of non-finite
forms, either in infinitives with ‘to’ (136 occurrences) or with some modal
verb (58) (cf. § 42). Also, several Latin/French-based prefixes among the
semantic processes may be significant: ‘ad-’ for moving toward
something: ‘action, appeal, appellation, assistance, attention’; ‘com-’ or
‘con-’ for acting, happening, or bringing together: ‘collection, commitment,
complaints, conclusion, conditions, consideration, conspiracy, consultations’
plus the Anglo-Saxon ‘with-’ in ‘withdrawals’; ‘de-’ and ‘dis-’ for uncovering
or invalidating something: ‘declines, definition, developments, disability,
distress’; ‘e-’ or ‘ex-’ for getting outside: ‘event, evidence, examination,
exclusion, expansion, expenditure, extension’, plus the Anglo-Saxon ‘out-’ in
‘outburst’; negating ‘im-’ or ‘in-’ for something that is not as it should be:
‘impropriety, indeterminate, insufficient’, plus the Anglo-Saxon ‘un-’ in
‘uncharacteristic, uncovered, unimportant, unorthodox, unsatisfactory,
unspecifiable, untutored’; ‘in-’ and ‘inter-’ for getting inside or between:
‘inquiry, interception, interference, intervention, introducing, investigation
into war crimes, inclusion in the wheelchair, internal matters that warrant no
outside interference’; ‘re-’ for following up or going back toward something
previous: ‘recession, record, recording, relaxation, relief, respect, response,
retaliation, retrospective, return, revelations, revision’ (cf. § 41).
36. For syntax or ‘grammar’, we could note the extreme
dominance of third person subjects (224 occurrences), as opposed to just 4 in
first person (compare samples (19-21)) and none at all in the second person;
and, within the third person, the mere handful of pronoun subjects ‘he’ (6
occurrences), ‘she’ (0), ‘they’ (5), and ‘it’ (7), as contrasted with the large
numbers of noun subjects (§ 42). Or, we might note the high proportion of
negations attached to the verb: ‘not, don’t, didn’t, not yet, hardly, not
really’ (cf. § 32, 42).
37. For semantics, we could note that many of the
subjects and direct objects fall into associative classes that are not unduly
hard to label, e.g.:
(a) as subjects:
actions: ‘achievement,
aggressions, behaviour, blow, brawl’; resources:
‘abilities, acreage, growing area, scrappable cars’; knowledge: ‘evidence, information,
perception, scientific authority’; messages:
‘accusations, complaints, juicy stuff, message, piece of tittle-tattle,
revelations’; problems: ‘air
leaks, ambiguity, antitrust conspiracy, casualty rate, chilly old homes,
degenerating trees, disability, discriminatory practices, distress levels, food
shortage, ill health, impropriety, job bias, slowing in the economy, violence’;
(b) as direct objects: (in)appropriate reactions: ‘(further)
action, change, commitment, conclusion, consideration, expansion, extension,
formation, increases, motion, (cautious) move, plan, step, signing, treatment’;
consumption of resources: ‘cost,
expenditure, loss of any troops’ lives, overeating, paying the steeper taxes,
shelf-space’; messages: ‘apology,
appellation, billing, briefing, brochure, column inches, comment, description,
footnote, mention, phrase, satire, serious talk, suggestion, talking-to’; knowledge-gathering: ‘airing,
attention, consultations, examination, hearing, inquiry, investigation, retrospective
survey, review, [legal] trial, [medical] trials’; solving problems: ‘answering machine, (charitable/economic)
assistance, breaking the embargo, easing of interest rates, full-time
custodian, guests wearing thermal long johns, intensive care, introducing more
elaborate feeding, (professional/prompt/surgical) intervention, making peace,
mid-season break, opening of a new peat extraction plant, revision, sending of
those supplies, using these drugs’; retaliating:
‘banning the show, charge(s), God’s anger, jail time, lengthy ban, massive
American retaliation, penalties, pre-emptive strike, (criminal) prosecution,
(capital) punishment, retribution, [legal] trial’.
Such groupings overlap, since a broad category like ‘(in)appropriate reactions’ can reasonably
include narrower ones like ‘knowledge-gathering’, ‘problem-solving’, and
‘retaliating’. Still, we can make a modest ‘semantic table’ showing the typical
correlations between subject-groupings and object-groupings, e.g.:
subject-groupings    object-groupings
actions              (in)appropriate reactions
resources            consumption of resources
messages             messages
knowledge            knowledge-gathering
problems             problem-solving
It seems plausible that a given parallel across our columns might show
up in the data on the same line, as we see at once for ‘evidence’ (knowledge)
plus ‘investigation/trial’ (knowledge-gathering). But this co-occurrence of
semantic groupings on one line is by no means a rule. We can also have, say, an
action as subject and a message about it as direct object, e.g. when an
‘operation warrants a middle-of-the-night briefing’. Or, the context for one
grouping may imply another, as when knowledge-gathering in legal contexts
implies a retribution, e.g. the condemnation and punishment likely to follow
upon a ‘trial’. Or again, some people consider legal punishments a type of
problem-solving, despite the scant evidence that the ‘problem of crime’ is
being solved in this way.
38. The constraints of context soon impel us beyond
the customary borders of semantics. An abstract scheme of ‘semantic features’
would presumably suggest making a separate class for general nouns, some of
which appear as frequent subjects in our data: ‘behaviour, circumstances,
conditions, contemporary events, incident, occasion, operation, qualities,
situation’. But none of these remains general in the context. Most of them
carry a pejorative implication, i.e., that the ‘behaviour, circumstances’, etc.
involve some problems. If we read that ‘circumstances do not warrant a change in
the leadership’, we can assume that one or more ‘leaders’ do not seem to have
been acting as they should and that somebody wants to reassure us. Or, if we
read that ‘circumstances simply do not warrant charitable assistance’, we can
assume some people are in financial difficulties while other people with money
are, in the finest Tory tradition, excusing themselves from helping out.
39. We can see here a major difference between
conventional abstract semantics and corpus-driven semantics, one which Sinclair
(1994) has pointed out. Most of what passes for generality, vagueness, or
ambiguity in the meaning of language and impels semanticists to build finicky
sets of rules to eliminate it, evaporates when we look at suitably sorted real
data. So we may well feel uneasy about approaches that expressly declare it the
job of semantics to ‘disambiguate’ sentences or sequences that allow for more
than one interpretation (§ 43). Quite plausibly, the ambiguity is largely an
artefact of using isolated and invented data. We might recall here the contrast
between invented simple sentences like (3-5) in § 13 and 25 versus authentic
and elaborate real data such as (19-21) in § 34. Again, trying to filter
language to the point of enabling a formalist description erodes the
constraints that are urgently needed for convergence and consensus (cf. § 4, 6,
15, 17ff, 20, 28).
40. For pragmatics, finally, we could note the
explicit performative ‘warrant’ when the speaker is also the subject, as in
(19-21) in § 34. Less explicit but far more common and influential is the
pragmatic force entailed in declaring what does or does not ‘warrant’ what.
This force carries the implication that the event or state of affairs that
might do the ‘warranting’ is in some way unusual or significant enough that a
reaction might well be in order, and that those who might be expected to do the
reacting are likely to say why or why not they are going to, and how.
Accordingly, the speaker — or, when the discourse is reported, the originator
of the message — is likely to be a person who represents some institution or
authority, and our data suggest what kind: government, judiciary, military,
sports, business, science, and medicine. Or if the person does not, then the
use of ‘warrant’ implies a subtle signal that authority is being claimed
anyhow; we see this use among journalists and media persons when they are not
reporting what other people said. Uses like ‘the Chevrolet Beretta does not
warrant particular mention’ or ‘the documentary wouldn’t warrant more than a 4’
are inconsequential magisterial pronouncements merely aping genuine authority
with real consequences, e.g., medical judgements about whether ‘problems
warrant surgery’ or ‘drugs’.
41. I have followed through the familiar linguistic
‘levels’ or ‘components’ to suggest that each of them contributes a set of
constraints on the verb ‘warrant’. But taken by itself, each set is weak and
some may seem unduly speculative. For example, citing the frequency of prefixes
as morphological units (§ 35) might seem to be overinterpreting merely
coincidental or antiquarian materials, were it not for the semantic and
pragmatic constraints indicating that ‘warranting’ often does involve
situations in which people act together (viz. ‘commitment, complaints, consideration,
conspiracy, consultations’); or where something is not what it should be (viz.
‘impropriety, insufficient, unimportant, unorthodox, unsatisfactory,
untutored’); or where people want ‘inside’ knowledge (viz. ‘inquiry,
investigation’) or want to break ‘in’ on the chain of events (viz.
‘interception, interference, intervention, introducing’); and so on. Suggestive
too are some less frequent semantic combinations, e.g., that ‘assistance’ and
‘assistant manager’ both appear as ‘warranted’ solutions to problems. The
question of whether such accumulations or combinations reflect the design of
the language or the speaker’s choice still needs to be determined; but without
the corpus data display, we wouldn’t have occasion to pose the question at all.
42. Considering pragmatics clearly helps in
appreciating the significance of several ‘grammatical’ or syntactic
accumulations. Foremost among these is the high frequency of negations (§ 32,
36), signalling how often the potential reactors feel impelled to declare that
a predictable or reasonable reaction will not
take place. Or (to include morphology here), the frequency of infinitive forms
reflects the specification of the criterion for making such a declaration, e.g.,
that things are ‘too small, trivial’ etc. or ‘not serious, severe, etc. enough’
‘to warrant’ something. Or again, the
frequent modal verbs like ‘may’ (14), ‘must’ (11), ‘would’ (10), ‘will’
(7), ‘might’ (5), ‘should’ (4), ‘can’ (3), ‘could’ (3), and ‘shall’ (1) in a
total of 58 lines, plus ‘seem’ (8) and ‘appear’ (2), all have the function of
attenuating the pragmatic force and conceding that other people might reach
different conclusions about the ‘warranting’. The same function is at stake in
the use of interrogatives, as in ‘Did he warrant the harsh punishment of
exclusion?’; and of dependent clauses with the force of interrogatives, as in
‘specify what kind of cases would warrant capital punishment’. Or again, the
low number of personal pronouns (§ 36) as subjects reflects the semantic and pragmatic
constraint that actions and situations are more likely to be said to ‘warrant’
something than people are.
43. When we are describing real data, the interaction between semantic
and pragmatic constraints is often so intense that there are only weak
indicators of which is which. How can we, say, keep our semantic understanding
of a general noun like ‘circumstances’ apart from our pragmatic understanding
of the force entailed? The constraints from knowledge of world and society,
which ‘mainstream’ linguistics sought to detach from the constraints on
language (cf. § 2, 6, 11f, 14, 19f), are absolutely crucial for interpreting
such data, but are by no means easy to formalise as ‘rules’ (§ 56). We appear
to be dealing with numerous local interactions among constraints that support
sophisticated higher-level organisation, as in a complex system with
distributed parallel processing (cf. Rumelhart, McClelland, et al. 1986;
Beaugrande, in preparation). What appears to be a single constraint in an
actual context might rather be a pattern of such interactions. If so, the standing internal constraints upon the
language, e.g. that the English infinitive be formed from ‘to’ + non-inflected
verb, are like the ‘frozen islands’
in a complex system and continually interact with emergent external constraints from world and society during
discourse, e.g. that something is or is not ‘warranted’ by a combination of
situation (e.g. ‘circumstances’, ‘conditions’) + sufficiency (e.g. ‘enough’,
‘sufficient’) + gradable modifier (e.g. ‘serious’, ‘severe’) (cf. § 46, 53, 68).
This interaction supports a convergence
among the various modes of data and a consensus
among speaker and hearer or writer and reader. If, as formalist linguistics
sought to do, we detach language from the constraints from world and society
and retreat from real data, the emergent constraints get diluted or lost, and
we face the awesome task of trying to ‘freeze’ the entire system — a sort of
‘cryogenic linguistics’ building a ‘cryogenerative grammar’. Convergence and
consensus recede, and the data begin to appear vague and ambiguous, sending us
off in search of complicated formal rules which, being devised in a relative
vacuum, are naturally arbitrary and ponderous (cf. § 39).
44. Moreover, the emergent
external constraints may be quite flexible about formal positions. They can
generate rich strands of semantic relatedness among items at various locations
in the sequences showing up in our data lines. In some lines, we encounter
items together that might be said to belong to the same semantic field, e.g.
‘chilly - thermal’, ‘economy - interest rates’, ‘shortage - embargo’, or
‘slowing - easing’. In other lines, we find the ‘attraction’ of a specific item constraining a general one. In ‘forward attraction’, the specific comes
first and specifies the general after it, e.g., in ‘alcohol - taxes’ (hence not
value added taxes), ‘degenerating trees - specialist’ (hence not an eye
specialist), ‘medical - drugs’ (hence not psychedelics), ‘violence - security’
(hence not a bond), ‘worshippers - huge edifice’ (hence a church or shrine). In
‘backward attraction’, the general
comes first and gets specified further on, e.g. in ‘declines - recession’,
‘operation - intensive care’, or ‘sites - custodian’; in cases like ‘air leaks
- military interception’ and ‘inclusion - wheelchair’, the specific emergent
constraints run counter to the standing constraints on the general item, i.e.,
an ‘air leak’ being in a sealed container, or ‘inclusion’ being ‘making
something part of a larger thing’ (Collins
COBUILD English Language Dictionary, p. 736). In either direction, the
formal distance between the items can vary quite freely.
45. Should these data be considered ‘purely semantic’ when so much
depends on our pragmatic knowledge of the situations in which people say that
things are or are not ‘warranted’? Should uses like ‘air leak’ and ‘inclusion’
be classed as semantically deviant or deficient because they go against the
standing constraints, even though we can readily understand them if we consider the
speaker’s motivations, e.g. to arouse the impression that a ‘no-fly zone’ in a
war is virtually air-tight, or to avoid a more usual but harsher term like
‘confinement’? Should we devise ‘semantic rules’ that first compute the typical
meaning and then go on to compute the deviant meaning? How about cases where
the data seem plainly misleading, e.g.:
(22) < as a major threat sufficient to warrant a
pre-emptive strike of their own. >
(23) < stories of ill health that appear to warrant
surgical intervention. Frequently >
This ‘major threat’ in (22) differs from the standing constraints on the familiar
speech act of ‘threatening’ in that the agent may have done or said nothing
implying any intention to cause harm. Yet our social knowledge is quite
familiar with the high-tech jargon from the age-old military and political
discourse that disguises aggression as defence. Or, the ‘appear’ rather than
‘appears’ in (23) oddly suggests that surgery is to be performed on ‘stories’
or ‘story-tellers’ rather than on the people in ‘ill health’; but
world-knowledge prevented both the text producer and the news editors from
noticing this suggestion.
46. The overall conclusion would be that the familiar linguistic
‘levels’ or ‘components’ are designations not for neatly distinct sets of formal abstract data but for sets of functional standing constraints operating across sets of real
data and generating emergent constraints. Since this process supports the
convergence among the various modes of data and the consensus among speaker and
hearer or writer and reader (§ 43), a linguistic description can itself attain
convergence and consensus not just by sorting data into separate piles, one for
each set, but by assessing the interactions among these sets (§ 50). Even my
brief demonstration should suffice to show that the form of the data may seem
highly variable and at times utterly idiosyncratic unless we continually
examine the relevant functions. Formulating ‘formal rules’ that draw a rigorous
border between what can ‘warrant’ what versus what cannot in any ‘well-formed’
English sentence only leads to finicky debates over examples and
counter-examples and misrepresents the ‘competence’ of English speakers (§ 53).
They do not know what can and cannot be ‘warranted’ once and for all, but
they do know what sorts of things people are likely to say are or are not
‘warranted’ and why; and that is the knowledge put to use by the people who
produce and understand real data.
C. Some implications of
corpus linguistics for linguistic theory
47. Our situation today recalls a complaint once
voiced by Saussure (1966 [1916]: 106): ‘It is one thing to feel the quick,
delicate interplay of units and quite another to account for them through
methodical analysis’. Corpus data reveal far more numerous and more ‘delicate
interplays’ than Saussure, with his deep mistrust of ‘actual speech’ (§ 1),
could have imagined, and they are pressuring us to develop suitable methods of
analysis and a more functional and realistic theoretical ambience (cf. Baker et
al. [eds.] 1993). In this final section, I shall explore some factors bearing
upon such a theoretical ambience and relate them to the theoretical problems
aired in section A.
48. Against the backdrop of my forceful articulation of these problems, it may seem odd if I sound optimistic. But the chances for ‘mainstream’ linguistics to make major pr