Journal of Pragmatics 25, 1996, 503-535.

The ‘Pragmatics’ of Doing Language Science:

The ‘Warrant’ for Large-Corpus Linguistics

 

Robert de Beaugrande

 

[abstract]

 

The development of ‘mainstream linguistics’ in this century is briefly retraced to suggest that the original decision to describe ‘language by itself’ as opposed to ‘language in use’ favoured formalism over functionalism and eventually led to a severe impasse for three tests a valid science of language ought to meet: coverage of language data, convergence among the data being described, and consensus among linguists about how to proceed. The impact of large-corpus linguistics might resolve this impasse and accordingly raises the prospect of a fundamental reorientation of linguistic theory and of the ‘pragmatics’ of ‘doing language science’.

 

A. Testing the progress of ‘mainstream linguistics’

 

1. The term ‘mainstream linguistics’ is sometimes used to designate a language science pursuing a programme based on several generally agreed principles (survey in Beaugrande 1991), notably:

 

(a) Language is a phenomenon distinct from other domains of human knowledge or activity.

(b) A language constitutes a system defined completely by internal, language-based constraints.

(c) A language should be described apart from the conditions under which speakers use it.

(d) The description of a language should be couched in statements at a high degree of generality, if possible about the ‘rules’ for the language as a whole or even about the ‘universals’ for all languages.

 

Within the programme, the tenets interlock in projecting a free-standing and self-sufficient conception of language as a uniform, stable, and abstract system holding still while we are describing it, and separated from the ‘rich’ and ‘messy’ human contexts where it is encountered in ordinary life. Linguistics has thus undertaken to describe a theoretical construct of ‘language by itself’ (‘langue’, ‘competence’, etc.) situated at a safe distance from the empirical realities of ‘language in use’ (‘parole’, ‘performance’, etc.) during communication. Saussure’s (1966 [1916]: 9) defensive mistrust of actual ‘speech’ has proven highly influential (cf. § 17):

 

(1) speech is many-sided and heterogeneous [...] it belongs both to the individual and to society; we cannot put it into any category of human facts, for we cannot discover its unity.

 

2. If we imagine a ‘layer-cake’ with language as a system mediating between a culture’s knowledge of the ‘world’ on one side and the ‘society’ of speakers on the other, the programme of ‘mainstream’ linguistics implies detaching language and rolling away the other ‘layers’ (Fig. 1). Doing so discounts the constraints upon what people say due to what they are talking about and to whom.

The working hypothesis is that once detached, the language system will stand firm: complete and fully organised by its own internal constraints (cf. § 6, 19).

3. Yet what we encounter is always ‘language in use’; even the formal analysis of isolated sentence structures is a mode of use, albeit a peculiar and untypical one (§ 15, 53, 57). To get from ‘use’ over to ‘language by itself’, linguistics has foregrounded data-handling strategies, such as:

 

(a) collating: a large set of data samples are compared and contrasted to distil out what they have in common, e.g., which word types frequently occur with other types;

(b) generalising: certain aspects of the observed data are construed to be general ones, e.g., that the ‘Subject-Verb-Object’ order of a sample set of English sentences is a typical pattern for the language as a whole;

(c) rarefying: the ‘rich’ data as they were observed in spontaneous interaction are made ‘sparse’, e.g., by disregarding the personal authority of speakers;

(d) decontextualising: the data are taken out of the observed context and text type and treated as if they had occurred in isolation or could occur in a wide range of contexts, e.g., irrespective of the social status of individual speakers;

(e) introspecting: the linguists make estimations based on their own intuitions about the language, e.g., which sentences do or do not violate the ‘rules’;

(f) consulting informants: native speakers are given data samples of their language and asked to judge or rate them, e.g., to decide which of two versions of a sentence is ‘grammatical’ or ‘ungrammatical’.

 

These strategies project a second working hypothesis, namely that applying them will lead to a complete and valid description of a given natural language.

4. How might these two main working hypotheses be tested? If it is true that a language system is fully organised by its own internal constraints, and that these strategies can describe it as such, then the three key tests for progress would be steady cumulative rises in (a) the coverage of the language; (b) the convergence among language data discovered and described; and (c) the consensus among linguists about how to formulate the description. Yet if we apply these three ‘C-tests’ to mainstream linguistics, we find a rise only in some domains and a sharp fluctuation in others. What ‘pragmatic’ factors for ‘doing language science’ can have led to this outcome?

5. Probably the most influential factor was that the programme for describing ‘language by itself’ has naturally favoured formalism, the stance construing form to be the basis and framework of language — how entities are shaped or arranged. As we strive to ‘abstract’ language out of everyday contexts, the most stable and reliable substrate naturally appears to be the forms, the patterns of word-stems and suffixes or the patterns of phrases and sentences. This factor discourages functionalism, the stance construing function to be the basis and framework of language — what means are used toward which ends; functional aspects tend to be associated with language in use. So the ‘majority position’ in ‘mainstream linguistics’ has usually been that formalism confers high ‘scientific’ status and that its legitimacy can be taken for granted, whereas functionalism is ‘unscientific’ or ‘pre-theoretical’ and its legitimacy must be expressly justified. In this academic power structure, formalist research is not required to specify its relevance and its ecological validity — whether and how its findings contribute to a general and productive understanding of the human situation — whereas functionalist research is expected either to struggle toward the a priori criteria of rigour, abstractness, generality, and so on, set down by formalism, or else to defend itself for not doing so. So functionalism has been severely held back or has worked at cross-purposes by not following up on its insights and by compromising its own ecological validity in order to compete with formalism on the latter’s terrain.

6. In the long term, the development of mainstream ‘formalist’ linguistics reveals an ominous trade-off: the more formalised a theory of language and its apparatus of terms and formal notations, the less we can expect a steady rise on the three ‘C-tests’ of coverage, convergence, and consensus (cf. § 13, 18, 51f). This trade-off is not unduly surprising, given the robust fact that people, including linguists, are naturally much less skilled at ‘formalising’ natural language data than they are at speaking, hearing, reading, and writing them. After detaching language from the constraints of ‘world’ and ‘society’ and discounting functions in favour of forms, we are left with the huge task of reconstructing or inventing the formal and ‘purely linguistic’ constraints for every sort of regularity we may encounter (cf. § 18). The enterprise of linguistic formalism hinges on the assumption that there exists, for each natural language, at least one set of such constraints strictly separating what belongs to the language from what does not. Decades of formalist research have failed to identify any such set for any language; and the lack of progress on the three ‘C-tests’ strongly argues that no such set exists. Converted into a static and closed formal system, language does not stand firm, complete, and fully organised by its own internal constraints (§ 2); instead, it tends to skid out of control.

7. The ‘formalist’ trade-off has been somewhat obscured by the fact that it holds in differing degrees for the various domains of language. The three ‘C-tests’ are best met in phonology and morphology, which both offer concise methods for segmenting language data so as to isolate and classify minimal units. Phonology has the most clearly defined criteria in the articulatory events and locations that characterise the sound-units called ‘phonemes’, e.g., a ‘voiced dental stop’ such as /d/ produced when the vocal cords vibrate and the air flow is blocked by the teeth. The visual correspondence between many phonemes and written letters of the Roman alphabet also supports the ‘C-tests’, though it has not been made into a theoretical principle, since the description is strictly addressed to spoken language. Thanks to these clear criteria, linguistics soon provided descriptions of the repertory of sound-units in language after language, covering all the phonemes with impressively high convergence and consensus.

8. The success of phonology did much to entrench the concept of ‘language by itself’ being a uniform, stable, and abstract system, or rather a set of subsystems, usually called ‘levels’, each consisting of a repertory of minimal combinable elements. A complete description of a language would be the sum of the complete descriptions for each subsystem, supplied by linguists working within the tidy, ‘pragmatic’ division of labour reflected in academic specialisations, journals, conference proceedings, and course offerings.

9. Yet in morphology, the criteria are already less tidy. Convergence and consensus are fairly high for identifying and isolating the ‘morphemes’, aided again by the visual clarity of the data written down. The analyst segments the written data until no further meaningful subdivisions appear feasible — a method once introduced as ‘immediate constituent analysis’, which, if applied ‘in all observation of word-structure’, Bloomfield (1933: 209, 221) promised, would eliminate any ‘inconsistency of procedure’. His promise rested on a staunchly formalist mandate:

 

(2) Any utterance can be described in terms of lexical and grammatical forms; [and] any complex form can be fully described apart from its meaning in terms of the immediate constituent forms and the grammatical features [whereby these] are arranged (1933: 167).

 

Here, consensus is to be established by sheer stringency of method. Yet recalcitrant problems can arise that would not trouble phonology. We can easily reach a consensus about the human vocal apparatus; and after some simple demonstrations, most speakers will agree that they ‘know’ the repertory of phonemes in their native language, e.g., when they distinguish between voiced and unvoiced consonants. But it’s harder to agree in what sense speakers ‘know’ their native language as a repertory of its minimal meaningful forms. So we are on weaker grounds in claiming that our own consensus as linguists corresponds directly to the consensus of speakers of the language we are describing (cf. § 74). Consider the English morphemes borrowed from French, Latin, and Greek but no longer recognised by all contemporary monolingual speakers. Should a morphological description include not just the more obvious ones like ‘in-’ and ‘im-’ for negation alongside ‘un-’, ‘non-’, or ‘a-’ but also the more erudite ones like ‘pter’ (‘wing’) in ‘helicopter’, where speakers might instead identify the final ‘-er’ as an agentive suffix (compared, say, to ‘lawnmower’)? Doing so would oblige us to turn to language history and thus deviate from the ‘mainstream’ programme to describe language ‘synchronically’ in a single stage of its evolution.

10. Again in contrast to phonology, morphology is quite problematic in respect to coverage. In theory, the entire vocabulary of the language consists of ‘morphemes’ or clusters of these; how could we list them all? The ‘mainstream’ strategy has been to focus on the ones that form stable, compact classes, e.g. the set of all verb inflections, versus the ones that form unstable, open classes, e.g. the set of all verbs or verb stems. Only for the first type could full coverage be attained, whereas the second type could be consigned to the category of ‘lexemes’ to be described in the domain of ‘lexicology’.

11. Still, morphology has fared rather well on the ‘C-tests’ through its close engagement with the real data recorded in fieldwork on previously undescribed languages, which in my estimation has contributed by far the finest achievements in modern linguistics. The fieldworker’s overall task is to progress from being an ‘outsider’ in the community of speakers over to being an ‘insider’ who can speak the language at least well enough to interact with the community and eventually to describe the language. The fieldworker must reach a working consensus with the community or else expect to be misunderstood, ridiculed or ignored. The task is richly supported by ordinary constraints from ‘world’ and society (§ 2, 6), which always apply to real data but which may well not appear in a stringent formal description. To maintain a consensus with the community, the fieldworker can rely on an intuitive, perhaps unconscious grasp of such constraints to produce ‘proper’ utterances, whether their ‘propriety’ is ‘purely linguistic’ and can be ‘formalised’ within a ‘linguistic theory’ or is more cognitive or social.

12. The three ‘C-tests’ get shifted far more radically in the move from morphology to syntax. Units can no longer be isolated by using the criterion of ‘minimalness’; nor does it seem at all feasible to make a complete repertory of syntactic patterns. So phonology and morphology were replaced by syntax at the centre of linguistic theory, and the concept of ‘language’ itself got shifted from a ‘descriptive’ notion of a repertory of units (§ 8) over to a ‘generative’ notion of a repertory of rules for constructing and arranging units. Since the ‘rules’ plainly do not appear in the data, syntactic research relaxed the close engagement with real data as established in morphology fieldwork. Instead of segmenting the language sequences themselves, the task was to devise rules that would ‘generate’ the underlying structure of the sequences. Such a shift did not just leave formalism intact, but actually endowed language data with an enhanced but hypothetical formality. The effect was most striking when semantics was added onto syntax: to maintain the detachment of language from knowledge of the ‘world’ (§ 2, 6), meanings were described as arrays of underlying forms, often called ‘semantic features’.

13. At this point, the ‘formalist trade-off’ described in § 6 began to grow virulent. Disengaging from real data encouraged some influential generative linguists to turn to invented data, whose status as part of the language was certified not by its occurrence in the actual speech of native speakers but by the intuitive approval through introspection. The official rationalisation was that corpuses of real data are inadequate because they are ‘finite’ and ‘accidental’ collections of utterances, whereas speakers of a language can produce or understand many more utterances — presumably an ‘infinite’ set of them (Chomsky 1957: 15). This rationalisation had the labour-saving corollary that fieldwork is not very necessary or helpful: linguists need merely elicit invented data or even — a unique privilege among scientists — invent their own data when they are native speakers of the language, e.g.:

 

(3) The man hit the ball.

(4) John is easy to please.

(5) The cat sat on the mat.

 

Somewhat paradoxically, the ‘normalness’ of such sentences can make them seem odd in comparison to what people actually say (§ 26).

14. Saving labour this way has some severe hidden costs. When linguists are no longer in the concrete fieldwork situation of confronting real data in an unknown language, the task of describing the language is no longer firmly correlated with the task of reaching a working consensus by going from an outsider to an insider in the culture (cf. § 11). This change removes both the most tangible means of testing one’s assumptions and the richest source of constraints from world and society. And when the task of describing freely rides upon the describer’s prior facility and unstated intuitions regarding the language, the linguists are already insiders before they start their work.

15. The impending problems were forestalled by the central formalist assumption that there exists, for each natural language, at least one set of formal constraints strictly delimiting what belongs to the language from what does not (§ 6). When found, such a set would provide total coverage and lead to a formal account both for the convergence among the structures of ‘grammatical’ or ‘well-formed’ sentences and for the consensus among the intuitions of native speakers. Since phonology and morphology had been relegated to the sidelines (§ 12), the constraints could be grouped under the respective headings of syntax, semantics, and pragmatics. Each group of constraints could be identified by selectively violating it, e.g.:

 

(4) John is easy to please. (‘well-formed’)

(4a) *To is please John easy. (‘syntactic violation’)

(4b) ?John is easy to sneeze. (‘semantic violation’)

(4c) ?John, be eased and pleased! (‘pragmatic violation’)

(4d) ?A john is sleazy to fleece. (which violation?)

 

But such demonstrations entail several problems:

 

(a) Insofar as the examples were invented on the spot to demonstrate the rules, they cannot be an independent validation of the rules; and the constraints applying to the act of invention and to its peculiar and untypical purpose — to produce a selective violation — hardly match the constraints that apply to ordinary acts of discourse and to their practical purposes, e.g. to justify what you are doing.

(b) The assessment of a violation depends heavily on the ingenuity of the linguists, e.g., whether they can imagine a situation where it would be appropriate to talk about ‘sneezing John’ (he might be a flu microbe in a children’s story); or where John’s imperious mother might command her son to be pleased about a Christmas gift and to have an easy conscience about not getting her one. To argue that a sentence is disqualified if it was ingeniously devised (as Bierwisch has) is not helpful when we cannot define the threshold where naivety leaves off and ingenuity starts. Even (4) is ingenious in the sense that it was expressly invented to make a point about underlying structure, and is unlikely to be uttered (§ 53, 57).

(c) It is also easy enough to invent examples where it is not clear which group of constraints is violated. (4d) would be such a case, and might yet be contextualised by applying the American meanings of ‘john’ as a ‘toilet’ or a ‘prostitute’s customer’ (Random House Webster’s College Dictionary, 1991, p. 729).

 

16. At all events linguists were dismayed to find a wholly unexpected lack of agreement, both among themselves and among native speakers, about sample sentences. This outcome gave rise to a series of complex rhetorical manoeuvres on two sides. On the side of data, samples were carefully restricted in order to highlight the contrast between clearly proper sentences with seemingly obvious meanings versus clearly improper ones with no sensible meanings, as in (4-4c). On the side of theory, the central formalist assumption was shielded from the implications of observed disagreements. The set of formal constraints was declared to correspond only to ‘competence, the speaker-hearer’s knowledge of his language’ and not to ‘performance, the actual use of language in concrete situations’ (Chomsky 1965: 4). The ‘speaker-hearer’ was in turn declared ‘ideal’: ‘living in a completely homogeneous speech-community’, ‘knowing its language perfectly’, and being ‘unaffected’ by ‘memory limitations, distractions, shifts of attention and interest, and errors’ (1965: 3). In effect, these two declarations instated consensus by decree and converted it into an ‘ideal’ that need not, indeed cannot, be tested against the agreement among speakers.

17. The same rhetorical pressure accounts for the evasive complication opposing ‘surface structure’ to ‘deep structure’ and declaring that ‘the grammar does not, in itself, provide any sensible procedure for finding a deep structure of a given sentence, or for producing a given sentence’ (1965: 141). Moreover, ‘much of the actual speech observed’ was declared to ‘consist of fragments and deviant expressions’ (1965: 201), echoing Saussure’s influential mistrust of ‘actual speech’ (cf. § 1). These further declarations can serve to explain away any discovered lack of convergence among language data.

18. These rhetorical manoeuvres suggest that generative linguists were aware of and disquieted by the ‘formalist trade-off’ (described in § 6) but were determined to rescue the central formalist assumption (also cited § 6) by designing the theory precisely so as to prevent the lack of progress in coverage, convergence, and consensus in respect to real data from counting as a refutation. They correctly speculated that their manoeuvres, even if thinly or speciously argued, would not be critically assessed by colleagues who were (a) firmly committed to the ‘mainstream’ linguistic programme of describing ‘language by itself’, (b) not anxious to undertake painstaking fieldwork in remote places, and (c) mistrustful of actual speech in all its ‘messy richness’. So we can readily understand the success of generative linguistics and its continuation through a long and sometimes arcane series of ‘extensions’, ‘revisions’, or changes of notation without any willingness to change its basic claims about what a ‘language’ is and what a ‘linguistic theory’ should do. Its adherents cannot admit that isolating language from the functional constraints that apply to real data incurs the impossible job of inventing all the formal constraints for all conceivable data, irrespective of whether native speakers would ever utter them. The ‘generative grammar’ would have to reconstruct the formal possibility that speakers could utter or understand them; and no evidence has been brought forward so far that this can ever be done.

19. The conclusion would have to be: if language is detached from the constraints of ‘world’ and society, its own internal constraints are not sufficient to support its organisation (cf. § 2, 6, 11f, 14). Hence, any linguistic description which postulates such a detachment will only be able to cover a part of that organisation and will encounter frequent obstacles to convergence and consensus. This conclusion is borne out by empirical evidence not about the formal structure of sentences but about the ‘pragmatic’ activities of doing language science over the past century. Because ‘language by itself’ was a technical fiction to begin with, theories about it have been obliged to create a proliferating series of further technical fictions to prop each other up (‘grammaticality’, ‘well-formedness’, ‘competence’, ‘ideal speaker-hearer’, ‘homogeneous speech-community’, ‘deep structure’, and all the rest) that are not merely unconfirmed by real data but programmatically opposed to real data. The prospect today is not merely that no formal description of ‘language by itself’ has yet attained adequate coverage, convergence, and consensus for any natural language, but that no such theory ever will. In the long run, the apparent advantages of linguistic formalism (stability, determinism, rigour, visual clarity, impressive notations) and the privileges it confers (to invent and judge your own data, to do science without leaving your desk, and to escape the rich and messy contexts of human interaction) all turn out to be liabilities for achieving even its own carefully circumscribed tasks. Such a formalism relegates us to a shadowy world of formulas and arrays whose determinacy is financed by their indeterminate relation to the language data they purport to represent.

 

B. The impact of ‘large-corpus linguistics’

 

20. I have briefly retraced the theoretical evolution of ‘mainstream’ linguistics in section A in order to indicate how the early programme of describing ‘language by itself’, detached from world and society, has favoured a linguistic formalism that turned away from real data and eventually blocked further progress in coverage, convergence, and consensus, without which we cannot attain a complete and valid description of any natural language. The growing awareness of this impasse has led to a diversification within linguistics that has edged formalism gradually out of its ‘mainstream’ and majority position. The brands of linguistics going under such designations as ‘functional’, ‘systemic’, ‘applied’, ‘cognitive’, ‘computational’, and ‘critical’, along with some adjunct domains such as ‘discourse analysis’ and ‘discourse processing’ (which seldom aspired to be part of linguistics), all share the enterprise of resituating language in its cognitive and social contexts, reassembling, as it were, the ‘layer cake’ of language interfaced with world and society (§ 2).

21. As the conventional division between ‘language by itself’ versus ‘language in use’ has been progressively narrowing, we have found that real data are not plagued by the lack of ‘discoverable unity’ that, Saussure vowed, would prevent us from ‘putting speech into any category of human facts’ (§ 1); nor do they ‘consist of the fragments and deviant expressions’ that justified Chomsky’s retreat from ‘the actual use of language in concrete situations’ (§ 16). Instead, real data reveal an unexpectedly high degree of precision and clarity, though not necessarily in the modalities that mainstream linguistic theories would easily recognise.

22. This finding has been most profoundly assisted by the advance of technology, placing within our reach a new source of data that dramatically enhances the prospects for coverage, convergence, and consensus. The key technical innovation is the large computerised corpus of data from actual texts and discourses, such as the ‘Bank of English’ (hereafter ‘BoE’ for short) developed at Birmingham University by John Sinclair and his team. I took the data described below from the BoE in July 1994, at the stage when it had reached the size of some 200 million words of running text from contemporary spoken and written sources, including: British and American books; newspapers (Times, Independent, Guardian, Today, Wall Street Journal, New Scientist, Economist); magazines (e.g., Esquire, Good Housekeeping); ‘ephemera’ such as letter-box mailings (e.g., YMCA appeal for homeless people, Friends of the Earth Tropical Rainforest Campaign), radio broadcasts (British Broadcasting Corporation in the UK and National Public Radio in the US); and recordings of conversations.[1] The coverage by so large a corpus might validly claim to be representative, though it is certainly not complete and is very far from ‘infinite’. Yet paradoxically, it has itself made us aware of the ways in which it is yet too small (§66ff).

23. Still, as a sample of contemporary English usage, the coverage exceeds previous sample sizes by various orders of magnitude, such as: the previous 20-million word corpus used for the 1987 Collins COBUILD English Language Dictionary (by 1 order of magnitude); the 1-million word Survey of English Usage at University College London (by 2 orders of magnitude plus doubling); the 2000-word fragments in the Brown University corpus (by 5 orders of magnitude); and the 24 invented sentences analysed or ‘transformed’ in Chomsky’s Aspects (by 7 orders of magnitude).[2]
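The comparisons in this paragraph are size ratios read on a base-10 logarithmic scale. A minimal sketch of the arithmetic follows, using only the sizes in words stated above; the comparison with Aspects is left out because it counts sentences rather than words, and the function name is merely illustrative.

```python
import math

BOE_WORDS = 200_000_000  # Bank of English size in July 1994, as stated above

# Comparison sample sizes in words, as stated in the paragraph
samples = {
    "COBUILD 1987 corpus": 20_000_000,
    "Survey of English Usage": 1_000_000,
    "Brown corpus fragment": 2_000,
}

def orders_of_magnitude(larger: int, smaller: int) -> float:
    """Return log10 of the size ratio between two samples."""
    return math.log10(larger / smaller)

for name, size in samples.items():
    print(f"{name}: {orders_of_magnitude(BOE_WORDS, size):.1f} orders of magnitude")
```

The middle case comes out between 2 and 3 because the ratio is 200, i.e. 10 squared times 2, which is what ‘2 orders of magnitude plus doubling’ expresses.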

24. Contrary to what is widely believed, the increase in orders of magnitude does not entail a direct proportionality whereby we just get the same data multiplied by 10, 100, 1,000, and so on, so that if an item appears once in a 1 million word corpus, it appears 20 times in a 20 million word corpus and 200 times in a 200 million word corpus. If that were true, building steadily bigger corpuses would only give the results we could accurately predict from the proportions in a small corpus. But in fact the large corpus offers not just more data but different kinds of data:

 

(a) We find numerous items that did not appear at all in smaller ones.

(b) We can make more informed judgements about relative frequency. Of two items appearing only once in a small corpus, the one might still appear only once in a larger corpus and the other fifteen or twenty times.

(c) The larger corpus will display the data in steadily finer degrees and differentiations of detail. An item which appeared only once in a small corpus may appear in several distinctive variants in a large one.

 

In these ways, each increase in magnitude can reveal hosts of fresh and more detailed regularities that were simply not noticeable before, nor are they readily open to unaided intuition and introspection (§ 27, 52f, 55, 63). They still have to be interpreted, but, in marked contrast to non-corpus linguistic methods, the outcome is quite amenable to convergence and consensus (§ 4, 6, 15, 17ff, 20, 22, 27f, 39, 43, 46, 48, 50, 52-55, 62, 64f, 72-75).
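Points (a) and (b) can be illustrated with a toy simulation. The sketch below uses no BoE data: it assumes a hypothetical vocabulary of 5,000 items whose probabilities fall off as 1/rank (a Zipf-like assumption made purely for illustration), draws a large sample, treats its first part as the small sample, and compares the two.

```python
import random
from collections import Counter

VOCAB_SIZE = 5_000  # hypothetical vocabulary size (assumption)
ranks = range(1, VOCAB_SIZE + 1)
weights = [1 / r for r in ranks]  # Zipf-like probabilities (assumption)

rng = random.Random(0)  # fixed seed so the run is repeatable
large = rng.choices(ranks, weights=weights, k=20_000)
small = large[:1_000]  # the small sample is the first part of the large one

small_counts = Counter(small)
large_counts = Counter(large)

# (a) items absent from the small sample but present in the large one
new_items = set(large_counts) - set(small_counts)

# (b) items seen exactly once in the small sample: how often in the large one?
once_in_small = [w for w, c in small_counts.items() if c == 1]
spread = sorted(large_counts[w] for w in once_in_small)

print("distinct items, small sample:", len(small_counts))
print("distinct items, large sample:", len(large_counts))
print("items new in the large sample:", len(new_items))
print("large-sample counts of items seen once in small:", spread[0], "to", spread[-1])
```

Under strict proportionality every item seen once in the small sample would appear twenty times in the large one. The simulation gives a range of counts instead, and real corpus data add the further differentiation described in (c), which a simulation of this kind cannot show.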

25. Conversely, the corpus shows that examples we might intuitively accept at face value are not typical of actual usage. Our beloved evergreens like those cited in § 13:

 

(3) The man hit the ball.

(4) John is easy to please/eager to please.

(5) The cat sat on the mat.

 

do not appear in the BoE, not because they aren’t properly ‘grammatical’ or ‘well-formed’ English but because they aren’t ‘natural’: typical contexts of real discourse require less simple-minded and peremptory utterances. In the BoE, nobody at all is said to be ‘easy to please’. For ‘eager to please’, three instances appear (6-8), each with a direct object for ‘please’ that was missing in (4) and with more interesting agent-subjects than our insipid friend ‘John’. Even allowing for intervening items, the only combination of ‘man + hit + ball’ was (9); ‘man + hit’ alone returned only (10), where the sense of ‘hit’ adapts to ‘jackpot’. For ‘cats sitting on mats’, the only attestations were derivations from the use of this trite example in schoolbooks or logicians’ debates, e.g. (11-13), rather than being assertions about any real cat.[3]

 

(6) < a government official who is eager to please the wealth goddess >

(7) < the Sandinistas. The government is eager to please the Church >

(8) < show a sociable child who is eager to please or charm those around him >

(9) Yes. Doesn’t that man hit the ball hard?

(10) Where can a con-man hit the biggest jackpot? In politics

(11) On the first page was a drawing of a brindled cat seated on a recognisable mat, the original ‘cat on the mat’ now quoted in derision of an antiquated method of teaching

(12) so if you have <ZF1> a <ZF0> a man on the roof [pause] er erm erm a cat on a mat er a tree on a mountain top a boy sitting on a tree branch these all involve

(13) material-objects statements, ‘There is a cat on the mat’, statements about people in novels, statements of mathematics

 

We shouldn’t regard the grainy details of the real data as a mere obstruction to be filtered out by rarefying and decontextualising (in the sense of § 3). Instead, we should respect the ‘naturalness’ of real data because, unlike the ‘grammaticalness’ or ‘well-formedness’ of the formalists, it has been decided for us by real users of the language (cf. § 63f). We want to account for the ‘competence’ real users not just possess but display when doing this; and there, ‘well-formedness’ has no overriding priority (§ 46, 48, 53, 55).

26. Now, corpus displays are in some sense frankly ‘surface’ data, but, exactly because the data are not severed from their contexts, it is easier to assess what sorts of ‘shallower’ or ‘deeper’ constraints might apply. Even on the surface, a corpus displays to the investigator not just words but collocations, to adopt Firth’s (1968 [1952-1957]: 106ff, 113, 182) well-known term: ‘words’ considered in ‘the company they usually keep’, i.e., typical word combinations that would not usually qualify as idioms or standing phrases (cf. § 31, 33, 52, 55, 60, 66, 69, 77ff). Also, the data can be accessed in somewhat ‘deeper’ ways by means of the search software, so that, for example:

 

(a) The collocation need not be invariant or continuous but may contain varying interposed words (up to 4 in the BoE), e.g. ‘on the mat’ in (5) versus ‘on a recognisable mat’ in (11).

(b) We can sort out words that could belong to more than one word-class, e.g. ‘warrant’ as either a noun or a verb.

(c) We can use uncommitted characters to search for a stem with all its endings, e.g. to compare ‘logic’ with ‘logical’, which turn out to collocate rather differently.

(d) We can make nested sub-displays to zero in on possibly significant combinations in the general display, e.g., to go from ‘warrant’ to ‘warrant + investigation’.
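
The search options in (a), (c), and (d) can be illustrated with a minimal sketch in Python. It is not the BoE software, whose internals are not described here; the function names, the crude tokenising rule, and the small word list for the stem search are assumptions made purely for illustration. The sample wording is taken from (11).

```python
import re


def tokenise(text):
    """Lower-case a line and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def find_collocation(tokens, first, second, max_gap=4):
    """Return (i, j) index pairs where `second` follows `first`
    with at most `max_gap` words interposed, as in option (a)."""
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # Positions i+1 .. i+1+max_gap leave 0 .. max_gap words in between.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] == second:
                hits.append((i, j))
    return hits


def stem_matches(tokens, stem):
    """Return the tokens made up of `stem` plus any (or no) ending, as in option (c)."""
    pattern = re.compile(re.escape(stem) + r"[a-z']*")
    return [token for token in tokens if pattern.fullmatch(token)]


def sub_display(lines, word):
    """Keep only the lines that also contain `word`, as in the nested display of (d)."""
    return [line for line in lines if word in tokenise(line)]


# Wording from sample (11); the list passed to stem_matches is invented.
sample = tokenise("a brindled cat seated on a recognisable mat")
print(find_collocation(sample, "on", "mat"))             # two interposed words: found
print(find_collocation(sample, "on", "mat", max_gap=1))  # gap too narrow: not found
print(stem_matches(["logic", "logical", "logician", "illogical"], "logic"))
```

Option (b), sorting by word-class, is left out because it presupposes a tagged corpus, which this sketch does not model.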

 

27. The most ‘surface’ use of the large corpus is to enable accurate judgements about the frequencies of words or word-combinations — a familiar tactic in ‘computational linguistics’. A far ‘deeper’ and more revealing use of the corpus is to detect tendencies rather than just frequencies, so that we can assess why certain combinations occur and not just how often. Paradoxically, sorting vast quantities of real data allows unexpected convergences to emerge within the regularities underlying this huge variety (cf. § 43f, 72). Among all of the possible combinations of English words and phrases that might be intuitively judged ‘grammatical’, we can finally see which ones are more likely to be realised and at least some of the reasons why.

28. The main challenge now is how to identify and describe the constraints whose effect the corpus-displays allow us to inspect (cf. § 73). The constraints are all functional in the broadest sense, i.e., related to what people do with their language (§ 5); any formality we may distil out is derivative upon that functionality and cannot be consensually accounted for without it (cf. § 54f; Beaugrande, in press). Moreover, functional constraints need not fit neatly into the formal linguistic schemes devised for ‘language by itself’ — not a surprising finding, perhaps, but an immensely significant one (§ 34-46).

29. My demonstration here will be the Bank of English corpus data on the English verb ‘warrant’. The BoE returned a total of 392 lines centring on that key-word as a Verb. To get a more manageable and productive sample, I made a hand-sorted selection of 228 lines by eliminating repetitions, e.g. when a statement by a politician got reported in several media, and false alarms where the key word was actually a noun.[4] Selecting the verb allowed me to disregard the numerous noun occurrences in stock phrases like ‘search warrant’, ‘death warrant’, or ‘warrant for arrest’.

30. The word has a venerable history related, according to Walter Skeat’s (1970 [1879-1882]: 702) Etymological Dictionary of the English Language, to the word ‘guarantee’. As a verb, we find such usages attested in the Oxford English Dictionary (pp. 930ff) as: ‘to keep safe from danger’ (14); ‘to guarantee goods to be of the quality, quantity, etc. specified’ (15); ‘to give a personal assurance of a fact’ (16), ‘chiefly in “I (I’ll) warrant you”’ (17); and ‘to authorise, sanction a course of action’ (18).

 

(14) What good Man was he that from deth warawnted thee? (Henry Lovelich, Merlin, 1450)

(15) This Ryche man thenne sold his oylle to the marchaunts and waraunted eche tonne al ful (William Caxton, The subtyl historyes and fables of Esope, Auyan, Alfonce, and Poge, 1484)

(16) Bot for to lere him I warand, Als mekil als he mai vnderstand (The proces of the seuyn sages, 14th century)

(17) There be many such I warrant you yt neuer cum to light (Thomas More, A dyaloge wherin he treatyd dyvers maters as of the veneration and worshyp of ymagys etc., 1528)

(18) The Lord warrants us to suspect the inconstant (Daniel Rogers, Naaman the Syrian, his disease and cure, 1642)

 

These samples from the 14th to the 17th centuries suggest a gradual widening away from official discourse, and a drift toward the modern usage displayed by the BoE corpus, as we shall see.

31. A first heuristic for identifying the more interesting collocations in the BoE is to list in the order of frequency the most common words within the set of lines returned. Many of those near the top of the list, such as ‘of’ or ‘to’, will seem unenlightening in the early stages, but at least some of the more suggestive words can turn up:[5] among the nouns, ‘evidence’ (21 occurrences), ‘investigation’ (12), ‘trial’ (7), ‘attention’ (9), ‘circumstances’ (8), ‘concern’ (6), ‘mention’ (5), ‘consideration’ (5), ‘punishment’ (5), ‘intervention’ (4), and ‘conditions’ (3); among the modifiers, ‘enough’ (58), ‘sufficient’ (27), ‘serious’ (14), ‘really’ (7), ‘certainly’ (6), ‘important’ (5), ‘severe’ (5), and ‘trivial’ (4).
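
The first heuristic amounts to a plain word count over the returned lines. The following is a minimal sketch in Python, assuming a crude tokeniser and a four-line toy sample whose wording is taken from (22), (23), and two fragments quoted in § 38; the function name and the `skip` parameter are hypothetical, and the counts it yields say nothing about the real 228-line selection.

```python
import re
from collections import Counter

# Toy sample: wording from (22), (23), and two fragments quoted in section 38.
LINES = [
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
]


def word_frequencies(lines, skip=()):
    """List (word, count) pairs in descending order of frequency,
    leaving out any words named in `skip` (e.g. the key word itself)."""
    counts = Counter()
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in skip:
                counts[word] += 1
    return counts.most_common()


for word, count in word_frequencies(LINES, skip={"warrant"})[:5]:
    print(f"{word:15}{count}")
```

As the paragraph notes, function words such as ‘a’ or ‘to’ tend to head such a list; the more suggestive items have to be picked out further down.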

32. A second heuristic is to create a positional frequency table in which the words in the several slots to the left and right of the key word are displayed in descending order of frequency. The table below shows the data for ‘warrant’.

 

 

 

3 to the left  2 to the left  1 to the left  word     1 to the right  2 to the right  3 to the right

sufficient     enough         to             warrant  a               the             of
enough         evidence       not            warrant  the             investigation   the
serious        did            't             warrant  an              a               in
too            do             would          warrant  it              <t>             a
the            does           might          warrant  such            attention       <t>
and            not            really         warrant  any             of              but
that           as             that           warrant  further         action          action
not            didn           yet            warrant  this            trial           <LTH>
sufficient     may            should         warrant  that            with            and
in             doesn          search         warrant  his             and             to
is             nothing        and            warrant  to              more            trial
was            the            will           warrant  some            special         that
it             and            circumstances  warrant  their           even            for
of             seem           arrest         warrant  no              mention         into
which          to             could          warrant  another         intervention    by
but            trivial        can            warrant  my              's              it
good           that           may            warrant  its             it              is
done           will           soon           warrant  for             because         than
<h>            so             'll            warrant  more            than            some
a              small          conditions     warrant  concern         new             as
's             seemed         germane        warrant  officer         an              an
be             they           death          warrant  </h>            further         here
important      appear         certainly      warrant  one             sort            then

 

These data too are at best suggestive, and for much the same reason that purely formal syntax readily becomes convoluted or opaque: many words or word-classes are fuzzy in respect to their mutual positions; and functional relations need not show up as formal ones. The frequent negations (‘not’, ‘-t’, ‘didn’, ‘no’, and, by implication, ‘too’) are scattered over four positions (cf. § 36, 42). And some of the most revealing data don’t appear at all, either because their position isn’t consistent enough, e.g. ‘situation’; or because a shared semantic concept is lexicalised in various ways, e.g. ‘disability - distress levels - ill health - medical problems’.
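
How a positional frequency table of this kind can be compiled is sketched below in Python. This is an illustration under stated assumptions, not the BoE routine: the function name, the tokenising rule, and the four-line toy sample (wording from (22), (23), and two fragments quoted in § 38) are all hypothetical stand-ins.

```python
import re
from collections import Counter, defaultdict

# Toy sample: wording from (22), (23), and two fragments quoted in section 38.
LINES = [
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
]


def positional_frequencies(lines, key, span=3):
    """For each slot from `span` words left (-span) to `span` words right (+span)
    of `key`, list the words found there in descending order of frequency."""
    slots = defaultdict(Counter)
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        for i, token in enumerate(tokens):
            if token != key:
                continue
            for offset in range(-span, span + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    slots[offset][tokens[j]] += 1
    return {offset: counter.most_common() for offset, counter in slots.items()}


table = positional_frequencies(LINES, "warrant")
for offset in sorted(table):
    print(offset, table[offset][:3])
```

Because each slot is counted separately, an item whose position varies from line to line is spread thinly across several columns, which is the limitation noted above for the negations and for ‘situation’.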

33. A third heuristic offered in the BoE software sorts the lines by the alphabetical order for a given position to the left or right of the key word. This tool works best in bringing out data about items whose position is relatively fixed, e.g. the extreme frequency of ‘to’ in the infinitive (top item before ‘warrant’ in the positional frequency table). But user-performed hand-sorting is needed for groupings wherein the essential items and collocations occupy more flexible positions, e.g. ‘serious’; or where groupings are to be made by semantic criteria, e.g. ‘investigation’ with ‘inquiry’. I worked out three hand-sorted displays and added bold italics to highlight the items that I chiefly relied on while doing the sorting and alphabetising:[6] one for what does or doesn’t do the ‘warranting’, one for what is or is not ‘warranted’, and one for the relevant criteria. Samplings from these three displays are given in Appendices A, B, and C.
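
The alphabetical sorting by position can be sketched as follows in Python. The function names and the toy lines (wording from (22), (23), and two fragments quoted in § 38) are assumptions for illustration only; the hand-sorting by semantic criteria described above is precisely what such a routine cannot do.

```python
import re

# Toy sample: wording from (22), (23), and two fragments quoted in section 38.
LINES = [
    "as a major threat sufficient to warrant a pre-emptive strike of their own",
    "stories of ill health that appear to warrant surgical intervention",
    "circumstances do not warrant a change in the leadership",
    "circumstances simply do not warrant charitable assistance",
]


def neighbour(line, key, offset):
    """Return the word `offset` places from the first occurrence of `key`
    (negative = to the left, positive = to the right), or '' if there is none."""
    tokens = re.findall(r"[a-z']+", line.lower())
    if key not in tokens:
        return ""
    j = tokens.index(key) + offset
    return tokens[j] if 0 <= j < len(tokens) else ""


def sort_lines(lines, key, offset):
    """Sort the lines alphabetically by the word in the chosen slot around `key`.
    Python's sort is stable, so lines with the same neighbour keep their order."""
    return sorted(lines, key=lambda line: neighbour(line, key, offset))


for line in sort_lines(LINES, "warrant", -1):
    print(line)
```

Sorting on the slot immediately to the left groups the ‘not warrant’ lines ahead of the ‘to warrant’ lines, which is the kind of fixed-position pattern the tool brings out best.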

34. These displays begin to reveal the various types of constraints. Some constraints might be provisionally stated according to the familiar schemes of different ‘levels’ or ‘components’ of ‘mainstream linguistics’. For phonology, the intonation would be distinctive for the performative ‘warrant’ in relatively rare locutions like ‘I’ll warrant’ used when you want to indicate you feel sure about something though you can’t point to actual facts (cf. § 64):

 

(19) If I had ten thousand men like him tomorrow then I warrant we’d see Napoleon beat by midday [quoting the Duke of Wellington.]

(20) The soil may look innocuous enough when you’ve dug it over but I’ll warrant it’s teeming with root-eating wireworms.

(21) I’ll warrant I even heard Honey Bane shuffling by somewhere in the background of a song that will provide the perfect soundtrack for when your mum won’t let you out of your room until you’ve done your homework.

 

A sample like (21) looks quite complex (with quadruple ‘embedding’) in comparison to the usual invented sentences like (3-5) in § 13 and 25, but in actual discourse it should present no difficulties for comprehension, even for the young and not very intellectual readers it addresses.

35. For morphology, we might note the overwhelming frequency of non-finite forms, either in infinitives with ‘to’ (136 occurrences) or with some modal verb (58) (cf. § 42). Also, several Latin/French-based prefixes among the semantic processes may be significant: ‘ad-’ for moving toward something: ‘action, appeal, appellation, assistance, attention’; ‘com-’ or ‘con-’ for acting, happening, or bringing together: ‘collection, commitment, complaints, conclusion, conditions, consideration, conspiracy, consultations’ plus the Anglo-Saxon ‘with-’ in ‘withdrawals’; ‘de-’ and ‘dis-’ for uncovering or invalidating something: ‘declines, definition, developments, disability, distress’; ‘e-’ or ‘ex-’ for getting outside: ‘event, evidence, examination, exclusion, expansion, expenditure, extension’, plus the Anglo-Saxon ‘out-’ in ‘outburst’; negating ‘im-’ or ‘in-’ for something that is not as it should be: ‘impropriety, indeterminate, insufficient’, plus the Anglo-Saxon ‘un-’ in ‘uncharacteristic, uncovered, unimportant, unorthodox, unsatisfactory, unspecifiable, untutored’; ‘in-’ and ‘inter-’ for getting inside or between: ‘inquiry, interception, interference, intervention, introducing, investigation into war crimes, inclusion in the wheelchair, internal matters that warrant no outside interference’; ‘re-’ for following up or going back toward something previous: ‘recession, record, recording, relaxation, relief, respect, response, retaliation, retrospective, return, revelations, revision’ (cf. § 41).

36. For syntax or ‘grammar’, we could note the extreme dominance of third person subjects (224 occurrences), as opposed to just 4 in first person (compare samples (19-21)) and none at all in the second person; and, within the third person, the mere handful of pronoun subjects ‘he’ (6 occurrences), ‘she’ (0), ‘they’ (5), and ‘it’ (7), as contrasted with the large numbers of noun subjects (§ 42). Or, we might note the high proportion of negations attached to the verb: ‘not, don’t, didn’t, not yet, hardly, not really’ (cf. § 32, 42).

37. For semantics, we could note that many of the subjects and direct objects fall into associative classes that are not unduly hard to label, e.g.:

 

(a) as subjects: actions: ‘achievement, aggressions, behaviour, blow, brawl’; resources: ‘abilities, acreage, growing area, scrappable cars’; knowledge: ‘evidence, information, perception, scientific authority’; messages: ‘accusations, complaints, juicy stuff, message, piece of tittle-tattle, revelations’; problems: ‘air leaks, ambiguity, antitrust conspiracy, casualty rate, chilly old homes, degenerating trees, disability, discriminatory practices, distress levels, food shortage, ill health, impropriety, job bias, slowing in the economy, violence’;

(b) as direct objects: (in)appropriate reactions: ‘(further) action, change, commitment, conclusion, consideration, expansion, extension, formation, increases, motion, (cautious) move, plan, step, signing, treatment’; consumption of resources: ‘cost, expenditure, loss of any troops’ lives, overeating, paying the steeper taxes, shelf-space’; messages: ‘apology, appellation, billing, briefing, brochure, column inches, comment, description, footnote, mention, phrase, satire, serious talk, suggestion, talking-to’; knowledge-gathering: ‘airing, attention, consultations, examination, hearing, inquiry, investigation, retrospective survey, review, [legal] trial, [medical] trials’; solving problems: ‘answering machine, (charitable/economic) assistance, breaking the embargo, easing of interest rates, full-time custodian, guests wearing thermal long johns, intensive care, introducing more elaborate feeding, (professional/prompt/surgical) intervention, making peace, mid-season break, opening of a new peat extraction plant, revision, sending of those supplies, using these drugs’; retaliating: ‘banning the show, charge(s), God’s anger, jail time, lengthy ban, massive American retaliation, penalties, pre-emptive strike, (criminal) prosecution, (capital) punishment, retribution, [legal] trial’.

 

Such groupings overlap, since a broad category like ‘(in)appropriate reactions’ can reasonably include narrower ones like ‘knowledge-gathering’, ‘problem-solving’, and ‘retaliating’. Still, we can make a modest ‘semantic table’ showing the typical correlations between subject-groupings and object-groupings, e.g.:

 

                subject-groupings       object-groupings

                actions                 (in)appropriate reactions
                resources               consumption of resources
                messages                messages
                knowledge               knowledge-gathering
                problems                problem-solving

 

It seems plausible that a given parallel across our columns might show up in the data on the same line, as we see at once for ‘evidence’ (knowledge) plus ‘investigation/trial’ (knowledge-gathering). But this co-occurrence of semantic groupings on one line is by no means a rule. We can also have, say, an action as subject and a message about it as direct object, e.g. when an ‘operation warrants a middle-of-the-night briefing’. Or, the context for one grouping may imply another, as when knowledge-gathering in legal contexts implies a retribution, e.g. the condemnation and punishment likely to follow upon a ‘trial’. Or again, some people consider legal punishments a type of problem-solving, despite the scant evidence that the ‘problem of crime’ is being solved in this way.

38. The constraints of context soon impel us beyond the customary borders of semantics. An abstract scheme of ‘semantic features’ would presumably suggest making a separate class for general nouns, some of which appear as frequent subjects in our data: ‘behaviour, circumstances, conditions, contemporary events, incident, occasion, operation, qualities, situation’. But none of these remains general in the context. Most of them carry a pejorative implication, i.e., that the ‘behaviour, circumstances’, etc. involve some problems. If we read that ‘circumstances do not warrant a change in the leadership’, we can assume that one or more ‘leaders’ do not seem to have been acting as they should and that somebody wants to reassure us. Or, if we read that ‘circumstances simply do not warrant charitable assistance’, we can assume some people are in financial difficulties while other people with money are, in the finest Tory tradition, excusing themselves from helping out.

39. We can see here a major difference between conventional abstract semantics and corpus-driven semantics, one which Sinclair (1994) has pointed out. Most of what passes for generality, vagueness, or ambiguity in the meaning of language and impels semanticists to build finicky sets of rules to eliminate it evaporates when we look at suitably sorted real data. So we may well feel uneasy about approaches that expressly declare it the job of semantics to ‘disambiguate’ sentences or sequences that allow for more than one interpretation (§ 43). Quite plausibly, the ambiguity is largely an artefact of using isolated and invented data. We might recall here the contrast between invented simple sentences like (3-5) in § 13 and 25 and authentic and elaborate real data such as (19-21) in § 34. Again, trying to filter language to the point of enabling a formalist description erodes the constraints that are urgently needed for convergence and consensus (cf. § 4, 6, 15, 17ff, 20, 28).

40. For pragmatics, finally, we could note the explicit performative ‘warrant’ when the speaker is also the subject, as in (19-21) in § 34. Less explicit but far more common and influential is the pragmatic force entailed in declaring what does or does not ‘warrant’ what. This force carries the implication that the event or state of affairs that might do the ‘warranting’ is in some way unusual or significant enough that a reaction might well be in order, and that those who might be expected to do the reacting are likely to say why or why not they are going to, and how. Accordingly, the speaker — or, when the discourse is reported, the originator of the message — is likely to be a person who represents some institution or authority, and our data suggest what kind: government, judiciary, military, sports, business, science, and medicine. Or if the person does not, then the use of ‘warrant’ implies a subtle signal that authority is being claimed anyhow; we see this use among journalists and media persons when they are not reporting what other people said. Uses like ‘the Chevrolet Beretta does not warrant particular mention’ or ‘the documentary wouldn’t warrant more than a 4’ are inconsequential magisterial pronouncements merely aping genuine authority with real consequences, e.g., medical judgements about whether ‘problems warrant surgery’ or ‘drugs’.

41. I have followed through the familiar linguistic ‘levels’ or ‘components’ to suggest that each of them contributes a set of constraints on the verb ‘warrant’. But taken by itself, each set is weak and some may seem unduly speculative. For example, citing the frequency of prefixes as morphological units (§ 35) might seem to be overinterpreting merely coincidental or antiquarian materials, were it not for the semantic and pragmatic constraints indicating that ‘warranting’ often does involve situations in which people act together (viz. ‘commitment, complaints, consideration, conspiracy, consultations’); or where something is not what it should be (viz. ‘impropriety, insufficient, unimportant, unorthodox, unsatisfactory, untutored’); or where people want ‘inside’ knowledge (viz. ‘inquiry, investigation’) or want to break ‘in’ on the chain of events (viz. ‘interception, interference, intervention, introducing’); and so on. Suggestive too are some less frequent semantic combinations, e.g., that ‘assistance’ and ‘assistant manager’ both appear as ‘warranted’ solutions to problems. The question of whether such accumulations or combinations reflect the design of the language or the speaker’s choice still needs to be determined; but without the corpus data display, we wouldn’t have occasion to pose the question at all.

42. Considering pragmatics clearly helps in appreciating the significance of several ‘grammatical’ or syntactic accumulations. Foremost among these is the high frequency of negations (§ 32, 36), signalling how often the potential reactors feel impelled to declare that a predictable or reasonable reaction will not take place. Or (to include morphology here), the frequency of infinitive forms reflects the specification of the criterion for making such a declaration, e.g., that things are ‘too small, trivial’ etc. or ‘not serious, severe, etc. enough’ ‘to warrant’ something. Or again, the frequent modal verbs like ‘may’ (14), ‘must’ (11), ‘would’ (10), ‘will’ (7), ‘might’ (5), ‘should’ (4), ‘can’ (3), ‘could’ (3), and ‘shall’ (1) in a total of 58 lines, plus ‘seem’ (8) and ‘appear’ (2), all have the function of attenuating the pragmatic force and conceding that other people might reach different conclusions about the ‘warranting’. The same function is at stake in the use of interrogatives, as in ‘Did he warrant the harsh punishment of exclusion?’; and of dependent clauses with the force of interrogatives, as in ‘specify what kind of cases would warrant capital punishment’. Or again, the low number of personal pronouns (§ 36) as subjects reflects the semantic and pragmatic constraint that actions and situations are more likely to be said to ‘warrant’ something than people are.

43. When we are describing real data, the interaction between semantic and pragmatic constraints is often so intense that there are only weak indicators of which is which. How can we, say, keep our semantic understanding of a general noun like ‘circumstances’ apart from our pragmatic understanding of the force entailed? The constraints from knowledge of world and society, which ‘mainstream’ linguistics sought to detach from the constraints on language (cf. § 2, 6, 11f, 14, 19f), are absolutely crucial for interpreting such data, but are by no means easy to formalise as ‘rules’ (§ 56). We appear to be dealing with numerous local interactions among constraints that support sophisticated higher-level organisation, as in a complex system with distributed parallel processing (cf. Rumelhart, McClelland, et al. 1986; Beaugrande, in preparation). What appears to be a single constraint in an actual context might rather be a pattern of such interactions. If so, the standing internal constraints upon the language, e.g. that the English infinitive be formed from ‘to’ + non-inflected verb, are like the ‘frozen islands’ in a complex system and continually interact with emergent external constraints from world and society during discourse, e.g. that something is or is not ‘warranted’ by a combination of situation (e.g. ‘circumstances’, ‘conditions’) + sufficiency (e.g. ‘enough’, ‘sufficient’) + gradable modifier (e.g. ‘serious’, ‘severe’) (cf. § 46, 53, 68). This interaction supports a convergence among the various modes of data and a consensus among speaker and hearer or writer and reader. If, as formalist linguistics sought to do, we detach language from the constraints from world and society and retreat from real data, the emergent constraints get diluted or lost, and we face the awesome task of trying to ‘freeze’ the entire system: a sort of ‘cryogenic linguistics’ building a ‘cryogenerative grammar’. Convergence and consensus recede, and the data begin to appear vague and ambiguous, sending us off in search of complicated formal rules which, being devised in a relative vacuum, are naturally arbitrary and ponderous (cf. § 39).

44. Moreover, the emergent external constraints may be quite flexible about formal positions. They can generate rich strands of semantic relatedness among items at various locations in the sequences showing up in our data lines. In some lines, we encounter items together that might be said to belong to the same semantic field, e.g. ‘chilly - thermal’, ‘economy - interest rates’, ‘shortage - embargo’, or ‘slowing - easing’. In other lines, we find the ‘attraction’ of a specific item constraining a general one. In ‘forward attraction’, the specific comes first and specifies the general after it, e.g., in ‘alcohol - taxes’ (hence not value added taxes), ‘degenerating trees - specialist’ (hence not an eye specialist), ‘medical - drugs’ (hence not psychedelics), ‘violence - security’ (hence not a bond), ‘worshippers - huge edifice’ (hence a church or shrine). In ‘backward attraction’, the general comes first and gets specified further on, e.g. in ‘declines - recession’, ‘operation - intensive care’, or ‘sites - custodian’; in cases like ‘air leaks - military interception’ and ‘inclusion - wheelchair’, the specific emergent constraints run counter to the standing constraints on the general item, i.e., an ‘air leak’ being in a sealed container, or ‘inclusion’ being ‘making something part of a larger thing’ (Collins COBUILD English Language Dictionary, p. 736). In either direction, the formal distance between the items can vary quite freely.

45. Should these data be considered ‘purely semantic’ when so much depends on our pragmatic knowledge of the situations in which people say that things are or are not ‘warranted’? Should uses like ‘air leak’ and ‘inclusion’ be classed as semantically deviant or deficient because they go against the standing constraints, even though we can readily understand them if we consider the speaker’s motivations, e.g. to arouse the impression that a ‘no-fly zone’ in a war is virtually air-tight, or to avoid a more usual but harsher term like ‘confinement’? Should we devise ‘semantic rules’ that first compute the typical meaning and then go on to compute the deviant meaning? How about cases where the data seem plainly misleading, e.g.:

 

(22) < as a major threat sufficient to warrant a pre-emptive strike of their own. >

(23) < stories of ill health that appear to warrant surgical intervention. Frequently >

 

This ‘major threat’ in (22) differs from the standing constraints on the familiar speech act of ‘threatening’ in that the agent may have done or said nothing implying any intention to cause harm. Yet our social knowledge is quite familiar with the high-tech jargon from the age-old military and political discourse that disguises aggression as defence. Or, the ‘appear’ rather than ‘appears’ in (23) oddly suggests that surgery is to be performed on ‘stories’ or ‘story-tellers’ rather than on the people in ‘ill health’; but world-knowledge prevented both the text producer and the news editors from noticing this suggestion.

46. The overall conclusion would be that the familiar linguistic ‘levels’ or ‘components’ are designations not for neatly distinct sets of formal abstract data but for sets of functional standing constraints operating across sets of real data and generating emergent constraints. Since this process supports the convergence among the various modes of data and the consensus among speaker and hearer or writer and reader (§ 43), a linguistic description can itself attain convergence and consensus not just by sorting data into separate piles, one for each set, but by assessing the interactions among these sets (§ 50). Even my brief demonstration should suffice to show that the form of the data may seem highly variable and at times utterly idiosyncratic unless we continually examine the relevant functions. Formulating ‘formal rules’ that draw a rigorous border between what can ‘warrant’ what versus what cannot in any ‘well-formed’ English sentence only leads to finicky debates over examples and counter-examples and misrepresents the ‘competence’ of English speakers (§ 53). They do not know what can and cannot be ‘warranted’ once and for all, but they do know what sorts of things people are likely to say are or are not ‘warranted’ and why; and that is the knowledge put to use by the people who produce and understand real data.

 

C. Some implications of corpus linguistics for linguistic theory

 

47. Our situation today recalls a complaint once voiced by Saussure (1966 [1916]: 106): ‘It is one thing to feel the quick, delicate interplay of units and quite another to account for them through methodical analysis’. Corpus data reveal far more numerous and more ‘delicate interplays’ than Saussure, with his deep mistrust of ‘actual speech’ (§ 1), could have imagined, and they are pressuring us to develop suitable methods of analysis and a more functional and realistic theoretical ambience (cf. Baker et al. [eds.] 1993). In this final section, I shall explore some factors bearing upon such a theoretical ambience and relate them to the theoretical problems aired in section A.

48. Against the backdrop of my forceful articulation of these problems, it may seem odd if I sound optimistic. But the chances for ‘mainstream’ linguistics to make major pr