Corporate Bridges ’Twixt Text and Language:

Twenty Arguments against Corpus Research

And Why They're a Right Load of Old Codswallop

 

Robert de Beaugrande

 

 

Every time a new sockdolager of a word come along and I learnt where she orter fit in to make sense it kind o’ tickled me all over.

— Don Marquis, Danny's Own Story

 

A. ‘Language’ versus ‘data’: bridged or unabridged?

1. The mild pun on ‘corporate’ in my title may be tolerated for supplying a handy adjective — ‘relating to corpora’ — whilst acknowledging the ‘corporate’ funding of research in its justified anticipation of new reference works. Now that a whole generation of such works has transformed the commercial market, the signals seem clear enough for a methodological revolution in our practices, such as compiling dictionaries for the purposes of learners of English; but the signals are far from clear for a scientific revolution in our theories, such as defining ‘language’ for the purposes of linguists, whether theoretical or applied (§ 12, 34, 166).

2. These mixed signals are not too surprising. Since the first emergence of the academic discipline, linguistic theory has been beset with substantive problems regarding the question of whether and how to define and describe ‘language’ through some bridge to an actual or potential source of data in texts (survey in Beaugrande 1991). For a retrospective overview, the alternative outlooks might be summarised in these terms:

2.1. Language is best represented by the largest and broadest corpus of authentic data that can be collected and described. This view is prominent in fieldwork linguistics (e.g. Longacre 1964 [1958]) with its ties to ethnography (language as culture), and is urgently needed when working with a language which the linguists do not know and which has not been previously described (§ 32).

2.2. Language can be represented by such a corpus, but doing so is not obligatory, and can be supported by practical shortcuts with non-authentic data, assuming that the same results would be obtained with authentic data. This view is prominent in descriptive linguistics (e.g. Bloomfield 1933) with its ties to behaviourism (language as habit), especially when working with a language which the linguists do know and which has been previously described.

2.3. Language need not be described from a corpus at all; linguists can safely rely on their own intuition and introspection as native speakers to supply them with data. This view is prominent in generative linguistics (e.g. Chomsky 1965) with its ties to idealist philosophy (language as mind). Here, linguists can work only with a language they know.

2.4. Language is an abstract, ideal system not directly manifested in data, and so must be deduced by formal or logical means. This view is prominent in glossematics (e.g. Hjelmslev 1969 [1943]) with its ties to formal philosophy (language as calculus). Here, linguists can work without reference to any particular language, e.g., in research on ‘universal grammar’.

2.5. Language is a delicate system menaced by errors and abuses, and so must be described as it ought to be used rather than as it is. This view is prominent in prescriptive linguistics (e.g. Alford 1864), with its ties to social elitism and ‘conservative’ politics (language as refinement). Linguists work only with a carefully purified version of a language they know very well.

These outlooks roughly fall along a parameter where a bridge from language to authentic data is expressly required at one pole, and expressly dismissed at the other pole. Only the prescriptive outlook juts out awkwardly, accepting data but only if they are certified to be ‘correct’, ‘educated’, ‘elegant’, and so forth. Many of its adherents are not accredited linguists, and might better be called ‘language guardians’. Even so, it is at least implicitly the dominant outlook among the general population and has engendered a gallery of self-serving pot-boiler handbooks with cautionary titles like 1001 Pitfalls in English Grammar.

3. We could plausibly predict that these outlooks will produce respective descriptions of language that differ substantially from each other; and that the differences within a single outlook will be more substantial wherever a theory is not bridged to authentic data — like a house of cards with no driveway, or a castle in the air with no drawbridge. ‘Theories’ will abound where they can be devised out of whole cloth, like clothing fads in fashion, by strenuous theoretical bootstrapping (§ 33).

4. In the purview of science at large, linguistics markedly stands out for its periodic resolves to get by without data. Perhaps daunted by a vision of language data being ‘unabridged’ like the contents of those massive dictionaries, some linguists have expressly devised or embraced theories to show why the discipline need not, in principle, sustain a bridge between theory and data. Saussure (1966 [1916]: 9, 11) already asserted that ‘speech cannot be studied’, ‘for we cannot discover its unity’; it is only a ‘heterogeneous mass’ of ‘accessory and accidental facts’. In the same vein, Chomsky (1965: 4, 201) later asserted that the ‘observed use of language’ ‘surely cannot constitute the subject-matter of linguistics, if this is to be a serious discipline’; ‘from the standpoint of the theory’, ‘much of the actual speech observed consists of fragments and deviant expressions of a variety of sorts’. And both linguists found a large and ardent following.

5. The key qualifier here is ‘from the standpoint of the theory’ — the theory is what tells us our data are ‘deviant’. Although Chomsky was purportedly discussing the ‘theory constructed by the child’ learning a language (1965: 201), the real constructor implied here was surely the generative linguist. One curious consequence of this outlook is that both the child learning the language and the linguist describing it would be working against the grain of available data and in despite of the ‘actual speech observed’. Conversely, any description of a language directly based upon and confirmed by data would be inadequate a priori, irrespective of the size and sources of the sample.

6. This consequence is reflected by at least four central precepts in the generative outlook on ‘linguistic theory’: (1) children succeed by means of an ‘innate language acquisition device’; (2) ‘unquestionable data’ are to be produced and judged by the linguist’s own ‘intuition’ as a ‘native speaker’; (3) language data need to be ‘transformed’ or ‘formalised’ in order to be investigated scientifically; (4) language should be described by analogy to a more abstract and formal system, such as ‘context-free grammar’ (see Beaugrande 1998, 2001a, for detailed documentation). Each precept plots the circuitous, evasive routes between theory and data after the direct bridge has been closed off.

7. A second and even more curious consequence is that the production of data in a real speech community would resemble a ‘catastrophe’ — a ‘sudden violent change representing a discontinuous response of a system to smooth changes in the external conditions’ (Arnold 1984: 2). The speaker accesses order and equilibrium (language), transforms it into disorder and disequilibrium (speech) and transmits the data to the hearer, who transforms them back into order. Communication would be highly ‘noisy’ in the sense of electrical engineering; and perilously ‘far from equilibrium’ in the sense of complexity theory. Ambiguity and vagueness should abound (cf. § 122).

8. This second consequence makes a difficult and perilous enterprise out of ordinary language use, and not just of learning or describing the language. But the consequence has been evaded by an expedient inconsistency. On the one hand, the linguist was declared to command an ‘enormous mass of unquestionable data concerning the linguistic intuition of the native speaker, often himself’ (Chomsky 1965: 20). On the other hand, the ‘speaker of a language’ was declared incapable of being or becoming ‘aware of the rules of the grammar’, so his ‘reports and viewpoints about his behaviour and competence may be in error’ (1965: 8). We thus come to a third curious consequence: generative linguists must command a superhuman rationality for ‘becoming aware of and reporting’ what real speakers cannot — and are thus the humans best empowered to reveal the ‘competence’ of the ‘ideal speaker-hearer’ in a ‘completely homogeneous speech-community’ who ‘knows the language perfectly’, which Chomsky (1965: 3) has famously vowed to be the ‘primary concern’ of ‘linguistic theory’.

9. The downside of this empowerment is that such linguists seem qualified to do only that. Halliday (1984: 51) has accordingly critiqued the ‘assumption’ that ‘the only job for which a professionally trained linguist was fitted is to go back and train more linguists’ ‘in a university linguistics department’, ‘insulated from the real world’. Those departments feel entitled to ‘dismiss questions raised by “non-linguists”’ — i.e., ‘the rest of humanity’ — who are deemed ‘prejudiced and ill-informed’; yet ‘behind the questions lies a concern with real issues of social value’ and ‘effective communication’.

10. The privileged status of professional linguists was at all events implicitly claimed by their common practice of inventing their own data to illustrate their particular notions, such as the evergreens [1-3].

[1] The man hit the ball.

[2] John is eager to please.

[3] John is easy to please.

Samples like [2] and [3] were designed to show that sentences apparently having the same surface structure differ in their ‘underlying’ structure (John pleasing or getting pleased). Other samples were designed to accentuate the ‘distinctions between well-formed and deviant’, ‘corresponding to the intuition of the speaker’ (Chomsky 1965: 24), e.g., [4] versus [5] (Chomsky 1957: 42, 78).

[4] John admires sincerity.

[5] John frightens sincerity.

But a subtle objection might be raised here. The task of distinguishing between events and non-events is surely an oddity for any science, insofar as a set of non-events lacks any systemic organisation. Nobody seriously expects meteorology to show why rain doesn’t fall upwards; or geography to explain why the earth is not flat; or astronomy to prove that the earth is not the centre of the universe; these sciences explain real and possible events rather than the impossible non-events. Yet generative linguistics seemingly proposed to explain why ‘grammars’ preclude wildly ungrammatical or ill-formed data. Surely the set of such linguistic non-events would be the true ‘heterogeneous mass’ of ‘accessory and accidental facts’ that we have seen Saussure imagining to comprise the real events of ‘speech’ (§ 4) — and may actually be infinite, which a set of real events never is (cf. § 65ff).

11. Working with large corpus data for the past seven years has impelled me to grasp an even more subtle objection. Much of the data presented as non-deviant — as ‘grammatical’, ‘well-formed’, and so on — also do not occur as real events. Not even in the Bank of English (BoE), the world’s largest corpus of authentic texts, nor in the British National Corpus (BNC), did I find a single occurrence of samples [1] through [5]. Such trivial data are evidently possible but not probable, and this factor might tell us something significant about human language. For example, speakers or writers of English don’t just say that somebody ‘admires sincerity’, full stop; but rather that they ‘admire the sincerity’ of a particular person on a particular occasion, as when desperate Valancourt confessed to Emily St. Aubert that he was ‘irreparably ruined’ by his ‘debts’:

[6] Emily, while she was compelled to admire his sincerity, saw, with unutterable anguish, new reasons for fear in the suddenness of his feelings (Mysteries of Udolpho)BAWC [= British and American Writers Corpus data]

And we don’t normally just say ‘the man hit the ball’ but a deal more about who, why, and how:

[7] Leconte, by contrast, hit the ball with the joy of a player savouring rare moments free of physical pain and won, 6-4 (Independent)BNC [= British National Corpus data]

[8] A back pass from player-manager Hoddle seemed to catch Hammond by surprise and the goalkeeper hit the ball straight to the feet of Posh’s Disappointed Swindon boss. (Today Sports Page)BNC

[9] All of life, as we know it, moves in little, unavailing circles. More justly than to anything else, it can be likened to the game of baseball. Crack! we hit the ball, and away we go. If we earn a run in life we call it success (Whirligigs)BAWC

[10] I ’lowed I’d knock that durned little ball way over into the next county. So I rolled up my sleeves and spit on my hands and got a good holt on that war club and I whaled away at that little ball agin, and by chowder I hit it. I knocked it clear over into Deacon Witherspoon’s pasture, and hit his old muley cow, and she got skeered and run away, jumped the fence and went down the road, and the durned fool never stopped a-runnin’ till she went slap dab into Ezra Hoskins’ grocery store, upsot four gallons of apple butter into a keg of soft soap, and sot one foot into a tub of mackral, and t’other foot into a box of winder glass (Uncle Josh’s Punkin Centre Stories)BAWC

Further on, I shall suggest that non-authentic data are distinguished by static predictability, and authentic data by a dynamic tension between predictable and unpredictable (§ 15, 73, 79ff, 129). A bit paradoxically, data like [1-4] are so strenuously made to seem probable that they flip over to improbable.
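The kind of attestation check described in § 11 can be pictured with a minimal sketch. It is an illustration only, not the procedure actually used on the BoE, BNC or BAWC: the names `CORPUS`, `concordance` and `occurs_as_whole_sentence` are hypothetical, and the 'corpus' is a toy stand-in built from three snippets quoted above.

```python
import re

# Toy stand-in for a corpus: three snippets quoted in the text above.
# A real study would query the BNC or BoE through their own tools.
CORPUS = {
    "Independent (BNC)": "Leconte, by contrast, hit the ball with the joy of a player "
                         "savouring rare moments free of physical pain and won, 6-4",
    "Whirligigs (BAWC)": "Crack! we hit the ball, and away we go. "
                         "If we earn a run in life we call it success",
    "Mysteries of Udolpho (BAWC)": "Emily, while she was compelled to admire his sincerity, "
                                   "saw, with unutterable anguish, new reasons for fear",
}


def concordance(phrase, corpus, width=30):
    """Return (source, left context, match, right context) for each occurrence."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for source, text in corpus.items():
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((source, left, m.group(0), right))
    return hits


def occurs_as_whole_sentence(sentence, corpus):
    """True if the sentence is attested on its own, bounded by sentence breaks."""
    core = re.escape(sentence.rstrip("."))
    pattern = re.compile(r"(?:^|[.!?]\s+)" + core + r"\s*[.!?]", re.IGNORECASE)
    return any(pattern.search(text) for text in corpus.values())


# The collocation is attested in context; the invented sentence is not.
for source, left, match, right in concordance("hit the ball", CORPUS):
    print(f"{source}: ...{left}[{match}]{right}...")
print(occurs_as_whole_sentence("The man hit the ball.", CORPUS))
```

On this toy sample, the phrase 'hit the ball' turns up only inside richer contexts, whereas the bare sentence [1] and the string 'admires sincerity' do not turn up at all. A three-snippet sample proves nothing about English, of course; the point is only the shape of the query.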

12. The prospect impends that linguistics may have set itself a task truly without precedent in the annals of science: defining our object of investigation, namely language, by contrasting two sets of non-events. Not only has no remotely complete ‘explanation’ or ‘grammar’ of this type ever been published for any human language; the task is inherently impossible. If so, the so-called ‘generative revolution’ was more properly an ‘anti-scientific revolution’, and the time has come for a genuine ‘scientific revolution’ that will restore the reality of language (§ 33f).

B. Some data parameters: authentic, rich, literary, academic

13. From the standpoint of corpus linguistics, which, by definition, works with authentic data, the staid dichotomies of ‘langue and parole’ or ‘competence and performance’ are basically irrelevant because, strictly speaking, they imply a dichotomy of non-data versus data insofar as neither ‘langue’ nor ‘competence’ is manifested as data. Saussurians have evaded this implication by a double tracking: they have vowed to ‘deal only with linguistics of language’ [langue], yet to ‘use material belonging to speaking [parole] to illustrate a point’ (Saussure 1966 [1916]: 19), which suggests that data might also occur somewhere else besides in ‘parole’ but doesn’t say where. Chomskyans more expediently ‘assumed that the set of sentences is somehow given in advance’ (Chomsky 1957: 18, 54, 85, 103). To be more precise, that ‘ideal speaker-hearer’ would never say anything at all (which would entrain him in the ‘deviant’ conduct of ‘performance’); he would stand transfixed in rapturous ‘introspection’ upon the ‘infinity’ of ‘well-formed sentences’ hovering in the ‘perfect’ nirvana of his ‘competence’.

14. For us, the most relevant dichotomy should be between authentic data versus non-authentic data: whether the data are attested by actual occurrence in text and discourse. Not surprisingly, this dichotomy has hardly figured in linguistic approaches built chiefly on non-authentic, non-attested data. No doubt the long-standing convention of idealising language has encouraged the notion that we can study it best with idealised data. Yet the very fact that we can intuitively recognise non-authentic data as such should indicate their exceptional status and argue against their being a valid representation of the language. Our parameters would not impose a boundary between ‘grammatical’ versus ‘ungrammatical’ sentences, but would seek to describe the parameters of authenticity among data which are all unquestionably grammatical.

15. One influential parameter here could be termed rich data versus sparse data, where ‘richness’ denotes the potential of a context to determine the meaning of some term. We can finally shelve the projects of linguistics to provide a ‘context-free’ description (still envisioned, say, in Cook 1992; Keenan 1993), e.g. for ‘describing the structure of a sentence in isolation from its possible settings’ (Katz and Fodor 1963: 170). Data actually freed of all contexts and settings would no longer be language nor data, but merely symbol strings of the kind displayed in inscriptions of an undeciphered dead language. The act of recognising language as data is inseparable from the act of imagining ‘possible settings’ (§ 71). Even Katz and Fodor do just that with their non-authentic sparse-context example [11].

[11] The bill is large.

by imagining whether one is dealing with a hefty payment request or a bulky bird. The quest for ‘the structure of semantic theory’ can only defeat itself by maximising sparseness, as if we were required to explain whatever a space traveller might mean who lands on earth, says nothing but ‘the bill is large’, and then is instantly transported to the planet Tatooine by Jabba the Hutt. We need to explain what actual speakers might mean by collocating ‘large’ with ‘bill’, e.g., for an exorbitant charge [12] (the only meaning attested in the BNC); a hefty menu complete with hefty prices [13]; a banknote of high denomination [14]; or a wall poster [15].

[12] Budgeting loans are not available for gas and electricity bills. If you have a large bill which you cannot pay you may be able to go on the fuel direct scheme (Age Concern)BNC

[13] The large bill of fare held an array of dishes sufficient to feed an army, sidelined with prices which made reasonable expenditure a ridiculous impossibility (Sister Carrie)BAWC

[14] Merriam had his bank balance of $2,800 in his pocket in large bills, and brief instructions to pile up as much water as he could between himself and New York. (Whirligigs)BAWC

[15] He always prints, I know, ’cos he learnt writin’ from the large bills in the bookin’ offices. (Pickwick Papers)BAWC

These data are ‘rich’ in the sense that we can in each case determine a distinctive meaning of our collocation, even though a portion of the data must be unpredictable (cf. § 11, 73, 79ff).
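Retrieving such collocations together with their surrounding words might be sketched as follows. The sketch rests on stated assumptions: the function names, the window of three tokens, and the crude tokeniser are all hypothetical choices, and the four snippets are abbreviated from samples [12-15] rather than drawn from a real corpus query.

```python
import re

# Abbreviated snippets from samples [12-15]; a toy stand-in for corpus data.
SNIPPETS = {
    "Age Concern (BNC)": "If you have a large bill which you cannot pay",
    "Sister Carrie (BAWC)": "The large bill of fare held an array of dishes",
    "Whirligigs (BAWC)": "his bank balance of $2,800 in his pocket in large bills",
    "Pickwick Papers (BAWC)": "he learnt writin' from the large bills in the bookin' offices",
}


def tokenize(text):
    """Lower-case the text and keep alphabetic word forms only (a crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def collocation_contexts(texts, node_forms, collocate_forms, window=3):
    """Return (source, span) wherever a collocate falls within `window` tokens of a node."""
    results = []
    for source, text in texts.items():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok in node_forms:
                lo, hi = max(0, i - window), i + window + 1
                neighbours = tokens[lo:i] + tokens[i + 1:hi]
                if any(t in collocate_forms for t in neighbours):
                    results.append((source, " ".join(tokens[lo:hi])))
    return results


for source, span in collocation_contexts(SNIPPETS, {"bill", "bills"}, {"large"}):
    print(f"{source}: {span}")
```

The program only finds the co-occurrences; deciding whether a given 'large bill' is a charge, a menu, a banknote, or a poster remains a matter of reading the retrieved context, which is precisely where the richness of the data does its work.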

16. Another influential parameter could be termed literary data versus non-literary data. My own corpus of British and American Writers (BAWC), whose construction I shall briefly describe later on (§ 115ff), contains mostly texts that would be labelled ‘literature’ for purposes such as library catalogues. I have elsewhere proposed to define literature as socially accredited discourse about alternative worlds, which we can compare and contrast with our notions of our own (e.g. Beaugrande 1988). This principle of ‘alternativity’ underwrites the human validity of fiction despite its not being ‘fact’: it uses imaginary people and events to convey statements about the human situation.

17. For this reason, most literature sustains a ‘world-creating’ potential by providing a rich background and setting for its audiences. Authors feel encouraged to present rich discourse frames telling how things were said instead of just reporting the words, e.g.:

[16] ‘Tis because you are an indifferent person’, said Lucy, with some pique, and laying a particular stress on those words, ‘that your judgment might justly have such weight with me’. (Sense and Sensibility)BAWC

[17] ‘Step this way, if you please!’ I repeated, in so determined a manner that he could not, or did not choose to resist its authority. (Tenant of Wildfell Hall)BAWC

Representing such data without the frames would lose some of the information conveyed by the frames (§ 118).

18. A final parameter I would propose would be academic data versus non-academic data: whether or not a text is produced in or for some institution of ‘higher learning’ or ‘research’. Academic texts frame academic sources by their prestige and conviction rather than their manner of speech, e.g. [18]; and construct periodic sentences like tapestry, better suited to the eye than the ear, e.g. [19]. Such samples would be difficult to imagine anywhere but in academic data.

[18] The Hon. and Rev. W. Herbert, afterwards Dean of Manchester, in the fourth volume of the Horticultural Transactions, declares that ‘horticultural experiments have established, beyond the possibility of refutation, that botanical species are only a higher and more permanent class of varieties’. (The Origin of Species)BAWC

[19] On the throne of Samarcand, Timour displayed his magnificence and power; listened to the complaints of the people; distributed a just measure of rewards and punishments; employed his riches in the architecture of palaces and temples; and gave audience to the ambassadors of Egypt, Arabia, India, Tartary, Russia, and Spain, the last of whom presented a suit of tapestry which eclipsed the pencil of the Oriental artists. (Decline and Fall of the Roman Empire)BAWC

The richness of academic data clearly differs in this regard from the richness of literary data. We would not have, for example, ‘The Hon. W. Herbert declared, with some pique…’.

19. The contrast between sparse and rich seems obvious for [1] versus [7-10], or for [11] versus [12-15], but must be partially an intuitive one, and cannot be reduced to ‘syntactic rules’ or ‘semantic features’. But then linguistic theory is not required to do so if authenticity is decided by attestation, not by abstract features or formal structures (§ 13). One could perhaps extract from authentic materials some individual sentences that would seem as sparse as non-authentic data, e.g.:

[20] I know the man. (Uncle Tom’s Cabin)BAWC

[21] You are only fourteen. (Cash Boy)BAWC

But their sparseness is merely an illusion created by isolating the data from their richer contexts, e.g., [20] being a reason to believe what Cassy says to Uncle Tom about the odious Legree [20a]; or [21] being a reason to doubt whether Frank will be able to ‘take care of Grace’ [21a].

[20a] now you’ve got his ill will upon you, to follow you day in, day out, hanging like a dog on your throat — sucking your blood, bleeding away your life, drop by drop. I know the man.

[21a] ‘But Grace? She is a delicate girl’, said the mother, anxiously. ‘She cannot make her way as you can.’ ‘She won’t need to’, said Frank, promptly; ‘I shall take care of her.’ ‘But you are very young even to support yourself. You are only fourteen.’

When linguists purport to analyse data ‘free of context’ (§ 15), they are instead creating artificially sparse contexts where the activities of imagining ‘possible settings’ are performed under the counter. The sparseness is intensified when linguists go on to convert their non-authentic data into some formal representation, such as a ‘syntactic structure’ [22], or a ‘general postulate’ to signify that every office building has a window [23].

[22] the man hit the ball => T + N + Verb + NP (Chomsky 1957: 26f).

[23] a: (∀x) [office (x) ⇒ building (x)]; b: (∀x) [building (x) ⇒ (∃y) (has (x, y) & window (y))] (van Dijk 1977: 100)

But as long as we are still working with expressions of natural language, such as ‘window’, we retain contact with contexts. The key factor is that authentic data characteristically occur in rich contexts, which, in my BAWC data, shed an unflattering light on ‘buildings’ well-supplied with ‘windows’:

[24] Coketown […] had a vast pile of buildings full of windows where there was a rattling and a trembling all day long (Hard Times)BAWC

[25] here and there would be a great factory, a dingy building with innumerable windows in it, and immense volumes of smoke pouring from the chimneys (The Jungle)BAWC

20. I would vigorously contest the assumption implicit in modern linguistics that the processes of inventing sparse data and making rich data sparse increase the validity and generality of our description of language (cf. § 82ff, 92). I would assert just the opposite insofar as these processes are usually arbitrary and uncontrolled. We are flatly presented with the results (e.g. ‘John admires sincerity’, § 10f) rather than with an explicit account of how the linguist went about inventing or formalising the data, as if these processes were fully underwritten and guaranteed by native speaker intuition or by an academic degree in linguistics (cf. § 8, 38).

21. Moreover, I submit that since language use is empirically found to constitute rich data, scientific method demands that these must be the central basis of a valid description or explanation. And since the products of intuition are empirically found, in the discourse of many linguists, to constitute sparse data, the production of data cannot be a valid function of intuition (§ 40ff). Instead, its valid function is to sustain bridges between authentic data and rich contexts which quite naturally cover more than the data themselves express — ‘making rich’ the way most ordinary discourse participants do, not ‘making sparse’ the way some formal linguists do. The validity of our ‘enrichments’ depends on whether they can be verified to be typical of the language community (not the ‘ideal speaker-hearer’); and doing so is one of the major tasks we face for the future. But to determine what modes and instances of enrichment should be verified, we are both obliged and entitled to rely on the interaction of authentic data with our own intuition.

C. Four responses to corpus research

22. Corpus work has become a testing ground for the various outlooks upon authentic data in linguistics, and in other approaches to language as well. I shall sketch four responses with light-hearted but hopefully mnemonic labels.

23. At one extreme, the cold shoulder response totally ignores the results of corpus research. This response signals a strong commitment to linguistic approaches based on non-authentic data, and a determination to hide from new facts like those pious prelates who refused to look through Galileo’s telescope (Sinclair 1994). Curiously, we find two utterly disparate outlooks at this same extreme: the generativists who regard real language as ‘deviant’ and replace it with ideal language (§ 4f); and the prescriptivists who regard real language as ‘non-standard’ and replace it with purified language (§ 2.5). Both groups feel entitled to understand the nature of ‘language’ far better than ordinary speakers do, but for entirely disparate reasons: the generativists because they have access to the ‘perfect knowledge’ of the ‘ideal speaker-hearer’ (§ 8); and the prescriptivists because they know just what is ‘correct’ or ‘incorrect’, ‘good English’ or ‘bad English’ (§ 2.5).

24. At the opposite extreme, the red carpet response is delighted to finally have such large data samples and heartily welcomes the results. This response signals a strong commitment to linguistic approaches based on texts but hitherto compelled to follow opportunistic strategies by getting authentic data wherever we happened to find them; and by making compromises to invent plausible data when authentic data were not sufficiently available (cf. § 117). Now we are happy indeed to be freed from this necessity by the ‘corporate bridges’ that data can provide between ‘language’ and ‘text’ — or langue and parole, competence and performance, and so on. These bridges consist principally of regularities which are more specific than the language but more general than the text; and which are vital for making texts sound ‘fluent’ or ‘idiomatic’ (Beaugrande 2000, 2001b).

25. This response appears typical for systemic functional linguistics, which has all along respected the value of authentic data even whilst highlighting those systemic factors which are ‘realised’ or ‘actualised’ by instances in the data (e.g. Halliday 1985). Today, corpus data are being hailed as the most promising bridge between system and instance, and have lent new energy to the project of using statistical frequencies to assign relative probabilities to the options of the grammar (e.g. Halliday 1991, 1992) (cf. § 130-137). The focus of systemic functional linguistics upon the paradigmatic, not just the syntagmatic, fosters a natural interest in how some choices are made in coordination with others, and how frequently. But prior to corpora, this had to be worked out by hand, which, even for a small corner of the lexicogrammar, can be horrendously laborious.
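The arithmetic behind 'using statistical frequencies to assign relative probabilities' can be shown in a small sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the option labels are placeholders, and the counts are invented rather than taken from Halliday's studies or from any corpus.

```python
from collections import Counter


def option_probabilities(observations):
    """Estimate each option's probability as its relative frequency."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {option: n / total for option, n in counts.items()}


def conditional_probabilities(pairs):
    """For (a, b) observations, estimate P(b | a) as count(a, b) / count(a)."""
    pairs = list(pairs)  # allow any iterable, since we pass over it twice
    joint = Counter(pairs)
    marginal = Counter(a for a, _ in pairs)
    return {(a, b): n / marginal[a] for (a, b), n in joint.items()}


# Invented tags for one two-option system (illustration only).
polarity_tags = ["positive"] * 9 + ["negative"] * 1
print(option_probabilities(polarity_tags))

# Invented co-selections from two systems, to show 'coordination' of choices.
co_selections = (
    [("declarative", "positive")] * 3
    + [("declarative", "negative")] * 1
    + [("imperative", "positive")] * 2
)
print(conditional_probabilities(co_selections))
```

The first function treats each system in isolation; the second asks how often one option is chosen given that another has been, which is the kind of tally that was so laborious to work out by hand.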

26. The response is typical also for text linguistics, at least in my own view of the field. At its best, text linguistics has always been an implicit mode of small-corpus linguistics, coping with practical and theoretical problems as they arose. Our guiding rationale throughout has been that working with authentic texts will bring to light aspects of language and communication that we otherwise miss. Such was the essential message and demonstration of the 1981 Introduction (Beaugrande and Dressler 1981).

27. In between the two extremes, the limp handshake response publicly welcomes corpus research but privately harbours misgivings about its potential for creating pressure to change accepted views of language and familiar methods of teaching it. Our results are at most regarded as issues to place alongside established ones without disrupting them, e.g., as modules to be inserted somewhere into the business-as-usual ‘lesson plans’ in EFL teaching.

28. Also in between the two extremes, the poison needle response exploits academic or institutional leverage to fend off the implications and results of corpus research. Our results are regarded as heresies — ‘mistakes, inadequacies, limitations, distortions, biases’ etc. etc. — against which the unwary world must be resoundingly warned.

29. Whereas the identities of ‘cold shoulderers’ and ‘red carpeters’ are clearly on public record, the same cannot be said of these two groups in between. Some ‘limp handshakers’ on the conference and lecture circuit change into ‘poison needlers’ in the shielded preserves of anonymous reviewing and academic politicking. In fact, the ‘review’ process is never so prone to unprofessional manoeuvring as when a discipline confronts a substantive body of evidence with a ‘revolutionary’ potential regarding dominant theories and methods, as I have documented elsewhere (Beaugrande 2001c) (§ 31). The harder it becomes to deny the findings of authentic data, the harder these groups will work to keep them out of print and foreclose any free and open discussion of the issues.

D. Twenty arguments against corpus research

30. The time seems opportune to clear the air by reviewing the merits of some arguments being commonly lodged against corpus research, whether publicly or privately. They offer predominantly theoretical motives to conclude that corpus research cannot, in principle, produce significant or applicable results; they pass quietly over their practical motive to eschew the detailed labour and technical training corpus research requires. Most of the arguments are found on close scrutiny to be empty or irrelevant (in the parlance of the BNC, to be a right load of old codswallop), arising from some fortuitous or wilful misrepresentation of language and discourse in general or of corpus research in particular. Others point to genuine substantive problems which we must confront but which will not, as is apparently hoped, drive us to give up in despair.

D.1. ‘Corpus research is a new fad.’

31. A ‘fad’ is by definition a new trend which rapidly achieves general acceptance through sheer brash novelty; and corpus research simply doesn’t qualify. Placing our results can still be difficult in mainstream journals of theoretical or applied linguistics whose ‘peer reviewers’ see corpus research as a threat to their preferred approaches. Until our research attains the mainstream, we authors should seek alternative strategies and outlets. We can sustain interactive websites to post our work whenever, in our own judgment, it seems to be of interest. Readers who do or do not agree with it are invited to justify their opinions with thorough and substantive arguments instead of using ‘anonymous negative reviews’ to suppress our work with no public accountability.

32. And, far from being a ‘novelty’, corpus research is in reality older than most of the linguistics now arrayed against it. Already at the inception of the field, corpus research was established in fieldwork. Philology had inaugurated the compilation of atlases for well-known languages (e.g. Wencker 1887-95), which was continued in descriptive linguistics (e.g. Kurath 1949). In my own view, the most impressive achievement in all of linguistics was the description of previously undescribed languages of native America, Africa, and the Pacific, often without the aid of bilingual informants or decent audiovisual recording equipment (e.g. Sapir 1922; Hockett 1939; Pike 1944; Hoijer 1945; Newman 1947; Pittman 1948). These studies founded a stream of corpus-based fieldwork that has continued right up to the present (e.g. Eberhard 1995; Wannemaker 1999; Newman and Ratliff eds. 2001).

33. This monumental work most firmly established linguistics as an accredited social and human science up until the so-called ‘scientific revolution’ that turned against descriptive methods in the 1960s. This turn was to some extent prefigured in long-standing uncertainties about data since Saussure, as I noted in section A; but the deliberate and programmatic substitution of invented data for observed data, and of the scientist’s own intuition for the reports of informants, was a real novelty without precedent in any science, and from today’s standpoint, deserves to be called instead an ‘anti-scientific revolution’ (§ 12). Thus cut loose from authentic data, and licensed to devise arbitrary ‘formalisations’, linguistics has proliferated genuine fads (§ 3). Some forty ‘formal theories’ of language have competed for adherents (Escribano 1993); and the definitive refutation of any one is hardly feasible if substantive data cannot be adduced.

34. Corpus research accordingly represents a return to the roots of linguistics, now equipped with cutting-edge technologies (cf. McEnery and Wilson 1996). We seek to bring about a ‘scientific revolution’ that restores what was lost in that ‘anti-scientific revolution’ against data. In the process, our dependence upon authentic data for our claims and demonstrations precludes any mere faddishness. However, we are definitely in a phase of swift evolution in our theories and practices, and the outcome is by no means clearly foreseeable (Sinclair 1997a, 1997b, 2001). And surely that is grounds for optimism, not pessimism.

D.2. ‘Corpus research is subjective, not objective.’

35. This argument is a heritage of the positivism and physicalism that triumphantly heralded the ‘unified science’ in the early 20th century (Neurath et al. 1938). It was predictably applied to linguistics to assist its accreditation as a relatively new science (e.g. Bloomfield 1930). Since language as a whole hardly seemed amenable to treatment as a physical object, it was dismantled to isolate some of its more amenable aspects, especially the phonetics of articulation and the acoustics of audition (Jones 1914). Under the aegis of behaviourism, real speech could thus be readmitted despite exclusions like Saussure’s (§ 4), e.g., as a chain of ‘verbal behaviour’ composed of objectively observable pairs of ‘stimulus and response’ (Bloomfield 1933; Skinner 1957). In this purview, ‘speech’ constitutes ‘cause-and-effect sequences exactly like those we may observe in the study of physics’ (Bloomfield 1933: 33).

36. The programmatic turn of mentalism against behaviourism curiously retained some notion of language as a set of physical objects. Now, the convention in ‘physics’ whereby ‘any scientific theory is based on a finite number of observations’, which it ‘relates’ and ‘predicts’ ‘by constructing general laws’, was compared to a ‘grammar of English based on a finite corpus of utterances (observations)’, ‘containing grammatical rules (laws) stated in terms of phonemes, phrases, etc.’ and ‘expressing structural relations among sentences of the corpus and the indefinite number of sentences generated by the grammar’ (Chomsky 1957: 49). This comparison presumably helped the new ‘theory’ along, even though generative linguistics soon found its reasons to banish both ‘corpus’ and ‘observation’ (§ 4). In return, the sentence assumed some traits of a physical object. It became a ‘string’ with a ‘surface’; its ‘structures’ can ‘branch’ to the ‘left’ and the ‘right’, or can be ‘raised’ and ‘lowered’; and so on. Such objectifying notions help to fill the void left by draining the authenticity out of the data.

37. Corpus research holds the potential to transcend the competitive dichotomies between objective versus subjective, and between behaviourism versus mentalism. The larger the corpora and the more consensus and coverage we can achieve, the brighter our prospects to attain intersubjectivity vis-à-vis a language community. If solely subjective methods seem too broad and loose, solely objective methods seem too narrow and rigid for data as rich and variegated as ours.

38. In view of the problems I have aired above, the position of the corpus linguists themselves cannot be treated so casually as that of linguists who claim to know all about the ‘ideal speaker-hearer’ (§ 8). We cannot escape our own subjectivity as the physicalists and behaviourists aspired to do; but neither can we exalt it as our privileged source of data, as the mentalists and generativists proposed to do. Instead, we should invest it in our explorations like a partial and fallible map to be filled in or corrected when the data require it. For example, when I read this passage some years back:

[26] They began trotting, […] tanned graduate students striding like gazelles ahead of the pack; middle-aged duffers, white hairy legs pumping, bringing up the rear (Lonely Hearts of the Cosmos) 214

I projected too much from the context and imagined ‘duffers’ to be middle-aged, flabby men. Today I can see from BNC data they are just people who are awkward at something:

[27] she had always been considered a complete duffer at languages. (Hypnosis Regression Therapy)BNC

[28] One longs for the Germans to give up trying to make facsimiles of other people's cheeses. They are terrible duffers at it. (An Omelette and a Glass of Wine)BNC

I suspect such minor personal lapses are more commonplace than we’d like to think. But corpus data offer us the means to be less of a duffer at grasping unfamiliar words in context.

D.3. ‘Corpus research seeks to banish intuition.’

39. This argument sounds like the exact reverse of the previous one, and faults us for not being subjective enough and not allying ourselves with mentalism against behaviourism. The sources for this argument (e.g. Widdowson 1991; Owen 1993) evidently propose yet another competitive dichotomy, this one between corpus versus intuition, as if we must choose only one. This dichotomy entails a serious category error, because every mode of contact with language implicates intuition. No matter how rich, the data never speak for themselves or declare their own significance. And corpus linguists must at least partially approach data from the standpoint of a potential audience, say, the fans who read the sports news in the Independent and Today [7-8], and who have heard of Leconte and Hoddle (which I hadn’t).

40. However, we would transform the role and function of intuition so as to enlist it in the purposes of corpus research (Francis and Sinclair 1993). It should be transposed from before the fact — the source for supplying the occasional sentence from the linguist’s own mental data-bank — to after the fact — the resource for interpreting discourse samples from a collaborative electronic data bank. Our interpretations will in part run parallel to those of the wider language community, but will also run at the higher awareness and in the broader scope enabled by multiple bridges between authentic data and rich contexts (§ 21, 58). However, we imply no claim to any superhuman rationality whereby we command access to some ‘universal deep structure’ of language or to the ‘perfect knowledge’ of some ‘ideal speaker-hearer’ (§ 8, 23).

41. This transposition is vital insofar as intuition is much less adept in prediction than in retrospection. If, like me, we work as English teachers, we often get asked how one should say this, that, or the other; working with corpus data has made me much more circumspect about answering. My own intuition, at any rate, does not run at the degree of precision needed to give reliable information on specific questions. When a student wrote [29], my response was that the Verb is not used that way. But corpus data proved me wrong with samples like [30-31]. And in the 1913 edition of Webster's Dictionary (before radio days), this meaning was the main one [32].

[29] The woman follow the oxen them to broadcast seeds

[30] sowers flinging their seed about broadcast (Mayor of Casterbridge)BAWC

[31] The second method is to broadcast the seeds together with not more than 1 kg. to the acre of rape and turnips in late June or early July. (The Challenge of Smallholding)BNC

[32] Broadcast (Agric.) 1. A casting or throwing seed in all directions, as from the hand in sowing; 2. Scattering in all directions (as a method of sowing); opposed to planting in hills, or rows

42. Sinclair (2001: 10) has recently remarked that it might be ‘difficult for one who has been a Professor of Modern English Language for 35 years to admit that there are words whose meaning he does not know’ and ‘several thousand words of English’ he ‘could not define’. But this admission feels difficult mostly where teachers of English have, willingly or not, been seen in the role of infallible authorities — an unfortunate tendency from which access to corpus data should finally release us (§ 155). In the future, our role will be to assist people in accessing data they need; and our authority will depend on having scanned large data sets, and not just on holding an ‘advanced degree in language’ (cf. § 8, 23, 40).

43. What teachers and learners of English alike require, and Sinclair says so, is not massive standing vocabularies so much as skills for grasping the meanings of expressions in real contexts. Even familiar words may be found in unfamiliar meanings, as befell me with ‘broadcast’. The trick is to pick the rich data whose contexts are most helpful. For example, the Modifier ‘knackered’ was not in my own vocabulary, and rich data like [33] intuitively led me to the meaning of ‘physically exhausted’. I could then extend the meaning by analogy to sparser data concerning ‘inflation’ [34] or a ‘car axle’ [35]. But I had to find a different rich context for the Noun ‘knacker’ [36], which suggests to me an ominous derivation for the Modifier.

[33] I forced myself towards it. I was utterly knackered. It took my last reserves of strength and will to reach it and then to heave myself in. (Pilot)BNC

[34] you can’t have a decent life if you’ve got high inflation all the time knackering you up (conversation)BNC

[35] In no time, we’ll have done in £500 worth of tyres and knackered the rear-axle. (Esquire)BNC

[36] Richard Cross was a knacker in Camden Town. He supplied dead horses and asses for dissection, and also dealt in dead cows. (Royal Veterinary College)BNC

44. Still less reliable is intuition in predicting frequencies. When I was adapting Halliday’s (1985) ‘functional lexicogrammar’ that describes ‘Processes’ by such categories as ‘Transitivity’ and ‘Ergativity’, my intuition predicted that a key distinction would be whether and how far a Process is judged to be under the control of the Agent or Initiator (Beaugrande 1997). This in turn ought to show up as frequencies in corpus data for the Verbs collocating with ‘could not help’ and ‘couldn’t help’, where you say that a spontaneous Action was not fully under control. This usage might be classed as a Face-Saving Auxiliary, along with ‘couldn’t resist’, ‘couldn’t refrain from’, and so on: expressions which attenuate the Agency of Process Verbs after some Action that might indicate insufficient regard for social norms.

45. But my intuition could by no means have predicted the actual frequencies of the particular collocations I found among the 515 occurrences returned from the Bank of English (BoE) in July 1994, then at 226 million words. There, just four Process Verbs, ‘feel’ (68 occurrences), ‘think’ (59), ‘notice’ (58), and ‘wonder’ (49), totalled up to 234, 45% of the data. Still, my intuition can retrospect upon these data by noting that these Verbs are prime examples of Processes which might well elude the Agent’s full control and which might lead into emotions, perceptions, and thoughts which render some speakers of English self-conscious.

46. In the BNC, which is 44% of the size the BoE was at that time, I find 225 attestations of ‘could not help’ and 378 of ‘couldn’t help’ collocating in these proportions for the same Verbs: ‘feel’ (61), ‘think’ (58), ‘notice’ (44), and ‘wonder’ (37), for a total of 200, 33% of the data. In percentages of the BoE totals, these would be 88 - 98 - 76 - 75, or on average 84%, quite high for a corpus only 44% as large. This factor might be due to differences in their composition, notably the higher proportion of news media in the BoE and of popular fiction in the BNC; news reporters hardly write about what the Prime Minister or the Queen ‘couldn’t help feeling’. Or, we may simply be encountering the accidental scatter to be expected by working at finer degrees of precision in very large complex systems. But the way the four Verbs nicely lined up the same relative to each other in both corpora suggests that scatter may not prove to be a serious problem.
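The arithmetic in the two preceding paragraphs can be retraced with a short sketch. It uses only the counts printed above; the variable and function names are my own and purely illustrative. Recomputing from the counts as printed can differ by a point or two from the rounded figures quoted in the text (for instance, 61 against 68 comes out nearer 90% than 88%), which does not affect the comparison of rank order.

```python
# Counts copied from paragraphs 45 and 46; all names here are illustrative.
VERBS = ["feel", "think", "notice", "wonder"]
BOE = {"total": 515, "verbs": {"feel": 68, "think": 59, "notice": 58, "wonder": 49}}
BNC = {"total": 225 + 378, "verbs": {"feel": 61, "think": 58, "notice": 44, "wonder": 37}}


def top_share(corpus):
    """Return (sum of the four verb counts, their share of all hits in percent)."""
    subtotal = sum(corpus["verbs"][v] for v in VERBS)
    return subtotal, 100 * subtotal / corpus["total"]


def relative_counts(smaller, larger):
    """Per verb, the smaller corpus's count as a percentage of the larger one's."""
    return {v: 100 * smaller["verbs"][v] / larger["verbs"][v] for v in VERBS}


def rank_order(corpus):
    """Verbs sorted from most to least frequent in the given corpus."""
    return sorted(VERBS, key=lambda v: corpus["verbs"][v], reverse=True)


if __name__ == "__main__":
    for name, corpus in (("BoE", BOE), ("BNC", BNC)):
        subtotal, share = top_share(corpus)
        print(f"{name}: {subtotal} of {corpus['total']} hits ({share:.0f}%)")
    ratios = relative_counts(BNC, BOE)
    for verb in VERBS:
        print(f"{verb}: {ratios[verb]:.0f}% of the BoE count")
    print("same rank order:", rank_order(BOE) == rank_order(BNC))
```

The last line checks the point made at the end of the paragraph: the four Verbs fall in the same order of frequency in both corpora.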

47. In my BAWC data, in its turn only 38% as large as the BNC, I found 944 occurrences of ‘could not help’ and 291 of ‘couldn’t help’ for a rather massive total of 1235. We can safely attribute this high proportion to the literary status of the data, as a text domain where feelings, thoughts, and so on, are often presented from a narrator’s standpoint. Also, the 19th-century society described by many of my writers had rather firm ideas about what one really shouldn’t ‘feel’ though one may not be able to ‘help’ it: in my data, ‘bitterness, distrust, jealousy, envious, angry, hurt’ etc.; and, more intensely, ‘utter hopelessness, delirious happiness, infinite pity’.

48. The four Verbs did not line up this time, but collocated at these proportions in the BAWC: ‘feel’ at 84, ‘think’ at 106, ‘notice’ at 17, and ‘wonder’ at 26 for a total of 233, roughly 19% of all the data. In relation to the BNC data, ‘feel’ and ‘think’ are sharply skewed at 137% and 252%; ‘notice’ fits almost exactly at 40%; and ‘wonder’ is less sharply skewed at 72%. The presence of ‘think’ becomes still more obtrusive when we take into account the 30 occurrences of ‘feel that’ plus a Clause in a sense similar to ‘think that’; still, in 12 of those, feelings were clearly involved too, e.g.:

[37] he could not help feeling that he was getting the worst of it—there was some faint stigma attached (Sister Carrie)BAWC

[38] she cannot help feeling that her children are cruelly handicapped by the fact that he is their father, nor can she help feeling guilty about it (In Defense of Women)BAWC

A plausible explanation might be the far heavier representation of literary and popular fiction in the BAWC than the BNC, and the attractions for authors to tell us what someone ‘couldn't help thinking’, even if one would hardly say so, e.g.:

[39] Mr. Wharton could not help thinking: ‘How poorly this young man compares with my young friend. Still, as he is Mrs. Bradley's nephew, I must be polite to him.’ (Cash Boy)BAWC

[40] I could not help thinking that, with his queer head and length of thinness, he was made to hop along the road of life rather than to walk, […] and I bade my inward spirit keep close to discretion. (Pointed Firs)BAWC

49. Such explanations are themselves products of intuition, but they are still data-driven; the data placed before me some facts of usage I had never noticed as such, despite fairly extensive readings in literature. Of course, without a great deal more data, I cannot say whether the proposed category of ‘Face-Saving Auxiliary’ should be regarded as explanatory. I can only offer it as plausible and point to confirming evidence, such as the powerful preference to colligate with ‘I’ as the Subject who ‘couldn’t help’: 150 occurrences in the BoE, 171 in the BNC, and 448 in the BAWC; the face that most often needs saving is the speaker’s.
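How such counts of Subjects and Verbs might be gathered can be sketched in a few lines. This is only a rough illustration, not the procedure behind the figures above: the pattern, the function name, and the crude rule of taking the word just before the phrase as its Subject are all my own assumptions, and the input is merely the handful of authentic samples [37-40] already quoted, lightly shortened.

```python
import re
from collections import Counter

# Authentic samples quoted above as [37]-[40], lightly shortened.
SAMPLES = [
    "he could not help feeling that he was getting the worst of it",
    # [38] uses 'cannot help', which the pattern below deliberately does not cover.
    "she cannot help feeling that her children are cruelly handicapped",
    "Mr. Wharton could not help thinking: How poorly this young man compares",
    "I could not help thinking that, with his queer head and length of thinness",
]

# Crude assumption: the word right before the phrase is its Subject,
# and the '-ing' word right after 'help' is the collocating Verb.
PATTERN = re.compile(r"(\S+)\s+(?:could not|couldn['’]t)\s+help\s+(\w+ing)\b")


def collocates(texts):
    """Count Subjects and '-ing' Verbs around 'could not help' / 'couldn't help'."""
    subjects, verbs = Counter(), Counter()
    for text in texts:
        for subject, verb in PATTERN.findall(text):
            subjects[subject] += 1
            verbs[verb] += 1
    return subjects, verbs


if __name__ == "__main__":
    subjects, verbs = collocates(SAMPLES)
    print("Subjects:", subjects.most_common())
    print("Verbs:", verbs.most_common())
```

A real query would need lemmatised or tagged data to avoid false hits and to catch Subjects that do not stand directly before the phrase; the sketch only shows the shape of the count.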

D.4. ‘Corpus data represent outer behaviour rather than inner knowledge of language.’

50. This argument, also put forward by Widdowson (1991), fits the previous one, portraying us to be studying only behavioural factors and ignoring mental factors. Here we should recall the long-standing controversy within linguistics and adjacent fields like philosophy and psychology between behaviourism versus mentalism (e.g. Skinner 1957; Chomsky 1959, 1965; discussion in Beaugrande 1980). Both sides presented their case as if we must accept the one and reject the other — a gesture closer to academic politics than scientific method. Purely behavioural data could only consist of observable events of the body. Such data could represent text or discourse only as an array of articulatory and acoustic operations for speech, or of inscriptions and visual recognitions for writing (§ 35); and corpus linguistics certainly does not propose to describe language in those terms, nor does a corpus represent language that way.

51. Purely mental data would only consist of non-observable events of the mind. Such data could represent text or discourse only as an array of meanings, intentions, mental images, and so on, as distinct from the act of their expression. But they can become our data only when they are expressed, and there some reprocessing occurs, notably, to convert a hierarchical network of activations into linear sequences (Beaugrande 1980, 1984).

52. Surely language and discourse represent the most elaborate interaction of body and mind. So our research needs to sustain a dialectical cycle between the behavioural and mental. One prospect is to exploit the rich indicators in corpus data about how outward behaviour might be interpreted in respect to people’s inner knowledge, e.g. when you ‘survey them from head to foot’, or ‘look full in their face’:

[41] after surveying Mr. Winkle from head to foot, [he] said: ‘You’re a wery humorous young gen’l’m’n, you air, sir!’ ‘What do you mean by this conduct, Sam?’ inquired Mr. Winkle, indignantly. ‘Get out, sir, this instant.’ […] ‘I shall leave this here room, sir, just precisely at the wery same moment as you leaves it’, responded Sam, speaking in a forcible manner, and seating himself with perfect gravity. [And he] planted his hands on his knees, and looked full in Mr. Winkle’s face, with an expression of countenance which showed that he had not the remotest intention of being trifled with. (Pickwick Papers)BAWC

Such indicators help us to enlist our data in interpreting them in ways the producers might plausibly have intended.

D.5. ‘Corpus data do not reveal what is possible but only what is performed.’

53. This argument, once more advanced by Widdowson (e.g. 1991), might be construed as yet another re-issue of the dichotomy of langue and parole or competence and performance, but the fit is not exact. The possible must include all of the performed, whether a speaker or writer is judged competent [42] or incompetent [43] in an ordinary sense:

[42] There was a communication before her, one which she only could be competent to make: the confession of her engagement (Emma)BAWC

[43] What he would have asked her he did not say, and instead of encouraging him she remained incompetently silent. (Mayor of Casterbridge)BAWC

According to Widdowson (1991: 13), ‘Chomsky’s view is that you go for the possible, Sinclair’s view is that you go for the performed’. But as I have pointed out, Chomskyans appear to go further by proposing a theory to distinguish the possible from the impossible; and when they invent ‘well-formed’ and ‘ill-formed’ sentences that accentuate this distinction, they in effect contrast two sets of non-events (§ 10ff).

54. For corpus research, the conditions of a performance possess far greater relevance than the simple fact, and here too we can usefully distinguish authentic from non-authentic data (§ 14). The inventing of data by a linguist is a peculiar performance and so is less, rather than more, significant for representing the ‘competence’ underlying authentic performances. As I have remarked, the linguist paradoxically implies a claim to special competence, despite the assertion of a ‘completely homogeneous speech-community’ (§ 8).

55. Within the set of authentic events, we should further distinguish between more or less probable, and this we can attempt only by examining very large sets of events — more precisely, interactions among co-occurring events wherein some events make others more or less probable, and also more or less natural, fluent, idiomatic, and so on, for intuitive retrospection. The collocations and the colligations in performance offer the only rational perspective on collocability and colligability in competence by suggesting where to search for them.

56. Sinclair (1994) has suggested a visual analogy to the list display returned when we query a corpus for a given key word or phrase: reading horizontally, we encounter performance; reading vertically, we encounter competence. I would add that either dimension is a glimpse rather than a vision, or a snapshot rather than a video — both the horizontal and the vertical extend further than our vision could take in. Competence will routinely encompass more uses for the key than we see; and performance will be saying more than the context in the display.
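The list display in question is usually called a key-word-in-context concordance, and a minimal sketch may make the two reading directions concrete. The function name, the bracket notation for the key, and the window of 30 characters are my own illustrative choices, not features of any particular corpus software; the input is again the authentic samples [37], [39], and [40] quoted earlier, lightly shortened.

```python
import re

# Authentic samples quoted earlier as [37], [39], and [40], lightly shortened.
SAMPLES = [
    "he could not help feeling that he was getting the worst of it",
    "Mr. Wharton could not help thinking: How poorly this young man compares with my young friend",
    "I could not help thinking that, with his queer head and length of thinness, he was made to hop",
]


def kwic(texts, key, width=30):
    """Return one display line per occurrence of `key`, with the key aligned in a column."""
    pattern = re.compile(r"\b" + re.escape(key) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-justify the left context so every key starts at the same column.
            lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


if __name__ == "__main__":
    for line in kwic(SAMPLES, "could not help"):
        print(line)
```

Reading along any one line gives a stretch of a single performance, cut off at the window's edge; reading down the aligned column gives the recurrent pattern across performances. The cut-off right context also illustrates the point that the performance says more than the display shows.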

D.6. ‘Corpus linguistics adopts an exclusively third person perspective.’

57. This mildly arcane argument ties in to the previous two. It too was advanced by Widdowson (1991: 15), and in these terms:

The description of internalised language requires a first person perspective. You really have no choice if you are seeking to prise knowledge out from the recesses of the mind: knowledge which is not realised as behavioural evidence available to the observer […] Corpus linguistics […] adopts the third person perspective and only describes what can be observed, [and so cannot] reveal […] ‘member categories’ […] of the speech community itself which account for their intuitions about the language.

Widdowson appears to conflate the ‘Persons’ in the grammar of English Verbs, which are fairly distinct in their forms, with the roles of the participants in discourse, which are not. A speaker or writer usually has no need to frame his or her own views beside saying ‘I believe’ or ‘I assert’ and such like; and in academic discourse, the use of the First Person Singular is indeed actively discouraged by prescriptive teachers or editors, ostensibly to enhance ‘objectivity’. (I cannot agree that it does; in the present paper, advocating the return to real data, I return myself to a real author.) In literary discourse, in contrast, author and reader can emerge as ‘I’ and ‘you’ by convention, e.g.:

[44] Gentle reader, may you never feel what I then felt! May your eyes never shed such stormy, scalding, heart-wrung tears as poured from mine. (Jane Eyre)BAWC

The frank literary conventions probably tell us more about participant roles than do the staid academic ones. As speakers or writers, we so frequently use ourselves as models for our hearers or readers that first and second person roles only occasionally need to be made fully distinct, e.g., in writing letters [45]. Also ‘you, my reader’ is not clearly distinct from just ‘the reader’ in the Third Person, e.g. [46].

[45] Dear Niece: I am writing this in a hurry, as we are going a week before we expected to. I think you will find everything all right. (Lavender and Old Lace)BAWC

[46] It would make the Reader pity me, or rather laugh at me, to tell how many awkward ways I took to raise this Paste (Robinson Crusoe)BAWC