Large Corpora and Applied Linguistics

H.G. Widdowson versus J.McH. Sinclair1




[COMMENTARY: H.G. Widdowson has repeatedly voiced unfounded criticisms of corpus research and of its potential application to language teaching. And attempts to publish refutations, including my own, have been blocked by editors and reviewers who support his position or at least are anxious not to admit major changes into the landscape of applied linguistics. So I am obliged to present the other side of the issues here on my website.]


1. The large corpus and the language teacher

In 1991, a controversy arose at the Georgetown University Round Table on Languages and Linguistics during an interchange between Henry Widdowson and John Sinclair. After carefully analysing the two published papers and separately discussing the issues with each of the two linguists, I have concluded that their respective positions are closer together than the controversy might suggest. Widdowson seems to have argued from some positions which are not actually his, and attributed to his opponent some positions which are definitely not Sinclair’s.

A predictable crux of the controversy was how corpus evidence might relate to the ‘competence’ of native speakers on the one hand and to the needs of learners of English as a Foreign Language (hereafter EFL) on the other. As a noted spokesperson for applied linguistics in EFL, Widdowson (1991: 14) felt provoked by Sinclair’s typical criticisms, and cited this one: ‘we are teaching English in ignorance of a vast amount of basic fact’ (Sinclair 1985: 282). To be sure, Sinclair has not blamed the teachers, but the sources they are offered, such as dictionaries, viz:

Teachers and learners have become used to a diet of manufactured, doctored, lop-sided, unnatural, peculiar, and even bizarre examples through which, in the absence of anything better, traditional dictionaries present the language. It is perhaps the main barrier to real fluency. (1988: 6f)

Nonetheless, Widdowson seemed indignant that ‘linguists’ who have debarred ‘discrimination against languages’ should practice ‘discrimination against ideas about language’; and that ‘linguists have no hesitation in saying that certain ideas held by the uninformed commoner or language teacher are ill-conceived, inadequate, or hopelessly wrong’, and in ‘rubbishing the theories of colleagues with relish in prescribing their own’ (1991: 11). By these tactics, each linguist’s ‘point of view is sustained by eliminating all others, so that the diversity of experience is reduced in the interests of intellectual security’ (1991: 11).

My own detailed studies of the discourse of theoretical linguists (e.g. Beaugrande 1991) confirm Widdowson’s remarks. But we should make due allowance for the fact that theoretical linguistics has been largely an enterprise for replacing real language with ideal language existing nowhere except in some ‘linguistic theory’ (cf. Beaugrande 1997a, 1997b, 1998a, 1998b). In consequence, the major resources for rationally adjudicating theories or models become unavailable, and debaters merely contest that ‘my idealisation is better than yours!’ At that stage, ‘rubbishing the theories of colleagues’ and ‘eliminating’ other ‘points of view’ become prominent tactics.

The same mode of linguistics would naturally shower ‘haughty disapproval, not to say disdain’ upon the attempts of ‘applied linguists’ to ‘appropriate’ its ‘ideas’, as Widdowson (1997: 146) has more recently complained (see Beaugrande 1998b for a riposte posted on this website). This posture is not just the ordinary casual ‘disdain’ of authentic experts for ordinary people. It is the calculated defence of a sham expertise that could be severely imperilled by applications, e.g., ones that would quickly debunk Chomsky’s (1965: 33) straight-faced denial that ‘information regarding situational context’ ‘plays any role in how language is acquired, once the mechanism’ (the ‘language acquisition device’) ‘is put to work’ ‘by the child’.

So those earlier polemic tactics ensued from replacing real language with ideal language, whereas the arguments Widdowson was castigating here were being marshalled against this very replacement by Sinclair, as they have also been by Pike, Chafe, Firth, Halliday, Hasan, Schegloff, Roy Harris, and many others. Unfortunately, the reinstatement of real language at the rightful centre of modern linguistics cannot be achieved without strenuous ‘discrimination against ideas about language’ which really are ‘ill-conceived, inadequate, or hopelessly wrong’ but which have been enthroned by linguists whose ‘theories’ must be sustained by ‘rubbishing’ the others. And our own objective is just the opposite of ‘reducing’ the ‘diversity of experience’ ‘in the interests of intellectual security’; we are resolved to disrupt the unearned ‘intellectual security’ of linguists, theoretical or applied, who have indeed ‘reduced the diversity of experience’ of language and discourse and left us with a ‘trivial picture’ (Halliday 1997: 25).

Widdowson’s paper proposed a contrast between the two positions. Whereas the one claims ‘objectivity’ and ‘correctness’ in ‘descriptions of language’, the other adopts ‘the relativist or pluralist position on the nature of knowledge’:

The principles of equality and objectivity are comfortable illusions. Descriptions of language are not more or less correct but more or less influential, and therefore prescriptive in effect. They tell us less about truth than about power, about the privilege and prestige accorded to acknowledged authority. […] We cannot any longer be sure of our facts. It is not a very comfortable position to be in. (1991: 11f)

Despite the first person pronouns (‘us’, ‘we’), Widdowson avoided committing himself to this ‘pluralist position’,4 but he did imply that Sinclair opposes it by invoking ‘basic fact’ ‘about which teachers were previously ignorant’ (Widdowson 1991: 12).

Widdowson then posed the rhetorical question ‘what kind of fact is it that comes out of computer analysis of a corpus of text?’ (1991: 12). Characteristically, he did not answer it here or anywhere else in the paper by quoting a single ‘corpus fact’; at one point, he speculated on the ‘relative frequency’ of specific words without ‘having any evidence immediately to hand’ (1991: 17). Instead, he evoked the ‘distinction’ drawn between ‘externalised language’ versus ‘internalised language’ (1991: 12) by none other than Chomsky, the linguist who has memorably taken the most ‘relish’ in ‘rubbishing the theories of colleagues’ whilst ‘prescribing his own’. Moreover, Chomsky (1991: 89) has ‘doubted very much that linguistics has anything to contribute’ to ‘teaching’, as Widdowson (1990: 9f) has elsewhere acknowledged even whilst rating ‘Chomsky’s position as consistent with the position I expressed’ (but see below). The genuine opposition is still between real language versus ideal language, which, I have asserted, can seriously mislead the language teaching profession.

Widdowson (1991: 12-15) also invoked a further series of oppositions or dichotomies we might do well to deconstruct. These included ‘competence’ versus ‘performance’ (of course); ‘the possible’ versus ‘the performed’ (after Hymes 1972); ‘knowledge’ in ‘the mind’ versus ‘behaviour’ (Chomsky again); and ‘first person’ versus ‘third person perspective’ (Widdowson’s own theme, e.g. 1997: 158f), which should not be misconstrued as referring to the morphology of English verbs. Sinclair was reproached for conveying the ‘clear implication’ that the corpus is identical with the language, and for excluding the first pole of each opposition whilst allowing only for the second:

You do not represent language beyond the corpus: the language is represented by the corpus. What is not attested in the data is not English; not real English at any rate. […] what is not part of the corpus is not part of competence. […] What is not performed is just not possible. (Widdowson 1991: 14)

Against this supposed position of ‘the work of Sinclair and his colleagues’, Widdowson quoted Greenbaum (1988: 83) that ‘the major function of the corpus is’ ‘to supply examples that represent language beyond the corpus’. But this position is just as much Sinclair’s, e.g.: ‘language users treat the regular patterns as jumping off points, and create endless variations to suit particular purposes’ (Sinclair 1991: 492). His real position should concur with the notion that the collocability and colligability of the lexicogrammar of English are partly realised by the collocations and grammatical colligations of discourse and partially innovated against (Beaugrande 2000).

Sinclair was astounded to be stuck in the straw-man realist position of ‘what is not attested in the data is not real English’ and ‘what is not performed is just not possible’. If he held those positions, he would stop expanding the corpus straightaway because nothing more is ‘possible’ and because any differing data would be ‘not real English’, whereas he has in fact insisted, at times to the dismay of agitated project sponsors, that the corpus must be hugely expanded. He would also have to assume that the sources of his corpus are the linguistic equivalent of the sum total of all ‘possible’ sources, whereas he candidly asserts that a much wider selection of spoken data would have already been included but for severe problems of labour and expense.

The evolution of modern linguistics proffers an ironic context for another one of Widdowson’s (1991: 13) polarities: ‘Chomsky’s view is that you go for the possible, Sinclair’s view is that you go for the performed’. By any realistic measure, Chomsky’s programme has always gone for the impossible, advocating, with tireless self-confidence, one project after another that never materialises and never could: a ‘grammar’ that is ‘autonomous and independent of meaning’; a solution to ‘the general problem of analysing the process of “understanding”’ by ‘explaining how kernel sentences are understood’; an account of how human ‘children’ ‘acquire language’ by ‘inventing a generative grammar that defines well-formedness and assigns interpretations to sentences even though linguistic data’ are ‘deficient’ (1957: 17, 92; 1965: 201); and more than I have room to list here (for a thorough analysis of Chomskyan discourse, see now Beaugrande 1998b).

Here we can look to Hjelmslev (1969 [1943]: 17) for the most striking formulation, this one concerning the ‘possible’: ‘the linguistic theoretician must’ ‘foresee all conceivable possibilities’, including ‘texts and languages that have not appeared in practice’ and ‘some of which will probably never be realised’. Easy enough to say once you decide (as we saw Hjelmslev do) that ‘linguistic theory cannot be verified (confirmed or invalidated) by reference to any existing texts and languages’.

Chomsky (1965: 25, 27) fulfilled Hjelmslev’s vision in the most facile manner when he simply installed, by fiat, just such a ‘theory’ in the ‘language acquisition device’ of the human child: ‘as a precondition for language learning’ the child ‘must possess a linguistic theory that specifies the form of the grammar of a possible human language’ plus ‘a strategy for selecting a grammar’ by ‘determining which of the humanly possible languages is that of the community’. This is definitely not the position of Widdowson, who has firmly rejected the concept of ‘internalisation’ by means of a ‘universal Chomskyan language acquisition device’ (1990: 19).

The conception of the ‘possible’ is too abstract to be very useful for language pedagogy anyway. Learners of English as a non-native language produce many utterances which may not seem possible to the teacher’s intuition, but, as I have noted, we are currently finding new motives for doubting the reliability of intuition. Far more relevant is what is or is not both ‘possible’ and ‘performed’ at the learners’ current stage of skills and knowledge, since that is all we can realistically hope to build upon. There, we can productively orient our approach toward large corpora of learners’ English, such as have been collected by Sylviane Granger at the University of Louvain (cf. Granger 1996) and by John Milton at the Hong Kong University of Science and Technology (cf. Milton and Freeman 1996). Such data can also systematically alert teachers and learners to typical problems such as language interference.

Another of Widdowson’s polarities we might deconstruct is the one between ‘knowledge’ in ‘the mind’ versus ‘behaviour’, the latter term perhaps reminding language teachers of behaviourist pedagogy and Skinnerian behaviourism.5 But linking a large corpus with behaviour and behaviourist methods would be flawed for at least two reasons. The more obvious reason is that the behaviourist ‘audio-lingual’ method with its pattern drills and prefabricated dialogues was based on mechanical language patterns more than on authentic data; it equated language with behaviour in order to reduce language, whose relative complexity it could not grasp, to behaviour, whose relative simplicity seemed ideal for ‘conditioning’, ‘reinforcement’ and so on; and the method was backed up by heavy behaviourist commitments within general pedagogy and by the prestige and authority of American military language institutes, where ‘drills’ are literally the ‘order of the day’. Nor does Sinclair advocate a teaching method whereby learners parrot back corpus data; on the contrary, he has expressly counselled against ‘heaping raw texts into the classroom, which is becoming quite fashionable’, and in favour of having ‘the patterns of language to be taught undergo pedagogic processing’ (1996).

The more subtle reason is that corpus data are not equivalent to ‘behaviour’ in the ‘externalised’ sense which Widdowson’s polarities imply and which is often encountered in discussions of pedagogy, e.g., when a ‘syllabus’ ‘identifies’ ‘behavioural skills’ (Sinclair 1988: 175). Instead, they are discourse, and the distinction is crucial. External behaviour consists of observable corporeal enactments, of which the classic examples in behaviourist research were running mazes, pulling levers, and pressing keys. Discourse is behaviour in that externalised sense only as an array of articulatory and acoustic operations, or, for written language, of inscriptions and visual recognitions; and no one has for a long time — certainly not Sinclair — proposed to describe language in those terms, nor does a corpus represent language that way. When discourse realises lexical collocability and grammatical colligability by means of collocations and colligations, the ‘performed’ continually re-specifies and adjusts the contours of the ‘possible’. In parallel, ‘knowledge’ in ‘the mind’ decides the significance of the ‘behaviour’. Sinclair’s true position is that these operations are far more delicate and specific than we can determine without extensive corpus data. Moreover, analysing corpus data is less equivalent to observing behaviour than to participating in discourse.

We thus move on to deconstruct Widdowson’s polarity between ‘first person’ versus ‘third person’:

The description of internalised language requires a first person perspective. You really have no choice if you are seeking to prise knowledge out from the recesses of the mind: knowledge which is not realised as behavioural evidence available to the observer […] Corpus linguistics […] adopts the third person perspective and only describes what can be observed, [and so cannot] reveal […] ‘member categories’ […] of the speech community itself which account for their intuitions about the language. (1991: 15)

On the contrary: corpus linguists can reveal the ‘member categories’ they themselves hold and apply as ‘members of the speech community’ sharing what Sinclair (1991: 498) would call the ‘general acculturation’ of the intended ‘target reader’. They too are deeply concerned with ‘the pragmatic use of the language in the transaction of social business’ and ‘the interaction of social relations’, which Widdowson would reserve for ‘discourse analysis’6 while boxing ‘corpus linguistics of the COBUILD kind’ into a ‘text analysis’ concerned only with ‘performance frequencies’ (1991: 13).

Especially for data from public sources, corpus linguists can readily adopt a first person perspective as potential speaker (e.g., how I might use language to stress national prosperity, [14]); a second person perspective as a potential addressee (e.g., how I might react to a discourse about national prosperity when all I can see is isolated pockets of prosperity among the rich); and a third person perspective (e.g., how the general populace might be persuaded by such discourse to vote in the interests of the rich). All this resembles what ordinary speakers and hearers do, and discourse analysts as well. Having plenty of data can trigger intuitions that might otherwise lie untapped if you were just trying to ‘prise knowledge out from the recesses of the mind’, which sounds like shelling a stubborn walnut.

The ‘difficulty’ with ‘generative linguists’ ‘acting as their own informants’ and ‘drawing introspectively on their own competence’ is not just, as Widdowson (1991: 15) commented, that ‘they are also members’ of ‘the community of linguists with all its disciplinary sub-culture of different and incompatible attitudes and values’. A more severe difficulty is that these linguists have in effect disowned that membership in order to arrogate to themselves the authority of the ‘ideal speaker-hearer’. Thus, Chomsky has denied that the (presumably real) ‘speaker of a language’ ‘is aware of the rules of the grammar or even’ ‘can become aware of them’; so ‘a generative grammar attempts to specify what a speaker actually knows, not what he may report about his knowledge’ (1965: 8). By implication, linguists who assume the role of the speaker are claiming, simply by virtue of holding an academic degree in ‘linguistic theory’, to command superhuman powers for ‘becoming aware of and reporting’ what other speakers cannot. Presumably, the ‘kernel sentences’ they invent, like the man hit the ball, would in turn be perfect data; and these — or at least their ‘underlying’ order or ‘deep structure’ — would be far more suited to represent ideal language than real data would be.

So it would not be at all ‘disturbing for the claims of corpus linguistics if there were disparities between’ ‘what people indicate they would say in a given context’ and ‘what they actually do say in such contexts’ (Widdowson 1991: 17). Quite the contrary: compare Widdowson’s (1991: 17) view that ‘the correspondence between what people claim they would say and what they actually do say cannot be taken on trust’ with Sinclair et al.’s (1990: xi) view that ‘any such points emerging from a set of constructed examples could not, of course, be trusted’. Sinclair does not attribute this lack of ‘trust’ to people being ‘ignorant’ and ‘hopelessly wrong’, which Widdowson (1991: 11f) suggests he does; the obstacle is simply that many constraints upon what people say, as I have pointed out, only emerge during the actual discourse: what people do say and not just what they would say.

The final Widdowsonian polarity (one I already cited) we might deconstruct is between ‘internal’ or ‘I-language’ versus ‘external’ or ‘E-language’ appropriated from Chomsky’s more recent work. ‘I-language’ in Chomsky’s own sense is quite irrelevant to Widdowson’s argument, being a universal code which is common to all languages and which is not accessible to the interventions of language teachers because it is genetically and biologically installed and implemented in fine detail: ‘there is a highly determinate, very definite structure of concepts and of meaning that is intrinsic to our nature, and as we acquire language or other cognitive systems these things just kind of grow in our minds, the same way we grow arms and legs’ (Chomsky 1991: 66). Moreover, when Chomsky now ‘speculates’ ‘that there may be only one computational system and in that sense only one language’, his ‘radically different’ ‘post-1980s theories’ have ‘no constructions; there are no rules’, ‘that is, language-specific rules’ (1991: 81, 92). What Sinclair wants to describe and Widdowson surely wants to teach would still be ‘E-languages’, which Chomsky (1986: 25) has shrugged aside as ‘epiphenomena at best’. Similarly, if ‘the distinction between I-language and E-language description refers to what aspects of language are to be described’ (Widdowson 1991: 15), the description of a Chomskyan ‘I-language’ would be utterly useless for language teaching, which has to deal extensively with ‘language-specific rules’ and ‘constructions’; and, as noted, ‘I-language’ is not teachable at all.

Or again, Widdowson means something quite different than Chomsky does, and their ‘positions’ are not so ‘consistent’ after all (see above). Besides, even if ‘I-language’ versus ‘E-language’ are informally taken to designate what speakers know of their language versus what they say in the language, the distinction between the two could not be the same for native language learning (or ‘acquisition’), where extensive knowledge is indeed acquired without ordinary learning, versus non-native language learning, where that same acquired knowledge needs to be revised, often consciously, to accommodate knowledge of the non-native language (Beaugrande 1997b). And the same distinction might be unstable and inconsistent for the same speaker in different contexts and for different speakers in the ‘same’ context. The special qualities of corpus data indicate that this instability and inconsistency are a natural reflex of the huge range and variety of constraints emerging on the plane of the actual discourse (Beaugrande 2000).

So the evolving dialectic between ‘possible’ versus ‘performed’ in Hymes’ terms, or between ‘I-language’ versus ‘E-language’ in Widdowson’s (but not Chomsky’s) terms, or between Chomsky’s ‘competence’ versus ‘performance’ would best account both for the ‘fluency’ language teachers seek to instil and for the regularities in large corpus data. I can see no sound justification for cordoning off the two sides of any of these polarities in order to insulate the activities of teachers from those of corpus analysts, as Widdowson’s reservations seem to suggest even whilst, a bit inconsistently, he is accusing Sinclair of trying to discard the first term of each polarity.

In another source, Widdowson (1990: 18) has proposed yet another polarity between what language learners know versus how they perform: ‘acquisition having to do with knowledge’ versus ‘accuracy having to do with behaviour’. The first term is problematic: for Halliday (1973: 24), ‘acquisition’ is a ‘misleading metaphor, suggesting that language’ is ‘property to be owned’. The term was mainly promoted when generative linguists decided to invent an account which was pointedly distinct from ‘learning’, a distinction later exploited by Krashen to discredit established methods of language learning by reciting his airy claim that ‘learning cannot become acquisition’ (e.g. Krashen 1985: 22, 24, 41, 48, 55) (see now Beaugrande 1997b). The second term is problematic too insofar as corpus data indicate that many of the detailed decisions on the plane of the actual discourse are not properly determined by criteria of ‘accuracy’ but of ‘appropriateness’ as defined by Hymes and cited by Widdowson (1990: 13) among the criteria belonging to ‘E-language’, whereas Widdowson apparently consigns ‘knowledge of language’ to ‘I-language’; besides, criteria of ‘accuracy’ can have the practical effect (noted below) of ranking conformity high above creativity. Perhaps we might agree to distinguish instead between a person’s ‘language capability’ and ‘language achievement’; or between ‘known options’ versus ‘selected options’; or between ‘available regularities’ versus ‘on-line decisions’.

A further polarity in that same other source cited Bialystok and Sharwood-Smith’s (1985) ‘difference between knowledge of language’ versus ‘the ability to access that knowledge effectively’, with the implication that the ‘variation may either be because these forms are tied in some way to a particular kind of context and so are not freely transferable or because the second context imposes inhibiting conditions which prevent learners from accessing and applying what they know’ (Widdowson 1990: 18). This position sounds reasonably compatible with Sinclair’s, since corpus data are quite helpful for telling in fine detail which ‘forms are tied to a particular kind of context’, and indeed suggest that such ‘tyings’ are the rule rather than the exception, at least in English.

And precisely this fine detail may be a submerged crux of the language teaching controversy, hinging upon an inclination of foreign language teaching, and one Widdowson himself opposes, to ‘set a high premium on correctness’: ‘the imposition of correctness’ ‘has the effect of inhibiting the learners’ engagement of relevant procedures for mediation acquired through an experience of their own language’ (Widdowson 1991: 121, 124). Learners may arrive at the intimidating misconception that there must be a ‘correct’ answer or ‘rule’ for every detail, and may besiege the teacher to tell them what it is, as reported by Kovai (1998) for teaching English in Slovenia. This practice concurs only too well with a ‘linguistic theory’ wherein ‘language consists of a set of rules for the combination of words into well-formed and meaningful sentences’ (Sinclair and Renouf 1984: 76; cf. Beaugrande 1998b).

The crux would now revolve around the danger of corpus research getting misinterpreted (to stay with Widdowson’s terms) as demonstrations of the accurate things language learners must say rather than the appropriate things the learners should take as their framework of orientation for what they say. Only then would the teaching and learning of EFL be saddled with the doomed precept that ‘what is not attested in the data is not real English’. If this be Widdowson’s real anxiety, it would be heartily shared by Sinclair and his team, witness the Collins COBUILD English Grammar ‘that contains a lot of productive rules; these rules are not restrictive, they are not “do not” rules; they are “try this one” rules where you can hardly go wrong’ (Sinclair 1991: 493; cf. Sinclair et al. 1990: 493).

Moreover, the same anxiety might profoundly disturb language teachers about large-corpus data if they viewed these as a colossal compilation of fine-grained ‘prescriptions’ that must be ‘drilled’ into the learners on top of the usual ‘grammar’ and ‘vocabulary’. Sinclair has on numerous occasions espoused the opposite view, viz.:

More adequate description will so organise the detail that it largely falls in line with the meaning, and becomes easy, rather than difficult, to learn. If the grammatical choices turn out in the main to be also lexical choices, then a massive simplification can be expected. If on top of that, grammar is seen as a springboard for creativity rather than as an instrument of social discipline, the pleasure to teaching and learning can increase enormously. (Sinclair 1991: 497)

These prospects reinforce the advocacy repeatedly lodged in my own paper against the separating of ‘grammar’ from ‘vocabulary’, which pulls the unity of the language apart. Francis and Sinclair (1994: 200) in turn warn against ‘presenting learners with syntactic structures’ and ‘then presenting lexis separately and haphazardly as a resource for slotting into these structures’; ‘we should not burden learners with vast amounts of syntactic information on the one hand, and lexical (“vocabulary”) information on the other, which they then have to match according to principles which are not naturally available to them as non-native speakers’.

Nor again should the relative frequency statistics in corpus data be misinterpreted as the degrees of obligation for teachers to prescribe and enforce the various usages. Such could be one implication of Widdowson’s (1991: 20) reservation that ‘language prescriptions for the inducement of learning’ ‘cannot be modelled’ on ‘the frequency profiles of text analysis’. He notes that language teaching may have sound reasons for presenting data ‘because they are useful, not because they are frequently used’ (1991: 20), and that artificially simplified data would be fully admissible under this provision. Sinclair, in contrast, would recommend simplifying language teaching by restricting the presentation of artificial data, so that learners are not led to overgeneralise for want of knowing the authentic constraints. This recommendation is reasonably compatible with some positions adopted by Widdowson elsewhere, e.g.:

there is a great deal that the native speaker knows of his language which takes the form less of unanalysed grammatical rules than adaptable lexical chunks. [they] are, of course, subject to differing degrees of sentence modification. At one end of the spectrum, we have fixed phrases that cannot be dismantled; at the other end, we have collocational clusters which can be freely adjusted as sentence constituents. […] native speakers do not exercise the creative potential of syntactic rules to anything like their full extent [and] indeed if they did so they would not be accepted as exhibiting native-like control of the language [cf. Pawley and Syder 1983]; anybody producing these syntactic variants of fixed idiomatic phrases would nevertheless be adjudged incompetent in the language. (1989: 132f)

Here again, corpus data could offer language teachers handy ways for estimating the status of their grammatical and lexical materials along the parameter between ‘fixed phrases’ versus ‘collocational clusters’.

‘Widdowson’s point about unpredictable gaps in corpora’ (Sinclair 1991: 493) needs further clarification too. Just as an array of choices in a corpus can, as a whole, be highly improbable or even unique in a statistical sense, there will be many arrays which do not happen to show up in a corpus but which could be readily produced and comprehended by native speakers of the language. Yet insofar as these arrays are related to productive regularities that are implemented in the corpus data, they do not properly constitute ‘gaps’. Sinclair (and I) would predict that in a corpus of the size of the Bank of English, all of the really significant productive regularities of English will be represented, but also that we will always find ‘patterns for which there is some evidence, but insufficient to make a conclusive case for significance’ (1991: 491). The gravity of this problem should steadily recede as the corpus arrives at higher orders of magnitude. At that stage, I would be surprised if we discover regularities (not specific wordings like flip-flop or roger) which both are not represented in corpus data and yet are judged essential by teachers of EFL.

Clarification might be helpful once more when Widdowson (1991: 18) asserted the ‘intuitive significance’ and ‘psychological reality’ of ‘kernel sentences’, which ‘may not be authentic as units of behaviour’, but which ‘are the stock in trade of language teaching’. As with ‘I-language’, Widdowson must be using the term in a looser sense than Chomsky (1957: 106f), for whom the ‘kernel of basic sentences’ must be ‘simple, declarative active with no complex verb or noun phrases’. By this definition, the ‘stock in trade of language teaching’ would be to feed learners on invented data like the man hit the ball and the cat sat on the mat, but not the next man at bat was hit by a knuckle ball, or the striped cat continued to sleep on the mat, let alone innocuous authentic data like the lion and the unicorn were fighting for the crown, black sheep, have you any wool?, or Polly, put the kettle on! Surely Widdowson meant simple sentences, and only they have genuine ‘intuitive significance’ and ‘psychological reality’. He may be concerned lest corpus data not be appropriately simple for the earlier stages of EFL; but the regularities most simply implemented in such sentences are of course represented in corpus data as well.

Perhaps Widdowson would be content if a specially selected corpus of appropriate data could be compiled to fit the levels of simplicity he would recommend. At least, in a recent discussion (January 1997) he approved of my proposal (elaborated in Beaugrande 1998a) to offer both teachers and learners access to browse through strategically selected and sorted ‘model corpora’, guided by user-friendly walk-throughs. They could work together in exploring for themselves not just contemporary English and other languages, but specific social, regional, and registerial varieties of a language, including ones being spoken as non-native languages in relevant pedagogical, academic, or professional settings. Learners could also receive user-friendly rough-and-ready training for working together in describing the regularities they can find in the data. Here, I would advocate replacing the traditional term and concept of rules, which has accumulated far too much prescriptive and authoritarian baggage, with the term and concept of reasons. The replacement would be sound both on grounds of theory, because speakers certainly do not follow ‘rules’ in the sense of either traditional or formalist ‘grammar’ for every choice they make but nearly always have ‘reasons’; and on grounds of practice, because ‘rules’ carries disempowering connotations of authorities, compulsions, violations, and punishment. Learners should be reassured that they are basically ‘reasonable’ and deserve to know the ‘reasons’ why they should do or say things, and to have their own ‘reasons’ respected. Moreover, we would help to rebalance creativity with conformity, since appropriate contexts supply good reasons to choose creatively on the basis of a steadily more ‘delicate’ sensitivity toward the typical collocations and colligations.

Browsing through a learner-oriented corpus at one’s own pace and on one’s own initiative might finally eliminate much of the stress, anxiety, and indifference fostered by conventional language teaching with its focus on ‘accuracy’ or ‘correctness’. The learners who actively invest their creativity in discovering other people’s ‘reasons’ could thus gain substantial initiative and authority during the overall process of learning, with a matching rise in interest and motivation as compared to the passive, alienating, and mechanical application of ‘rules’ laid down by teachers or textbooks.

A fascinating prospect would be to make the enterprise cumulative. Advanced learners could guide the newcomers through the browsing procedures and share their own results. Also, the total results could be accumulated in a data base which could eventually serve to formulate the first learner-generated grammar and lexicon in the history of language education. Such a work would be an impressive implementation of the principle of learners taking charge of their own learning processes, long advocated by democratic educators like Paulo Freire (1985 [1970]).

Co-operative browsing might be an excellent activity for dispelling the misunderstandings and anxieties language teachers may harbour about large-corpus data. The misunderstandings I wish to dispel here concern the positions attributed to John Sinclair. He by no means asserts that any corpus, however large, equals the total or ‘real English’; or that the ‘performed’ equals the ‘possible’. What he does assert is that the difference between those data and regularities which are found in a very large corpus versus those which are not should be significant for people who purport to make authoritative statements in textbooks or reference works about ‘real English’, especially when addressing learners of English who will try to put the statements into practice. Sinclair also asserts that the same difference is significant for the competence of adult native speakers, who are likely to say combinations that are frequent in the corpus and are unlikely to say combinations that are infrequent or do not occur, although they certainly can say the latter in appropriate contexts. Such speakers have an intuitive sense of which combinations are common, sensible, useful, and so on, without at all implying that others are ‘just not possible’ or ‘not real English’. Their ‘immediate intuitive response this is part of competence and of a well ordered view of language’ (Sinclair 1996).

Furthermore, Sinclair asserts that the data and regularities which do appear frequently in a large corpus should be relevant and interesting for teachers and learners of English as a native language and even more as a foreign language. And finally, he asserts that taking corpus data into account could improve the quality of English world-wide because non-native learners would have much more detailed models and targets to aim for (Sinclair 1996).


2. conclusion and outlook

I have tried to explain why some major ‘revisions’ are on the cards for both theoretical linguistics and applied linguistics, and why, rather than ‘fearing for our future work’, we may justly sustain some refreshing optimism. I have suggested that many important problems facing our work in both theory and practice have been artificially fostered by ill-advised moves to replace real language with ideal language. A natural and unfortunate consequence has been the symptomatic ‘antipathy to data’, which Sinclair (1997: 8) invokes, and which may now mislead language teachers about the vital opportunities offered by finally having access to vast amounts of authentic language data. We might ponder Sinclair’s (1994) allegory of the church authorities who refused to look through Galileo’s telescope lest they see that the earth is not the eternal centre of the universe; so also might language authorities refuse to work with corpora lest they see that their ideal ‘language’ (or ‘I-language’) is not the eternal centre of human ‘competence’ or the true sphere of ‘linguistic universals’.

Most importantly, perhaps, large-corpus data can provide a really effective counter-weight for the deeply ingrained insecurity many speakers have about the real language they themselves produce, whether native or foreign. Corpus data reveal how skilled ordinary speakers actually are; and how the real language they produce is, as Sinclair (1991: 492) writes, ‘exhilarating creative, marvellously unpredictable, wayward, unruly, quite incredibly productive’.



1 I am deeply indebted to John Sinclair for discussing a number of the issues raised here and for providing access to his Bank of English terminal and to his unpublished materials. I also profited from discussions with Henry Widdowson, Michael Halliday, Sid Greenbaum, Clive Holes, Elena Tognini-Bonelli, Jeremy Clear, John Milton, and Nigel Turton.

2 Ironically, ‘langage’ was precisely Saussure’s term for ‘speech’, as compared to ‘parole’ (translated as ‘speaking’)!

3 On these terms, compare already Firth (1968); Greenbaum (1974). The term ‘preferences’ is elaborated in Louw (1993); Sinclair (1994). Sinclair’s term ‘prosodies’ for ‘the attitudinal meanings that emerge once you extend the phrase sufficiently far, the point where the surface patterns of language give way to meaningful choices’ (1996) could be misunderstood as referring to intonation.

4 Yet Widdowson seemed a bit inconsistent later: ‘discourse analysts tend more and more towards the relativism’, and ‘to the extent that they favour direct confrontation with actual data, they make common cause with the text analysis of corpus linguistics’ (1991: 16). In my view, a restrictive separation between ‘discourse analysis’ and ‘text analysis’ hardly seems justified nowadays; but Widdowson might well think so (compare Note 6).

5 Such could be one reading of Sinclair’s remark about ‘dealing with uncomfortable material’ ‘by tying it to a discredited methodology’ (1991: 490).

6 Widdowson certainly has his own special views on what ‘discourse analysis’ should be (it was the topic of his unpublished thesis at university) and has recently signed a contract to write a book about it.




