‘Corporate Bridges’ Twixt Text and Language:
Twenty Arguments against Corpus Research
And Why They're a Right Load of Old Codswallop
Robert de Beaugrande
Every time a new sockdolager of a word come along and I learnt where she orter
fit in to make sense it kind o’ tickled me all over.
—
Don Marquis, Danny's Own Story
A.
‘Language’ versus ‘data’: bridged or unabridged?
1.
The mild pun on ‘corporate’ in my title may be tolerated for supplying a
handy adjective — ‘relating to corpora’ — whilst acknowledging the
‘corporate’ funding of research in its justified anticipation of new
reference works. Now that a whole generation of such works has transformed the
commercial market, the signals seem clear enough for a methodological revolution
in our practices, such as compiling dictionaries for the purposes of learners of
English; but the signals are far from clear for a scientific revolution in our
theories, such as defining ‘language’ for the purposes of linguists, whether
theoretical or applied (§ 12, 34, 166).
2.
These mixed signals are not too surprising. Since the first emergence of the
academic discipline, linguistic theory has been beset with substantive problems
regarding the question of whether and how to define and describe ‘language’
through some bridge to an actual or potential source of data in texts (survey in
Beaugrande 1991). For a retrospective overview, the alternative outlooks might
be summarised in these terms:
2.1.
Language is best represented by the largest and broadest corpus of authentic
data that can be collected and described. This view is prominent in fieldwork
linguistics (e.g. Longacre 1964 [1958]) with its ties to ethnography (language
as culture), and is urgently needed when working with a language which the
linguists do not know and which has not been previously described (§ 32).
2.2.
Language can be represented by such a corpus, but doing so is not obligatory,
and can be supported by practical shortcuts with non-authentic data, assuming
that the same results would be obtained with authentic data. This view is
prominent in descriptive linguistics (e.g. Bloomfield 1933) with its ties to
behaviourism (language as habit), especially when working with a language which
the linguists do know and which has been previously described.
2.3.
Language need not be described from a corpus at all; linguists can safely rely
on their own intuition and introspection as native speakers to supply them with
data. This view is prominent in generative linguistics (e.g. Chomsky 1965) with
its ties to idealist philosophy (language as mind). Here, linguists can work
only with a language they know.
2.4.
Language is an abstract, ideal system not directly manifested in data, and so
must be deduced by formal or logical means. This view is prominent in
glossematics (e.g. Hjelmslev 1969 [1943]) with its ties to formal philosophy
(language as calculus). Here, linguists can work without reference to any
particular language, e.g., in research on ‘universal grammar’.
2.5.
Language is a delicate system menaced by errors and abuses, and so must be
described as it ought to be used rather than how it is. This view is prominent
in prescriptive linguistics (e.g. Alford 1864), with its ties to social elitism
and ‘conservative’ politics (language as refinement). Linguists work only
with a carefully purified version of a language they know very well.
These outlooks roughly fall along a parameter where a bridge from language to
authentic data is expressly required at one pole, and expressly dismissed at the
other pole. Only the prescriptive outlook juts out awkwardly, accepting data but
only if they are certified to be ‘correct’, ‘educated’, ‘elegant’, and so
forth. Many of its adherents are not accredited linguists, and might better be
called ‘language guardians’. Even so, it is at least implicitly the dominant
outlook among the general population and has engendered a gallery of
self-serving pot-boiler handbooks with cautionary titles like 1001
Pitfalls in English Grammar.
3.
We could plausibly predict that these outlooks will produce respective
descriptions of language that differ substantially from each other; and that the
differences within a single outlook will be more substantial wherever a theory
is not bridged to authentic data — like a house of cards with no driveway, or
a castle in the air with no drawbridge. ‘Theories’ will abound where they
can be devised out of whole cloth, like clothing fads in fashion, by strenuous
theoretical bootstrapping (§ 33).
4.
In the purview of science at large, linguistics markedly stands out for its
periodic resolves to get by without data. Perhaps daunted by a vision of
language data being ‘unabridged’ like the contents of those massive
dictionaries, some linguists have expressly devised or embraced theories to show
why the discipline need not, in principle, sustain a bridge between theory and
data. Saussure (1966 [1916]: 9, 11) already asserted that ‘speech cannot be
studied’, ‘for we cannot discover its unity’; it is only a
‘heterogeneous mass’ of ‘accessory and accidental facts’. In the same
vein, Chomsky (1965: 4, 201) later asserted that the ‘observed use of
language’ ‘surely cannot constitute the subject-matter of linguistics, if
this is to be a serious discipline’; ‘from the standpoint of the theory’,
‘much of the actual speech observed consists of fragments and deviant
expressions of a variety of sorts’. And both linguists found a large and
ardent following.
5.
The key qualifier here is ‘from the standpoint of the theory’: the theory
is what tells us our data are ‘deviant’. Although Chomsky was purportedly
discussing the ‘theory constructed by the child’ learning a language (1965:
201), the real constructor implied here was surely the generative linguist. One
curious consequence of this outlook is that both the child learning the language
and the linguist describing it would be working against the grain of available
data and in despite of the ‘actual speech observed’. Conversely, any
description of a language directly based upon and confirmed by data would be
inadequate a priori, irrespective of the size and sources of the sample.
6.
This consequence is reflected by at least four central precepts in the
generative outlook on ‘linguistic theory’: (1) children succeed by means of
an ‘innate language acquisition device’; (2) ‘unquestionable data’ are
to be produced and judged by the linguist’s own ‘intuition’ as a ‘native
speaker’; (3) language data need to be ‘transformed’ or ‘formalised’
in order to be investigated scientifically; (4) language should be described by
analogy to a more abstract and formal system, such as ‘context-free grammar’
(see Beaugrande 1998, 2001a, for detailed documentation). Each precept plots the
circuitous, evasive routes between theory and data after the direct bridge has
been closed off.
7.
A second and even more curious consequence is that
the production of data in a real speech community would resemble a
‘catastrophe’ — a ‘sudden violent change representing a discontinuous
response of a system to smooth changes in the external conditions’ (Arnold
1984: 2). The speaker accesses order and equilibrium (language), transforms it
into disorder and disequilibrium (speech) and transmits the data to the hearer,
who transforms them back into order. Communication would be highly ‘noisy’
in the sense of electrical engineering; and perilously ‘far from
equilibrium’ in the sense of complexity theory. Ambiguity and vagueness should
abound (cf. § 122).
8.
This second consequence makes a difficult and perilous enterprise out of
ordinary language use, and not just of learning or describing the language. But
the consequence has been evaded by an expedient inconsistency. On the one hand,
the linguist was declared to command an ‘enormous mass of unquestionable data
concerning the linguistic intuition of the native speaker, often himself’
(Chomsky 1965: 20). On the other hand, the ‘speaker of a language’ was declared
incapable of being or becoming ‘aware of the rules of the grammar’, so his
‘reports and viewpoints about his behaviour and competence may be in error’
(1965: 8). We thus come to a third curious consequence: generative linguists
must command a superhuman rationality for ‘becoming aware of and reporting’ what
real speakers cannot, and are thus the humans best empowered to reveal the
‘competence’ of the ‘ideal speaker-hearer’ in a ‘completely homogeneous
speech-community’ who ‘knows the language perfectly’, which Chomsky (1965: 3)
has famously vowed to be the ‘primary concern’ of ‘linguistic theory’.
9.
The downside of this empowerment is that such linguists seem qualified to do
only that. Halliday (1984: 51) has accordingly critiqued the ‘assumption’
that ‘the only job for which a professionally trained linguist was
fitted is to go back and train more linguists’ ‘in a university linguistics
department’, ‘insulated from the real world’. Those departments feel
entitled to ‘dismiss questions raised by “non-linguists”’ — i.e.,
‘the rest of humanity’ — who are deemed ‘prejudiced and ill-informed’;
yet ‘behind the questions lies a concern with real issues of social value’
and ‘effective communication’.
10.
The privileged status of professional linguists was at all events implicitly
claimed by their common practice of inventing their own data to illustrate their
particular notions, such as the evergreens [1-3].
[1]
The man hit the ball.
[2]
John is eager to please.
[3]
John is easy to please.
Samples like [2] and [3] were designed to show that sentences apparently having
the same surface structure differ in their ‘underlying’ structure (John pleasing
or getting pleased). Other samples were designed to accentuate the ‘distinctions
between well-formed and deviant’, ‘corresponding to the intuition of the
speaker’ (Chomsky 1965: 24), e.g., [4] versus [5] (Chomsky 1957: 42, 78).
[4]
John admires sincerity.
[5]
John frightens sincerity.
But a subtle objection might be raised here. The task of distinguishing between
events versus
non-events is surely an oddity for any science, insofar as a set of non-events
lacks any systemic organisation. Nobody seriously expects meteorology to show
why rain doesn’t fall upwards; or geography to explain why the earth is not
flat; or astronomy to prove that the earth is not the centre of the universe; these
sciences explain real and possible events rather than the impossible non-events.
Yet generative linguistics seemingly proposed to explain why ‘grammars’
preclude wildly ungrammatical or ill-formed data. Surely the set of such
linguistic non-events would be the true ‘heterogeneous
mass’ of ‘accessory and accidental facts’ that we have seen Saussure
imagining to comprise the real events of ‘speech’ (§ 4) — and may
actually be infinite, which a set of real events never is (cf. § 65ff).
11.
Working with large corpus data for the past seven years has impelled me to grasp
an even more subtle objection.
Much of the data presented as non-deviant — as ‘grammatical’,
‘well-formed’, and so on — also do not occur as real events. Not even in
the Bank of English (BoE), the world’s largest corpus of authentic texts, nor
in the British National Corpus (BNC), did I find a single occurrence of samples
[1] through [5]. Such trivial data are evidently possible but not probable, and
this factor might tell us something significant about human language. For
example, speakers or writers of English don’t just say that somebody
‘admires sincerity’, full stop; but rather that they ‘admire the
sincerity’ of a particular person on a particular occasion, as when desperate Valancourt
confessed to Emily St. Aubert that he was ‘irreparably ruined’ by his
‘debts’:
[6]
Emily, while she was compelled to admire his
sincerity, saw, with unutterable anguish, new reasons for fear in the
suddenness of his feelings (Mysteries of Udolpho)BAWC
[= British and American Writers Corpus data]
And
we don’t normally just say ‘the man hit the ball’ but a deal more about
who, why, and how:
[7]
Leconte, by contrast, hit
the ball with the joy of a player savouring rare moments free of physical
pain and won, 6-4 (Independent)BNC
[= British National Corpus data]
[8]
A back pass from player-manager
Hoddle seemed to catch Hammond by surprise and the goalkeeper hit
the ball straight to the feet of Posh’s Disappointed Swindon boss. (Today
Sports Page)BNC
[9]
All of life, as we know it, moves in little, unavailing circles. More justly than to
anything else, it can be likened to the game of baseball. Crack! we hit
the ball, and away we go. If we earn a run in life we call it success
(Whirligigs)BAWC
[10]
I ’lowed I’d knock that durned little ball way over into the next county. So I
rolled up my sleeves and spit on my hands and got a good holt on that war club
and I whaled away at that little ball agin,
and by chowder I hit it. I knocked it clear
over into Deacon Witherspoon’s pasture, and hit his old muley cow, and she got
skeered and run away, jumped the fence and went down the road, and the durned
fool never stopped a-runnin’ till she went slap dab into Ezra Hoskins’
grocery store, upsot four gallons of apple butter into a keg of soft soap, and
sot one foot into a tub of mackral, and t’other foot into a box of winder
glass (Uncle Josh’s Punkin Centre Stories)BAWC
Further on, I shall suggest that non-authentic data are distinguished by static
predictability, and authentic data by a dynamic tension between predictable and
unpredictable (§ 15, 73, 79ff, 129). A bit paradoxically, data like [1-4] are
so strenuously made to seem probable that they flip over to improbable.
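The attestation check described in this paragraph can be pictured as a plain phrase search that returns each hit with some surrounding context. What follows is only a minimal sketch, assuming a toy corpus of two snippets quoted from samples [7] and [9]; the function name `concordance` and the dictionary `toy_corpus` are illustrative inventions and do not reproduce the query tools of the BoE or the BNC.

```python
import re

def concordance(texts, phrase, width=30):
    """Return (source, left context, match, right context) for each occurrence of phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for source, text in texts.items():
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((source, left, m.group(0), right))
    return hits

# Toy stand-in for a corpus (an assumption for illustration only):
# two snippets quoted from samples [7] and [9]
toy_corpus = {
    "Independent": "Leconte, by contrast, hit the ball with the joy of a player "
                   "savouring rare moments free of physical pain",
    "Whirligigs": "Crack! we hit the ball, and away we go.",
}

invented = concordance(toy_corpus, "the man hit the ball")  # the invented sample [1]
attested = concordance(toy_corpus, "hit the ball")          # the attested collocation

for source, left, match, right in attested:
    print(f"{source}: ...{left}[{match}]{right}...")
```

On this toy material the invented sentence yields no hits, whereas the shorter string turns up in both snippets together with the richer wording around it. A real corpus would of course need tokenisation and far larger texts.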
12.
The prospect impends that linguistics may have set itself a task truly without
precedent in the annals of science: defining our object of investigation, namely
language, by contrasting two sets of non-events. Not only has no
remotely complete ‘explanation’ or ‘grammar’ of this type ever been
published for any human language; the task is inherently impossible. If so, the
so-called ‘generative revolution’ was more properly an ‘anti-scientific
revolution’, and the time has come for a genuine ‘scientific revolution’
that will restore the reality of language (§ 33f).
B.
Some data parameters: authentic, rich, literary, academic
13.
From the standpoint of corpus linguistics, which, by definition, works with
authentic data, the staid dichotomies of ‘langue and parole’ or
‘competence and performance’ are basically irrelevant because, strictly
speaking, they imply a dichotomy of non-data versus data insofar as
neither ‘langue’ nor ‘competence’ is manifested as data. Saussurians
have evaded this implication by a double tracking: they have vowed
to ‘deal only with linguistics of language’ [langue], yet to ‘use material
belonging to speaking [parole] to illustrate a point’ (Saussure 1966 [1916]:
19), which suggests that data might also occur somewhere else besides in
‘parole’ but doesn’t say where. Chomskyans more expediently ‘assumed that the
set of sentences is somehow given in advance’ (Chomsky 1957: 18, 54, 85, 103).
To be more precise, that ‘ideal speaker-hearer’ would never say anything at all
(which would entrain him in the ‘deviant’ conduct of
‘performance’); he would stand transfixed in rapturous ‘introspection’
upon the ‘infinity’ of ‘well-formed sentences’ hovering in the
‘perfect’ nirvana of his ‘competence’.
14.
For us, the most relevant dichotomy should be between authentic data versus
non-authentic data: whether the data are attested by actual occurrence in text and
discourse. Not surprisingly, this dichotomy has hardly figured in linguistic
approaches built chiefly on non-authentic, non-attested data. No doubt the
long-standing convention of idealising language has encouraged the notion that
we can study it best with idealised data. Yet the very fact that we can
intuitively recognise non-authentic data as such should indicate their
exceptional status and argue against their being a valid representation of the
language. Our parameters would not impose a boundary between ‘grammatical’
versus ‘ungrammatical’ sentences, but would seek to describe the parameters
of authenticity among data which are all unquestionably grammatical.
15.
One influential parameter here could be termed rich data versus
sparse data, where ‘richness’ denotes the potential of a context to
determine the meaning of some term. We can finally shelve the projects of
linguistics to provide a ‘context-free’ description (still envisioned, say,
in Cook 1992; Keenan 1993), e.g. for
‘describing the structure of a sentence in isolation from its possible
settings’ (Katz and Fodor 1963: 170). Data actually freed of all contexts and
settings would no longer be language nor data, but merely symbol strings of the
kind displayed in inscriptions of an undeciphered dead language. The act of
recognising language as data is inseparable from the act of imagining
‘possible settings’ (§ 71). Even Katz and Fodor do just that with their
non-authentic sparse-context example [11]
[11]
The bill is large.
by imagining whether one is dealing with a hefty payment request or a bulky bird.
The quest for the ‘structure of semantic theory’ can only defeat itself by
maximising sparseness, as if we were required to explain whatever a space
traveller might mean who lands on earth, says nothing but ‘the bill is large’,
and then is instantly transported to the planet Tatooine by Jabba the Hutt.
We need to explain what actual speakers might mean by collocating ‘large’
with ‘bill’, e.g., for an exorbitant charge [12] (the only meaning attested
in the BNC); a hefty menu complete with hefty prices, [13]; a banknote of high
denomination [14]; or a wall poster [15].
[12]
Budgeting loans are not available for gas and electricity bills. If you have a large
bill which you cannot pay you may be able to go on the fuel direct scheme (Age
Concern)BNC
[13]
The large bill of fare held an array of dishes sufficient to feed
an army, sidelined with prices which made reasonable expenditure a ridiculous
impossibility (Sister
Carrie)BAWC
[14]
Merriam had his bank balance of $2,800 in his pocket in large bills, and
brief instructions to pile up as much water as he could between himself and New
York. (Whirligigs)BAWC
[15]
He always prints, I know, ’cos he learnt writin’ from the large bills
in the bookin’ offices. (Pickwick Papers)BAWC
These
data are ‘rich’ in the sense that we can in each case determine a
distinctive meaning of our collocation, even though a portion of the data must
be unpredictable (cf. § 11, 73, 79ff).
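Retrieving the contexts in which ‘large’ collocates with ‘bill’ can be sketched as a co-occurrence search within a small span of words. This is a rough illustration under stated assumptions: the function `cooccurrences`, the span of four tokens, and the crude prefix matching (so that ‘bill’ also catches ‘bills’, but would wrongly catch ‘billion’) are hypothetical choices, and the three snippets are drawn from samples [12] and [14].

```python
import re

def cooccurrences(sentences, node, collocate, span=4):
    """Return the sentences where collocate occurs within span tokens of node.

    Matching is by prefix, a crude stand-in for proper lemmatisation.
    """
    found = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        node_pos = [i for i, t in enumerate(tokens) if t.startswith(node)]
        coll_pos = [i for i, t in enumerate(tokens) if t.startswith(collocate)]
        if any(abs(i - j) <= span for i in node_pos for j in coll_pos):
            found.append(sentence)
    return found

# Snippets quoted from samples [12] and [14]
samples = [
    "Budgeting loans are not available for gas and electricity bills.",
    "If you have a large bill which you cannot pay you may be able to go "
    "on the fuel direct scheme",
    "Merriam had his bank balance of $2,800 in his pocket in large bills",
]

hits = cooccurrences(samples, "bill", "large")
for sentence in hits:
    print(sentence)
```

The search only delivers the rich contexts; deciding whether a given hit means a charge, a banknote, or a poster still rests on reading them.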
16.
Another influential parameter could be termed literary data versus
non-literary data. My own corpora of British and American Writers (BAWC), whose
construction I shall briefly describe later on (§ 115ff), contain mostly texts
that would be
labelled ‘literature’ for purposes such as library catalogues. I have
elsewhere proposed to define literature as socially accredited discourse about
alternative worlds, which we can compare and contrast with our notions of our
own (e.g. Beaugrande 1988). This principle of ‘alternativity’ underwrites
the human validity of fiction despite its not being ‘fact’: it uses
imaginary people and events to convey statements about the human situation.
17.
For this reason, most literature sustains a ‘world-creating’ potential by
providing a rich background and setting for its audiences. Authors feel
encouraged to present rich discourse frames telling how things were said instead
of just reporting the words, e.g.:
[16]
‘Tis because you are an indifferent person’, said Lucy, with some pique, and
laying a particular stress on those words, ‘that your judgment might justly
have such weight with me’. (Sense and Sensibility)BAWC
[17]
‘Step this way, if you please!’ I repeated, in so determined a manner that
he could not, or did not choose to resist its authority. (Tenant of Wildfell
Hall)BAWC
Representing such data without the frames would lose some of the information
conveyed by the frames (§ 118).
18.
A final parameter I would propose would be academic data versus non-academic
data: whether or not a text is produced in or for some institution of ‘higher
learning’ or ‘research’. Academic texts frame academic sources by their prestige
and conviction rather than their manner of speech, e.g. [18]; and construct
periodic sentences like tapestry, better suited to the eye than the ear, e.g.
[19]. Such samples would be difficult to imagine anywhere but in academic data.
[18]
The Hon. and Rev. W. Herbert, afterwards Dean of Manchester, in the fourth
volume of the Horticultural Transactions, declares that ‘horticultural
experiments have established, beyond the possibility of refutation, that
botanical species are only a higher and more permanent class of varieties’.
(The Origin of Species)BAWC
[19]
On the throne of Samarcand, Timour displayed his magnificence and power;
listened to the complaints of the people; distributed a just measure of rewards
and punishments; employed his riches in the architecture of palaces and temples;
and gave audience to the ambassadors of Egypt, Arabia, India, Tartary, Russia,
and Spain, the last of whom presented a suit of tapestry which eclipsed the
pencil of the Oriental artists. (Decline and Fall of the Roman Empire)BAWC
The richness of academic data clearly differs in this regard from the richness
of literary data. We would not have, for example, ‘The Hon. W. Herbert declared,
with some pique…’.
19.
The contrast between sparse and rich seems obvious for [1] versus [7-10], or for
[11] versus [12-15], but must be partially an intuitive one, and cannot be
reduced to ‘syntactic rules’ or ‘semantic features’. But then linguistic
theory is not required to do so if authenticity is decided by attestation, not
by abstract features or formal structures (§ 13). One could perhaps extract
from authentic materials some individual sentences that would seem as sparse as
non-authentic data, e.g.:
[20]
I know the man. (Uncle Tom’s Cabin)BAWC
[21]
You are only fourteen. (Cash Boy)BAWC
But their sparseness is merely an illusion created by isolating the data from their
richer contexts, e.g., [20] being a reason to believe what Cassy says to Uncle
Tom about the odious Legree [20a]; or [21] being a reason to doubt whether Frank
will be able to ‘take care of Grace’ [21a].
[20a]
now you’ve got his ill will upon you, to follow you day in, day out, hanging
like a dog on your throat — sucking your blood, bleeding away your life, drop
by drop. I know the man.
[21a]
‘But Grace? She is a delicate girl’, said the mother, anxiously. ‘She
cannot make her way as you can.’ ‘She won’t need to’, said Frank,
promptly; ‘I shall take care of her.’ ‘But you are very young even to
support yourself. You are only fourteen.’
When linguists purport to analyse data ‘free of context’ (§ 15), they are
instead creating artificially sparse contexts where the activities of imagining
‘possible settings’ are performed under the counter. The sparseness is
intensified when linguists go on to convert their non-authentic data into some
formal representation, such as a ‘syntactic structure’ [22], or a ‘general
postulate’ to signify that every office building has a window [23].
[22]
the man hit the ball => T + N + Verb + NP (Chomsky 1957: 26f).
[23]
a: (∀x) [office (x) => building (x)]; b: (∀x) [building (x) => (∃y)
(has (x, y) & window (y))] (van Dijk 1977: 100)
But as long as we are still working with expressions of natural language, such as
‘window’, we retain contact with contexts. The key factor is that
authentic data characteristically occur in rich contexts, which, in my BAWC
data, shed an unflattering light on ‘buildings’ well-supplied with
‘windows’:
[24]
Coketown […] had a vast pile of buildings full of windows where
there was a rattling and a trembling all day long (Hard Times)BAWC
[25]
here and there would be a great factory, a dingy building with
innumerable windows in it, and immense volumes of smoke pouring from the
chimneys (The Jungle)BAWC
20.
I would vigorously contest the assumption implicit in modern linguistics that
the processes of inventing sparse data and making rich data sparse increase the
validity and generality of our description of language (cf. § 82ff, 92). I
would assert just the opposite insofar as these processes are usually arbitrary
and uncontrolled. We are flatly presented with the results (e.g. ‘John admires
sincerity’, § 10f) rather than with an explicit account of how the linguist went
about inventing or
formalising the data, as if these processes were fully underwritten and
guaranteed by native speaker intuition or by an academic degree in linguistics
(cf. § 8, 38).
21.
Moreover, I submit that since language use is empirically found to constitute
rich data, scientific method demands that these must be the central basis of a
valid description or explanation. And since the products of intuition are
empirically found—in the discourse of many linguists—to constitute sparse
data, the production of data cannot be a valid function of intuition (§ 40ff).
Instead, its valid function is to sustain bridges between authentic data and
rich contexts which quite naturally cover more than the data themselves express
— ‘making rich’ the way most ordinary discourse participants do, not
‘making sparse’ the way some formal linguists do. The
validity of our ‘enrichments’ depends on whether they can be verified to be
typical of the language community (not the ‘ideal speaker-hearer’); and
doing so is one of the major tasks we face for the future. But to determine what
modes and instances of enrichment should be verified, we are obliged and
justified in relying on the interaction of authentic data with our own
intuition.
C.
Four responses to corpus research
22.
Corpus work has become a testing ground for the various outlooks upon authentic
data in linguistics, and in other approaches to language as well. I shall sketch
four responses with light-hearted but hopefully mnemonic labels.
23.
At one extreme, the cold shoulder response totally ignores the results of corpus
research. This response signals a strong commitment to linguistic approaches
based on non-authentic data, and a determination to hide from new facts like
those pious prelates who refused to look through Galileo’s telescope (Sinclair
1994). Curiously, we find two utterly disparate outlooks at this same extreme:
the generativists who regard real language as ‘deviant’ and replace it with
ideal language (§ 4f); and the prescriptivists who regard real language as
‘non-standard’ and replace it with purified language (§ 2.5). Both groups feel
entitled to understand the nature of ‘language’ far better than ordinary
speakers do, but for entirely disparate reasons: the generativists because they
have access to the ‘perfect knowledge’ of the ‘ideal speaker-hearer’ (§ 8); and
the prescriptivists because they know just what is ‘correct’ or ‘incorrect’,
‘good English’ or ‘bad English’ (§ 2.5).
24.
At the opposite extreme, the red carpet response is delighted to finally have
such large data samples and heartily welcomes the results. This response signals
a strong commitment to linguistic approaches based on texts but hitherto
compelled to follow opportunistic strategies by getting authentic data wherever
we happened to find them; and by making compromises to invent plausible data
when authentic data were not sufficiently available (cf. § 117). Now we are
happy indeed to be freed from this necessity
by the ‘corporate bridges’ that data can provide between ‘language’ and
‘text’ — or langue and parole, competence and performance, and so on.
These bridges consist principally of regularities which are more specific than
the language but more general than the text; and which are vital for making
texts sound ‘fluent’ or ‘idiomatic’ (Beaugrande 2000, 2001b).
25.
This response appears typical for systemic functional linguistics, which has
all along respected the value of authentic data even whilst highlighting those
systemic factors which are ‘realised’ or ‘actualised’ by instances in the data
(e.g. Halliday 1985). Today, corpus data are being hailed as the most promising
bridge between system and instance, and have lent new energy to the project of
using statistical frequencies to assign relative probabilities to the options of
the grammar (e.g. Halliday 1991, 1992) (cf. § 130-137). The focus of systemic
functional linguistics upon paradigmatic, not just syntagmatic, relations
fosters a natural interest in how some choices are made in coordination with
others, and how frequently.
But prior to corpora, this had to be worked out by hand, which, even for a small
corner of the lexicogrammar, can be horrendously laborious.
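The step from statistical frequencies to relative probabilities can be illustrated with a few lines of counting. This is a minimal sketch under loud assumptions: the system of polarity is chosen merely as a convenient example, the ten labels are invented toy data and not corpus findings, and the function name `relative_probabilities` is hypothetical.

```python
from collections import Counter

def relative_probabilities(choices):
    """Map each option of a system to its relative frequency among the observed choices."""
    counts = Counter(choices)
    total = sum(counts.values())
    return {option: count / total for option, count in counts.items()}

# Invented toy data: each label records the polarity chosen in one clause
observed = ["positive"] * 9 + ["negative"] * 1

probs = relative_probabilities(observed)
for option, p in probs.items():
    print(f"{option}: {p:.2f}")
```

With a corpus, the list of labels would come from tagged clauses rather than being typed in by hand, which is exactly the labour that had to be done manually before.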
26.
The response is typical also for text linguistics, at least in my own view of
the field. At its
best, text linguistics has always been an implicit mode of small-corpus
linguistics, coping with practical and theoretical problems as they arose. Our
guiding rationale throughout has been that working with authentic texts will
bring to light aspects of language and communication that we otherwise miss.
Such was the essential message and demonstration of the 1981 Introduction
(Beaugrande and Dressler 1981).
27.
In between the two extremes, the limp handshake response publicly welcomes
corpus research but privately harbours
misgivings about its potential for creating pressure to change accepted views of
language and familiar methods of teaching it. Our results are at most regarded
as issues to place alongside established ones without disrupting them, e.g., as
modules to be inserted somewhere into the business-as-usual ‘lesson plans’
in EFL teaching.
28.
Also in between the two extremes, the poison needle response exploits academic
or institutional leverage to fend off the implications and results of corpus
research. Our results are regarded as
heresies — ‘mistakes, inadequacies, limitations, distortions, biases’ etc.
etc. — against which the unwary world must be resoundingly warned.
29.
Whereas the identities of ‘cold shoulderers’ and ‘red carpeters’ are
clearly on public record, the same cannot be said of these two groups in
between. Some ‘limp handshakers’ on the conference and lecture circuit
change into ‘poison needlers’ in the shielded preserves of anonymous
reviewing and academic politicking. In fact, the ‘review’ process is never
so prone to unprofessional manoeuvring as when a discipline confronts a
substantive body of evidence with a ‘revolutionary’ potential regarding
dominant theories and methods, as I have documented elsewhere (Beaugrande
2001c) (§ 31). The harder it becomes to deny the findings of authentic data,
the harder these groups will work to keep them out of print and foreclose any
free and open discussion of the issues.
D.
Twenty arguments against corpus research
30.
The time seems opportune to clear the air by reviewing the merits of some
arguments being commonly lodged against corpus research, whether publicly or
privately. They offer predominantly theoretical motives to conclude that corpus
research cannot, in principle, produce significant or applicable results; they
pass quietly over their practical motive to eschew the detailed labour and
technical training corpus research requires. Most of the arguments are found on
close scrutiny to be empty or irrelevant (in the parlance of the BNC, to be a
right load of old codswallop), arising from some fortuitous or wilful
misrepresentation of language and discourse in general or of corpus research in
particular. Others point to genuine substantive problems which we must confront
but which will not, as is apparently hoped, drive us to give up in despair.
D.1.
‘Corpus research is a new fad.’
31.
A ‘fad’ is by definition a new trend which rapidly achieves general
acceptance through sheer brash novelty; and corpus research simply doesn’t
qualify. It can still be difficult to place our results in mainstream journals of
theoretical or applied linguistics whose ‘peer reviewers’ see corpus
research as a threat to their preferred approaches. Until our research attains
the mainstream, we authors should seek alternative strategies and outlets. We
can sustain interactive websites to post our work whenever, in our own judgment,
it seems to be of interest. Readers who do or do not agree with it are invited
to justify their opinions with thorough and substantive arguments instead of
using ‘anonymous negative reviews’ to suppress our work with no public
accountability.
32.
And, far from being a ‘novelty’, corpus research is in reality older than
most of the linguistics now arrayed against it. Already at the inception of the
field, corpus research was established in fieldwork. Philology had
inaugurated the compilation of atlases for well-known languages (e.g. Wencker
1887-95), which was continued in descriptive linguistics (e.g. Kurath 1949). In
my own view, the most impressive achievement in all of linguistics was the
description of previously undescribed languages of native America, Africa, and
the Pacific, often without the aid of bilingual informants or decent audiovisual
recording equipment (e.g. Sapir 1922; Hockett 1939; Pike 1944; Hoijer 1945;
Newman 1947; Pittman 1948). These studies founded a stream of corpus-based
fieldwork that has continued right up to the present (e.g. Eberhard 1995;
Wannemaker 1999; Newman and Ratliff eds. 2001).
33.
This monumental work most firmly established linguistics as an accredited social
and human science up until the so-called ‘scientific revolution’ that turned
against descriptive methods in the 1960s. This turn was to some extent
prefigured in long-standing uncertainties about data since Saussure, as I noted
in section A; but the deliberate and programmatic substitution of invented data
for observed data, and of the scientist’s own intuition for the reports of
informants, was a real novelty without precedent in any science, and from
today’s standpoint, deserves to be called instead an ‘anti-scientific
revolution’ (§ 12). Thus cut loose from authentic data, and licensed to
devise arbitrary ‘formalisations’, linguistics has proliferated genuine fads
(§ 3). Some forty ‘formal theories’ of language have competed for adherents
(Escribano 1993); and the definitive refutation of any one is hardly feasible if
substantive data cannot be adduced.
34.
Corpus research accordingly represents a return to the roots of linguistics, now
equipped with cutting-edge technologies (cf. McEnery and Wilson 1996). We seek
to bring about a ‘scientific revolution’ that restores what was lost in that
‘anti-scientific revolution’ against data. In the process, our dependence
upon authentic data for our claims and demonstrations precludes any mere
faddishness. However, we are definitely in a phase of swift evolution in
our theories and practices, and the outcome is by no means clearly foreseeable
(Sinclair 1997a, 1997b, 2001). And surely that is grounds for optimism, not
pessimism.
D.2.
‘Corpus research is subjective, not objective.’
35.
This argument is a heritage of the positivism and physicalism that triumphantly
heralded the ‘unified science’ in the early 20th century (Neurath
et al. 1938). It was predictably applied to linguistics to assist its
accreditation as a relatively new science (e.g. Bloomfield 1930). Since language
as a whole hardly seemed amenable to treatment as a physical object, it was
dismantled to isolate some of its more amenable aspects, especially the
phonetics of articulation and the acoustics of audition (Jones 1914). Under the
aegis of behaviourism, real speech could thus be readmitted despite exclusions
like Saussure’s (§ 4), e.g., as a chain of ‘verbal behaviour’ composed of
objectively observable pairs of ‘stimulus and response’ (Bloomfield 1933;
Skinner 1957). In this purview, ‘speech’ constitutes ‘cause-and-effect
sequences exactly like those we may observe in the study of physics’
(Bloomfield 1933: 33).
36.
The programmatic turn of mentalism against behaviourism curiously retained some
notion of language as a set of physical objects. Now, the convention in
‘physics’ whereby ‘any scientific theory is based on a finite number of
observations’, which it ‘relates’ and ‘predicts’ ‘by constructing
general laws’, was compared to a ‘grammar of English based on a finite
corpus of utterances (observations)’, ‘containing grammatical rules (laws)
stated in terms of phonemes, phrases, etc.’ and ‘expressing structural
relations among sentences of the corpus and the indefinite number of sentences
generated by the grammar’ (Chomsky 1957: 49). This comparison presumably
helped the new ‘theory’ along, even though generative linguistics soon found
its reasons to banish both ‘corpus’ and ‘observation’ (§ 4). In return,
the sentence assumed some traits of a physical object. It became a ‘string’
with a ‘surface’; its ‘structures’ can ‘branch’ to the ‘left’
and the ‘right’, or can be ‘raised’ and ‘lowered’; and so on. Such
objectifying notions help to fill the void left by draining the authenticity out
of the data.
37.
Corpus research holds the potential to transcend the competitive dichotomies
between objective versus subjective, and between behaviourism versus mentalism.
The larger the corpora and the more consensus and coverage we can achieve, the
brighter our prospects to attain intersubjectivity vis-à-vis a language
community. If solely subjective methods seem too broad and loose, solely
objective methods seem too narrow and rigid for data as rich and variegated as
ours.
38.
In view of the problems I have aired above, the position of the corpus linguists
themselves cannot be treated so casually as that of linguists who claim to know
all about the ‘ideal speaker-hearer’ (§ 8). We cannot escape our own
subjectivity as the physicalists and behaviourists aspired to do; but neither
can we exalt it as our privileged source of data, as the mentalists and
generativists proposed to do. Instead, we should invest it in our explorations
like a partial and fallible map to be filled in or corrected when the data
require it. For example, when I read this passage some years back:
[26]
They began trotting, […] tanned graduate students striding like gazelles ahead
of the pack; middle-aged duffers, white hairy legs pumping, bringing up
the rear (Lonely Hearts of the Cosmos)
I projected too much from the context and imagined ‘duffers’ to be middle-aged,
flabby men. Today I can see from BNC data they are just people who are awkward
at something:
[27]
she had always been considered a complete duffer at languages. (Hypnosis
Regression Therapy)BNC
[28]
One longs for the Germans to give up trying to make facsimiles of other people's
cheeses. They are terrible duffers at it. (An Omelette and a Glass of
Wine)BNC
I suspect such minor personal lapses are more commonplace than we’d like to
think. But corpus data offer us the means for being less of duffers at grasping
unfamiliar words in context.
D.3.
‘Corpus research seeks to banish intuition.’
39.
This argument sounds like the exact reverse of the previous one, and faults us
for not being subjective enough and not allying ourselves with mentalism against
behaviourism. The sources for this argument (e.g. Widdowson 1991; Owen 1993)
evidently propose yet another competitive dichotomy, this one between corpus
versus intuition, as if we must choose only one. This dichotomy entails a
serious category error, because every mode of contact with language
implicates intuition. No matter how rich, the data never speak for themselves or
declare their own significance. And corpus linguists must at least partially
approach data from the standpoint of a potential audience, say, the fans who
read the sports news in the Independent and Today [7-8], and who
have heard of Leconte and Hoddle (which I hadn’t).
40.
However, we would transform the role and function of intuition so as to enlist
it in the purposes of corpus research (Francis and Sinclair 1993). It should be
transposed from before the fact
— the source for supplying the occasional sentence from the linguist’s own
mental data-bank — to after the fact — the resource for interpreting
discourse samples from a collaborative electronic data bank. Our interpretations
will in part run parallel to those of the wider language community, but will
also run at the higher awareness and in the broader scope enabled by multiple
bridges between authentic data and rich contexts (§ 21, 58). However, we imply
no claim to any superhuman rationality whereby we command access to some
‘universal deep structure’ of language or to the ‘perfect knowledge’ of some
‘ideal speaker-hearer’ (§ 8, 23).
41.
This transposition is vital insofar as intuition is much less adept in prediction
than in retrospection. If, like me, we work as English teachers, we often
get asked how one should say this, that, or the other; working with corpus data
has made me much more circumspect about answering. My own intuition, at any
rate, does not run at the degree of precision needed to give reliable
information on specific questions. When a student wrote [29], my response was
that the Verb is not used that way. But corpus data proved me wrong with samples
like [30-31]. And in the 1913
edition of Webster's Dictionary (before radio days), this meaning was the
main one [32].
[29]
The woman follow the oxen them to broadcast seeds
[30]
sowers flinging their seed about broadcast (Mayor of Casterbridge)BAWC
[31]
The second method is to broadcast the seeds together with not more than 1 kg.
to the acre of rape and turnips in late June or early July. (The Challenge of
Smallholding)BNC
[32]
Broadcast (Agric.) 1. A casting or throwing seed in all directions, as from the
hand in sowing; 2. Scattering in all directions (as a method of sowing); opposed
to planting in hills, or rows
42.
Sinclair (2001: 10) has recently remarked that it might be ‘difficult
for one who has been a Professor of Modern English Language for 35 years to
admit that there are words whose meaning he does not know’ and ‘several
thousand words of English’ he ‘could not define’. But this admission feels
difficult mostly where teachers of English have, willingly or not, been seen in
the role of infallible authorities — an unfortunate tendency from which access
to corpus data should finally release us (§ 155). In the future, our role will
be to assist people in accessing data they need; and our authority will depend
on having scanned large data sets, and not just on holding an ‘advanced degree
in language’ (cf. § 8, 23, 40).
43.
What teachers and learners of English alike require, and Sinclair says so, is
not massive standing vocabularies so much as skills for grasping the meanings of
expressions in real contexts. Even familiar words may be found in unfamiliar
meanings, as befell me with ‘broadcast’. The trick is to pick the rich data
whose contexts are most helpful. For example, the Modifier ‘knackered’ was
not in my own vocabulary, and rich data like [33] intuitively led me to the
meaning of ‘physically exhausted’. I could then extend the meaning by
analogy to sparser data concerning ‘inflation’ [34] or a ‘car axle’
[35]. But I had to find a different rich context for the Noun ‘knacker’
[36], which suggests to me an ominous derivation for the Modifier.
[33]
I forced myself towards it. I was utterly knackered. It took my last
reserves of strength and will to reach it and then to heave myself in. (Pilot)BNC
[34]
you can’t have a decent life if you’ve got high inflation all the time knackering
you up (conversation)BNC
[35]
In no time, we’ll have done in £500 worth of tyres and knackered the
rear-axle. (Esquire)BNC
[36]
Richard Cross was a knacker in Camden Town. He supplied dead horses and
asses for dissection, and also dealt in dead cows. (Royal Veterinary College)BNC
44.
Still less reliable is intuition in predicting frequencies. When I was adapting
Halliday’s (1985) ‘functional lexicogrammar’ that describes ‘Processes’ by such
categories as ‘Transitivity’ and ‘Ergativity’, my intuition predicted that a key
distinction would be whether and how far a Process is judged to be under the
control of the Agent or Initiator (Beaugrande 1997). This in turn ought to show
up as frequencies in corpus data for the Verbs collocating with ‘could not
help’ and ‘couldn’t help’, where you say that a spontaneous Action was not fully
under control. This usage might be classed as a Face-Saving Auxiliary, along
with ‘couldn’t resist’, ‘couldn’t refrain from’, and so on: expressions which
attenuate the Agency of Process Verbs after some Action that might indicate
insufficient regard for social norms.
45.
But my intuition could by no means have predicted the actual frequencies of the
particular collocations I found among the 515 occurrences returned
from the Bank of English (BoE) in July 1994, then at 226 million words. There,
just four Process Verbs, ‘feel’ (68 occurrences), ‘think’ (59),
‘notice’ (58), and ‘wonder’ (49), totalled up to 234, 45% of the data.
Still, my intuition can retrospect upon these data by noting that these Verbs
are prime examples of Processes which might well elude the Agent’s full
control and which might lead into emotions, perceptions, and thoughts which
render some speakers of English self-conscious.
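The proportion just quoted can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the counts are those reported above for the BoE query, whereas the variable names are hypothetical and belong to no corpus tool.

```python
# Counts as reported in the text for the Bank of English query (July 1994).
boe_total = 515  # all occurrences returned by the query
boe_verbs = {"feel": 68, "think": 59, "notice": 58, "wonder": 49}

# Sum the four most frequent Process Verbs and express them as a share
# of all occurrences returned.
top_four = sum(boe_verbs.values())
share = top_four / boe_total

print(f"top four verbs: {top_four} of {boe_total} ({share:.0%})")
```

Keeping the counts in a dictionary makes it simple to add further Verbs if a later query returns them.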
46.
In the BNC, which is 44% of the size the BoE was at that time, I find 225
attestations of ‘could not help’ and 378 of ‘couldn’t help’
collocating in these proportions for the same Verbs: ‘feel’ (61), ‘think’ (58),
‘notice’ (44), and ‘wonder’ (37), for a total of 200, 33% of the data. In
percentages of the BoE totals, these would be 88 - 98 - 76 - 75, or on average
84%, quite high for a corpus only 44% as large. This factor might be due to
differences in their composition, notably the higher proportion of news media in
the BoE and of popular fiction in the BNC; news reporters hardly write about
what the Prime Minister or the Queen ‘couldn’t help feeling’. Or, we may
simply be encountering the accidental scatter to be expected by working at finer
degrees of precision in very large complex systems. But the way the four Verbs
nicely lined up the same relative to each other in both corpora suggests that
scatter may not prove to be a serious problem.
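The comparison between the two corpora can be sketched as follows, assuming only the counts printed above; the helper `ranking` is a hypothetical name introduced for illustration. Figures recomputed from the printed counts may differ by a point or two from the rounded percentages quoted in the text.

```python
# Counts per Verb as reported in the text for each corpus.
boe = {"feel": 68, "think": 59, "notice": 58, "wonder": 49}
bnc = {"feel": 61, "think": 58, "notice": 44, "wonder": 37}

def ranking(counts):
    """Return the Verbs ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)

# BNC count as a proportion of the BoE count, Verb by Verb.
ratios = {verb: bnc[verb] / boe[verb] for verb in boe}
mean_ratio = sum(ratios.values()) / len(ratios)

# Do the four Verbs line up in the same relative order in both corpora?
same_order = ranking(boe) == ranking(bnc)

for verb, ratio in ratios.items():
    print(f"{verb}: {ratio:.0%}")
print(f"mean: {mean_ratio:.0%}, same order: {same_order}")
```

Comparing rank order rather than raw counts is one simple way to set aside differences in corpus size when asking whether scatter is a problem.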
47.
In my BAWC data, in its turn only 38% as large as the BNC, I found 944
occurrences of ‘could not help’ and 291 of ‘couldn’t help’ for a
rather massive total of 1235. We can safely attribute this high proportion to
the literary status of the data, as a text domain where feelings, thoughts, and
so on, are often presented from a narrator’s standpoint. Also, the 19th-century
society described by many of my writers had rather firm ideas about what one
really shouldn’t ‘feel’ though one may not be able to ‘help’ it: in my
data, ‘bitterness, distrust, jealousy, envious, angry, hurt’ etc.; and, more
intensely, ‘utter hopelessness, delirious happiness, infinite pity’.
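How massive the BAWC total is becomes clearer when the raw counts are normalised by corpus size. The sketch below is offered under explicit assumptions: only the BoE size (226 million words) is stated directly in the text, and the other two sizes are derived here from the stated proportions, so they are approximations and not official figures.

```python
# Corpus sizes in millions of words. Only the BoE figure is given in the
# text; the other two are ASSUMED from the stated proportions
# (BNC = 44% of the BoE, BAWC = 38% of the BNC) and are approximate.
boe_size = 226.0
bnc_size = 0.44 * boe_size
bawc_size = 0.38 * bnc_size

sizes = {"BoE": boe_size, "BNC": bnc_size, "BAWC": bawc_size}

# Total occurrences of 'could not help' plus 'couldn't help' per corpus,
# as reported in the text.
hits = {"BoE": 515, "BNC": 225 + 378, "BAWC": 944 + 291}

# Occurrences per million words, so corpora of unequal size can be compared.
per_million = {name: hits[name] / sizes[name] for name in hits}

for name, rate in per_million.items():
    print(f"{name}: {rate:.1f} occurrences per million words")
```

On these assumed sizes, the rate rises from the BoE through the BNC to the BAWC, which agrees with the attribution to literary data.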
48.
The four Verbs did not line up this time, but collocated at these proportions in
the BAWC: ‘feel’ at 84, ‘think’ at 106, ‘notice’ at 17, and ‘wonder’ at 26 for a
total of 233, roughly 19% of all the data. In relation to the BNC data, ‘feel’
and ‘think’ are sharply skewed at 137% and 252%; ‘notice’ fits almost exactly at
40%; and ‘wonder’ is less sharply skewed at 72%. The presence of ‘think’ becomes
still more obtrusive when we take into account the 30 occurrences of ‘feel
that’ plus a Clause in a sense similar to ‘think that’; still, in 12 of those,
feelings were clearly involved too, e.g.:
[37]
he could not help feeling that he was getting the worst of it—there was some
faint stigma attached (Sister Carrie)BAWC
[38]
she cannot help feeling that her children are cruelly handicapped by the fact
that he is their father, nor can she help feeling guilty about it (In Defense
of Women)BAWC
A plausible explanation might be the far heavier representation of literary and
popular fiction in the BAWC than the BNC, and the attractions for authors to
tell us what someone ‘couldn't help thinking’, even if one would hardly say
so, e.g.:
[39]
Mr. Wharton could not help thinking: ‘How poorly this young man compares with
my young friend. Still, as he is Mrs. Bradley's nephew, I must be polite to
him.’ (Cash Boy)BAWC
[40]
I could not help thinking that, with his queer head and length of thinness, he
was made to hop along the road of life rather than to walk, […] and I bade my
inward spirit keep close to discretion. (Pointed Firs)BAWC
49.
Such explanations are themselves products of intuition, but they are still
data-driven; the data placed before me some facts of usage I had never noticed
as such, despite fairly extensive readings in literature. Of course, without a
great deal more data, I cannot say whether the proposed category of
‘Face-Saving Auxiliary’ should be regarded as explanatory. I can only offer
it as plausible and point to confirming evidence, such as the powerful
preference to colligate with ‘I’ as the Subject who ‘couldn’t help’:
150 occurrences in the BoE, 171 in the BNC, and 448 in the BAWC; the face that
most often needs saving is the speaker’s.
D.4.
‘Corpus data represent outer behaviour rather than inner knowledge of language.’
50.
This argument, also put forward by Widdowson (1991), fits the previous one,
portraying us as studying only behavioural factors and ignoring mental factors.
Here we
should recall the long-standing controversy within linguistics and adjacent
fields like philosophy and psychology between behaviourism versus mentalism
(e.g. Skinner 1957; Chomsky 1959, 1965; discussion in Beaugrande 1980). Both
sides presented their case as if we must accept the one and reject the other —
a gesture closer to academic politics than scientific method. Purely behavioural
data could only consist of observable events of the body. Such data could
represent text or discourse only as an array of articulatory and acoustic
operations for speech, or of inscriptions and visual recognitions for writing (§
35); and corpus linguistics certainly does not propose to describe language in
those terms, nor does a corpus represent language that way.
51.
Purely mental data would only consist of non-observable events of the mind. Such
data could represent text or discourse only as an array of meanings, intentions,
mental images, and so on, as distinct from the act of their expression. But they
can become our data only when they are expressed, and there some reprocessing
occurs, notably, to convert a hierarchical network of activations into linear
sequences (Beaugrande 1980, 1984).
52.
Surely language and discourse represent the most elaborate interaction of
body and mind. So our research needs to sustain a dialectical cycle between the
behavioural and mental. One prospect is to exploit the rich indicators in corpus
data about how outward behaviour might be interpreted in respect to people’s
inner knowledge, e.g. when you ‘survey
them from head to foot’, or ‘look full in their face’:
[41]
after surveying Mr. Winkle from head to foot, [he] said: ‘You’re a wery
humorous young gen’l’m’n, you air, sir!’ ‘What do you mean by this
conduct, Sam?’ inquired Mr. Winkle, indignantly. ‘Get out, sir, this
instant.’ […] ‘I shall leave this here room, sir, just precisely at the
wery same moment as you leaves it’, responded Sam, speaking in a forcible
manner, and seating himself with perfect gravity. [And he] planted his hands on
his knees, and looked full in Mr. Winkle’s face, with an expression of
countenance which showed that he had not the remotest intention of being trifled
with. (Pickwick Papers)BAWC
Such indicators help us to enlist our data in interpreting them in ways the
producers might plausibly have intended.
D.5.
‘Corpus data do not reveal what is possible but only what is performed.’
53.
This argument, once more advanced by Widdowson (e.g. 1991), might be construed
as yet another re-issue of the dichotomy of langue and parole or competence and
performance, but the fit is not exact. The possible must include all of the
performed, whether a speaker or writer is judged competent [42] or incompetent
[43] in an ordinary sense:
[42]
There was a communication before her, one which she only could be competent
to make: the confession of her engagement (Emma)BAWC
[43]
What he would have asked her he did not say, and instead of encouraging him she
remained incompetently silent. (Mayor of Casterbridge)BAWC
According to Widdowson (1991: 13), ‘Chomsky’s view is that you go for the
possible, Sinclair’s view is that you go for the performed’. But as I have pointed
out, Chomskyans appear to go further by proposing a theory to distinguish the
possible from the impossible; and when they invent ‘well-formed’ and
‘ill-formed’ sentences that accentuate this distinction, they in effect
contrast two sets of non-events (§ 10ff).
54.
For corpus research, the conditions of a performance possess far greater
relevance than the simple fact, and here too we can usefully distinguish
authentic from non-authentic data (§ 14). The inventing of data by a linguist
is a peculiar performance and so is less, rather than more, significant for
representing the ‘competence’ underlying authentic performances. As I have
remarked, the linguist paradoxically implies a claim to special competence,
despite the assertion of a ‘completely homogeneous speech-community’ (§ 8).
55.
Within the set of authentic events, we should further distinguish between more
or less probable, and this we can attempt only by examining very large
sets of events — more precisely, interactions among co-occurring events
wherein some events make others more or less probable, and also more or less
natural, fluent, idiomatic, and so on, for intuitive retrospection. The collocations
and the colligations in performance offer the only rational
perspective on collocability and colligability in competence
by suggesting where to search for them.
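How co-occurring events make one another more or less probable can be sketched by counting the word that follows a node phrase and converting the counts to relative frequencies. This is a toy illustration only: the four samples are attestations quoted earlier in this paper, far too few for any real estimate, and the regular expression is an assumption about how the node might be matched, not a description of any corpus software.

```python
import re
from collections import Counter

# Toy sample of attestations; a genuine study would query a full corpus.
samples = [
    "he could not help feeling that he was getting the worst of it",
    "she cannot help feeling that her children are cruelly handicapped",
    "Mr. Wharton could not help thinking: How poorly this young man compares",
    "I could not help thinking that, with his queer head and length of thinness",
]

# Capture the word immediately following the node phrase.
node = re.compile(r"\b(?:could not|couldn't|cannot) help (\w+)")

collocates = Counter(
    match.group(1) for text in samples for match in node.finditer(text)
)
total = sum(collocates.values())

# Relative frequency of each collocate, given that the node has occurred.
for word, count in collocates.most_common():
    print(f"{word}: {count}/{total} = {count / total:.2f}")
```

With a very large sample, such relative frequencies would indicate where to look for collocability; with four lines they merely show the procedure.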
56.
Sinclair (1994) has suggested a visual analogy to the list display returned when
we query a corpus for a given key word or phrase: reading horizontally, we
encounter performance; reading vertically, we encounter competence. I would add
that either dimension is a glimpse rather than a vision, or a snapshot rather
than a video — both the horizontal and the vertical extend further than our
vision could take in. Competence will routinely encompass more uses for the key
than we see; and performance will be saying more than the context in the
display.
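The list display in this analogy can be sketched as a minimal key-word-in-context routine. The function `kwic` and the constant `WIDTH` are hypothetical names chosen for illustration, and the three texts are samples quoted earlier in this paper; real concordancers offer far more (sorting, lemmatisation, wider context).

```python
WIDTH = 25  # characters of context shown on each side of the key

def kwic(texts, key, width=WIDTH):
    """Return key-word-in-context lines with the key aligned in one column."""
    lines = []
    for text in texts:
        start = text.find(key)
        while start != -1:
            end = start + len(key)
            left = text[max(0, start - width):start]
            right = text[end:end + width]
            # Right-justify the left context so every key starts at `width`.
            lines.append(f"{left:>{width}}{key}{right}")
            start = text.find(key, end)
    return lines

texts = [
    "he could not help feeling that he was getting the worst of it",
    "Mr. Wharton could not help thinking: How poorly this young man compares",
    "I could not help thinking that, with his queer head and length of thinness",
]

key = "could not help"
lines = kwic(texts, key)

# Horizontal reading: each line is one stretch of performance.
for line in lines:
    print(line)

# Vertical reading: the slot after the key, taken down the whole column.
following = [line[WIDTH + len(key):].split()[0].strip(".,:;") for line in lines]
print(following)
```

The truncated context on each line is the ‘snapshot’: both the line and the column extend further than the display shows.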
D.6.
‘Corpus linguistics adopts an exclusively third person perspective.’
57.
This mildly arcane argument ties in to the previous two. It too was advanced by
Widdowson (1991: 15), and in these terms:
The description of internalised language requires a first person perspective. You
really have no choice if you are seeking to prise knowledge out from the
recesses of the mind: knowledge which is not realised as behavioural evidence
available to the observer […] Corpus linguistics […] adopts the third person
perspective and only describes what can be observed, [and so cannot] reveal
[…] ‘member categories’ […] of the speech community itself which account
for their intuitions about the language.
Widdowson appears to conflate the ‘Persons’ in the grammar of English Verbs,
which are
fairly distinct in their forms, with the roles of the participants in discourse,
which are not. A speaker or writer usually has no need to frame his or her own
views by saying ‘I believe’ or ‘I assert’ and such like; and in
academic discourse, the use of the First Person Singular is indeed actively
discouraged by prescriptive teachers or editors, ostensibly to enhance
‘objectivity’. (I cannot agree that it does; in the present paper,
advocating the return to real data, I return myself to a real author.) In
literary discourse, in contrast, author and reader can emerge as ‘I’ and
‘you’ by convention, e.g.:
[44]
Gentle reader, may you never feel what I then felt! May your eyes
never shed such stormy, scalding, heart-wrung tears as poured from mine.
(Jane Eyre)BAWC
The frank literary conventions probably tell us more about participant roles
than do the staid academic ones. Speakers or writers so frequently use
themselves as models for their hearers or readers that first and second person
roles only
occasionally need to be made fully distinct, e.g., in writing letters [45]. Also
‘you, my reader’ is not clearly distinct from just ‘the reader’ in the
Third Person, e.g. [46].
[45]
Dear Niece: I am writing this in a hurry, as we are going a week before we
expected to. I think you will find everything all right. (Lavender and Old
Lace)BAWC
[46] It would make the Reader pity me, or rather laugh at me, to tell how many awkward ways I took to raise this Paste (Robinson Crusoe)BAWC