for WORD
ROBERT de BEAUGRANDE
Descriptive linguistics at the millennium:
Corpus data as authentic language
In the best sense of the word, descriptive linguistics must be practical,
[…] designed to handle instances of speech, spoken or written
J.R. Firth
1. Theory and practice in the concept of description
1.1. If we agree to use our terms quite broadly, we
can define a language to be a general theory of human knowledge and
experience, and discourse to be the set of practices for working out the
theory (cf. Sapir 1921; Hartmann 1963; Halliday 1994). Language would be a
theory — or a whole network of criss-crossing ‘theories’ — for
representing our world and ourselves and each other in the world, and for
constructing alternative states of the world or alternative worlds. We
understand each other insofar as our theories of our language are similar in
principle and get more finely tuned during discourse (Beaugrande 1997a).
1.2. The relations between theory and practice would logically constitute a dialectic,
being an interactive cycle wherein two sides guide or control each other. When
the dialectic is working smoothly, the practice is theory-driven, and the theory
is practice-driven; the theory predicates and accounts for the practice; and the
practice specifies and implements the theory. The
real-life practices of discourse are strongly ‘theory-driven’ in obliging
the participants to ‘theorise’ about what words mean, what people intend,
what makes sense, and so on. Indeed, discourse is the most theoretical practice
humans can perform, and also the most efficient and effective in using the least
effort for the most goals. In return, language is the most practical theory
humans can devise, offering the resources to shape and guide almost any of our
practical activities.
1.3. Yet the ‘theoreticalness’ of language is dexterously concealed from the
majority of speakers who practise it. If asked, they would probably describe
discourse as a thoroughly practical matter; they would be surprised if we told
them they possess a ‘theory of their language’ that gives them the status of
‘theoreticians’. No doubt the theory can be practised so efficiently because
many operations function below the level of conscious awareness; in return, the
nature and organisation of the theory are difficult to determine or describe by
means of introspection alone (but cf. 1.8ff; 3.36f; 4.4).
1.4. Moreover, a language is a unique type of theory. It
cannot be conclusively verified or falsified in the conventional manner of a
scientific theory, because we cannot adduce some language-independent testing
grounds, such as a set of free-standing meanings for which the language could be
judged a valid or invalid expression. Instead, language is a theory that
partially creates and constitutes what it postulates, and thus tends to confirm
itself. For practical purposes, we normally take things to be what our language
calls them. When we wish to express them more validly, we can practise our
language more elaborately; we cannot suspend its practices and go to meanings or
things without it. We cannot get outside language to inspect it.
1.5. By the definitions proposed above, a ‘theory of
language’ expounded in modern linguistics would more precisely be termed a meta-theory,
whereas the discourse we produce to expound the theory would manifest our own meta-practices.
“The constructs or schemata of linguistics” could thus be described as
“language turned back on itself” (Firth 1957 [1950]: 190). This convolution
renders linguistics unique among the sciences. We set about formulating an explicit
theory of language whilst we already sustain an implicit theory
as language; and our formulations are instances of practising the latter
theory. Moreover, every explicit theory proposed so far undoubtedly falls far
short of the richness and complexity of the implicit theory, though we may not
be able to demonstrate just how.
1.6. Modern linguistics might in turn be characterised
as a set of projects for rendering explicit the implicit ‘theoreticalness’
of language. Yet linguistics has been signally undecided about deriving its
theories dialectically from the description of the ordinary practices of text
and discourse. The most resolute position has been adopted in fieldwork
linguistics. Providing descriptions of previously undescribed languages is
by necessity practice-driven, since data in and about the language must come
from observing the practices of native speakers. In addition, the fieldworker
must subject every step in the theorising about the language to practical tests
with informants. Achieving a reasonable fluency in the language demonstrates a
practical competence that should plausibly enhance the authority of one’s
theoretical statements.
1.7. Still, fieldwork is theory-driven in its own ways.
The linguist holds a general conception about possible types of language, e.g.,
whether one is “analytic” like Annamite of Vietnam, or “polysynthetic”
like Yana of California (Sapir 1921:142). The type is a high-level meta-theory
directing attention to certain classes of features or patterns, such as
“reduplication” to “indicate such concepts as distribution, plurality,
repetition, customary activity, increase in size” or “intensity” (Sapir
1921:76). But the fieldwork linguist is always stimulated upon discovering some
previously unknown feature or aspect, e.g., when Dyirbal of North Queensland
was found to have a separate Dyalŋuy variety or dialect used only in the hearing
of taboo relatives like a man’s mother-in-law or a woman’s father-in-law
(Dixon 1968). Such discoveries are also of interest to neighbouring disciplines
in the social sciences of sociology, anthropology, and ethnography (cf. 3.8;
3.40).
1.8. The opposite approach commonly goes by the name of
‘theoretical linguistics’ but might, for the present discussion, be more
aptly called homework linguistics.1 It is heavily
theory-driven, and presents invented data from well-described languages, notably
English, of which the linguists are fluent or native speakers from the start.
Instead of deriving the theory of a particular language dialectically by
describing its practices, ‘homeworkers’ derive a theory of language in
general by a theoretical bootstrapping that combines their own intuition and introspection
with conceptions sporadically borrowed from language philosophy, formal logic,
or mathematics (cf. 3.22). The standards of science are to be upheld by
‘theorising’ the more practical and ordinary qualities out of language. The
most scientific
statements should describe ‘language’ in the most abstract and general
sense, and ultimately in terms of ‘linguistic universals’ (cf. 1.16, 20).
1.9. The decisive step in this outlook was to “give priority to
introspective evidence” and “intuition” (Chomsky 1965:20). The homework
linguist was now said to command an “enormous mass of
unquestionable data” merely by virtue of holding the “linguistic
intuition of the native speaker”; and precisely for these “data”, a
“description, and, where possible, an explanation” were to be “constructed”
(1965:20). The linguist would apparently become the representative of the “ideal
speaker-hearer in a completely homogeneous speech-community, who knows its
language perfectly” (Chomsky 1965:4) (1.13). Yet to discredit fieldwork with
informants, homework linguists felt impelled to deny that the “speaker of a
language”, who has “mastered and internalised a generative grammar, is aware
of the rules of the grammar or even” “can become aware of them”; and that
“his statements about his intuitive knowledge are necessarily accurate”,
since “a speaker’s reports and viewpoints about his behaviour and competence
may be in error” (1965:8). These denials should cast serious doubts upon
authorising linguists to act as model “speakers”, unless their academic
training and status grant them super-human powers of introspection (1.12; 3.36).
But then they would be patently untypical and unsuited as models of a
“completely homogeneous speech-community”.
1.10. Such perplexing lines of argument might help to
explain why homework linguists have so often used data from a well-described
language like English, besides just being native speakers. They could presuppose
extensive information about the language and did not have to supply it. They
could exploit their own intuition and introspection to swiftly elevate their
deliberations up beyond the laborious problems of fieldwork in order to address
purely theoretical rather than practical issues: theory becomes meta-theory, or,
in the terms proposed here, meta-meta-theory; and their discourse on language
manifests not just meta-language but meta-meta-language. So the discussion
naturally seeks illustrations in invented data whose status seems so secure as
to camouflage the role of the linguist as inventor, e.g.:
(1) The farmer kills the duckling (Sapir)
(2) John ran away (Bloomfield)
(3) The man hit the ball (Chomsky)
Paradoxically,
such data were invented to seem incontestable, yet they can be empirically
classified as non-authentic insofar as they do not spontaneously occur in
ordinary discourse.2 Nonetheless, these same
data, accompanied by rather cursory descriptions, have often been adduced to
support general statements about the nature of language, e.g., that “word
order is unquestionably an abstract entity” (Saussure) or that “grammar
is autonomous and independent of meaning” (Chomsky). The essential
paradox thus consists of basing a general theory upon special cases by expressly
selecting data devoid of special features (cf. 4.2).
1.11. Moreover, non-authentic data represent an
unannounced compromise between “langue and parole”, or “competence and
performance”, which homework linguistics has separated by a radical dichotomy.
Saussure had roundly asserted that “speech cannot be studied”,
“for we cannot discover its unity”; it is only a “heterogeneous mass” of
“accessory and accidental facts” (1966 [1916]:9, 11) (cf. 1.21f; 3.13;
3.17). In the same vein, Chomsky (1965:4, 201) asserted that the “observed use
of language” “surely cannot constitute the subject-matter of linguistics, if
this is to be a serious discipline”; “from the standpoint of the theory”,
“much of the actual speech observed consists of fragments and deviant
expressions of a variety of sorts”. Such pronouncements suggest that
authentic data do not practise the theory of a language, but seriously disrupt it.
The production of such data would resemble a catastrophic phase transition from
the extreme order of language over to the extreme disorder of discourse. The
speaker takes order, transforms it into disorder and transmits it to the hearer,
who transforms it back into order. Made explicit, this account of the relation
between language and discourse is obviously unsustainable.
1.12. In parallel, homework linguists announced that “the concrete entities of language are not
directly accessible” (Saussure 1966 [1916]:110); and that “knowledge of the
language” is “neither presented for direct observation nor extractable from
data by inductive procedures of any known sort” (Chomsky 1965:18). These
claims too were meant to discredit fieldwork linguistics. But they also imply an
unsustainable
account of native-language learning, namely struggling against the grain of what
a child can “access and observe” — which is “fragmentary and
deviant” anyway. This implication presumably helped to garner
support for the universalist notion of an “innate language acquisition
device” (Beaugrande 1997b, 1998a).
1.13. Once “actual speech” has
been declared “heterogeneous” and “deviant”, the linguist can proceed to
invent non-authentic data which
have been quietly rendered homogeneous and purified of all deviance. Similarly,
if language is represented as an abstract, ideal system, then it is most
expediently exemplified by idealised data. By implication, homework linguists do not represent ordinary speakers
in real life, but rather “ideal” super-speakers who, thanks to their “perfect
knowledge”, can practise
the language with far greater unity and purity (cf. 1.9).
1.14. The perplexities implied for linguistic description
became most virulent in Hjelmslev’s “prolegomena to a theory of language”.3
Though acknowledging that
“the linguist who describes a language” “uses that language in the
description”, he issued a plea to “rise above the level of mere primitive
description to that of a systematic, exact, and generalizing science, in the
theory of which all events (possible combinations of elements) are foreseen”
(1969 [1943]:9, 121). The “theory” would be “applicable even to texts and
languages” that have “never been realised, and some of which will probably
never be realised” (1969:17). This startling project would be the linguists’
equivalent of a theory of everything, or the grand unification theory currently
much sought in physics. “The linguistic theoretician” proceeds to
“discover certain properties present in all those objects that people agree to
call languages, in order then to generalise those properties and establish them
by definition”; by doing so “he decrees to which objects his theory can and
cannot be applied” (1969:18). Such a “linguistic theory” “provides the
tools for describing” “a given text and language”, and “cannot be
verified — confirmed or invalidated — by reference to existing texts and
languages” (1969:18).
1.15. If these methods were literally adopted, the
linguist must examine all the world’s “languages” in the ordinary sense
(that “people agree” about) and construct the theory solely out of those
“properties” that have in fact been “discovered” everywhere. Then, it
would trivially, indeed automatically apply to all languages without requiring
any “decree”, “verification”, or “confirmation”. Yet the set of
properties would undoubtedly be far too small, abstract, and general to
“provide tools for describing a text” (4.5). One could only describe the
features that the text shares with every other text in every language, including
languages that don’t exist and never will — an esoteric exercise, to put it
mildly.
1.16. When Saussure had earlier counselled “the linguist” to “acquaint himself
with the greatest possible number of languages in order to determine what is
universal in them”, he had surmised that “the diversity of idioms hides a profound
unity”, and that “all idioms embody certain fixed principles that the
linguist meets again and again” (1966 [1916]:23, 99). But he had conceded that
“it is very difficult to command scientifically such different languages”; and wryly
concluded, with immense understatement, that “the ideal, theoretical form of a
science is not always the one imposed upon it by the exigencies of practice”
(1966:99). Not so Hjelmslev, who conjured the ideal whereby “mere primitive
description” would be replaced by “self-consistent and exhaustive
description” (1969:9, 18). To judge from his published work, he never tried to
present such a description of any text, and so did not confront its
impracticability as a method.
1.17. To include all non-existent, merely
“possible” languages, the set of languages to which Hjelmslev’s
“theory” could apply would be infinite; as a corollary, so too would be the
set of “texts” to be “described”. If so, the results of describing a
text or a set of texts would always seem too restricted to claim genuine
significance — just as homework linguists of the generative school would
predict anyway (1.20). Yet, again by implication, the processes of comprehending
a text would be infinite as well, which is blatantly false. Here, we see how far
the requirements placed upon description vastly overreach actual language, even
though, as I suggested, the theory falls far short (cf. 1.5). In parallel,
the “competence” and “perfect knowledge” of the “ideal
speaker-hearer” (1.9) vastly overreach the performance and knowledge of real
speakers. Both overreachings render homework linguistics empirically vacuous:
striving to describe everything at once and not describing anything.
1.18. I would argue this point just as emphatically
for the definition of “language” as an “infinite set of sentences” (e.g.
Chomsky 1957:13), presumably calculated to suggest that the description of data
was not merely impracticable, but incapable in principle of ever leading to a
theory of language (or to a “grammar”). Yet an “infinite set” would
contain every conceivable sentence, including the most flagrantly improbable
ones offered as counter-examples (like “colourless green ideas sleep
furiously”). The paradoxes of the infinite inhabit imaginative prose, such as
that of Jorge Luis Borges. In his infinite library:
For every line of straightforward statement, there are leagues of senseless
cacophonies, verbal jumbles, and incoherences. […] Homer composed the Odyssey;
if we postulate an infinite period of time and infinite circumstances, the
impossible thing is not to compose the Odyssey (Borges 1964: 53, 114)
Moreover, “performance” would require infinite
search times. And it would be related to “competence” in purely accidental
ways, just as, in the familiar parable, a roomful of chimpanzees with
typewriters would, in infinite time, write the complete works of Shakespeare.
Such is the proper mathematical meaning of the “infinite”, and it cuts a
theory of language off from all practices.
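The scale of the parable can be made concrete with a rough calculation, offered only as an illustration under stated assumptions and not drawn from the sources cited. It assumes a hypothetical typewriter with 27 equally likely keys (26 letters and a space), keystrokes that are independent of one another, and a short target phrase chosen here purely for convenience; the function name `chance_of_typing` is likewise hypothetical.

```python
from fractions import Fraction


def chance_of_typing(phrase, keys=27):
    """Probability that len(phrase) independent, uniformly random
    keystrokes on a `keys`-key typewriter spell `phrase` exactly."""
    return Fraction(1, keys) ** len(phrase)


# Assumed target: a phrase of 18 characters, spaces included.
target = "to be or not to be"
p = chance_of_typing(target)

# The denominator is the number of equally likely 18-key sequences.
print(f"one chance in {p.denominator}")
```

Even for so short a phrase the odds are one in 27 to the 18th power. Any finite number of attempts leaves the outcome wildly improbable; only the postulate of infinite time makes it all but inevitable, and that postulate is what detaches the account from any actual practice.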
1.19. We can accordingly dismiss the reservation
that descriptive linguistics is “inadequate” because “the corpus of
observed utterances” is “finite” (cf. Chomsky 1957:15; 1965:67). This
reservation holds for every set of observations and every set of data in every
science. Only the finite can be observed; and data are, both by definition and
by etymology, ‘the given’, and can never be other than finite.
1.20. The justified assessment
should be that a language is manifested in a very large but always
finite set of data; and that its system provides for indefinitely larger
sets, which will also be finite at any time. No such set can ever be
completely observed, but due to practical limitations rather than theoretical
principles. Like all scientists who work with such large data sets, linguists
must manage a trade-off between breadth (how much data a theory
can describe) and depth (what degrees of detail and precision the
description can achieve) (3.10ff). Now, if a language were an infinite set, then
its description would entail an infinite breadth that flattens out our depth to
an infinite shallowness, and our description (completed in infinite time, by the
way) would capture only infinitesimal details. In practice, homework linguistics
evaded its own “infinity” postulate by “assuming that the set of
grammatical sentences is somehow given in advance” (e.g. Chomsky 1957:18, 54,
85, 103). Breadth was merely hypothetical, bootstrapped into the theory by
invoking “language universals” “stated only in general linguistic theory
as part of the definition of the notion ‘human language’” (Chomsky 1965:6,
117). Breadth in the practical sense I suggest was left off the agenda, as when
“gross coverage of data” was decried because it does not help a
linguist “learn anything about the principles” (Chomsky 1982:82f).
1.21. We can also dismiss the reservation that
“the corpus of observed utterances” is “accidental”. Every science must
confront the accidental in its data; the role of theory is not to leave real
data aside and invent some data that suits it better, but to stipulate how we
can distinguish between accidents and regularities (3.17). And the crucial
requirement for doing so is to collect and collate data sets as large as current
technologies allow. Of course, the state of technology is itself contingent upon
accidents, e.g., whether funds are allotted for super-colliders in physics or
for space telescopes in astronomy. But the capacity of technology to produce
data has usually been well ahead of the capacity of theory to account for those
data — and nowhere more so than in linguistics today (3.2).
1.22. Moreover, science can enlist technologies
precisely for coping with accidents
in our data, most crucially at frontiers where our theories are still struggling
to distinguish the accidents from the regularities (3.17). The more significant
the potential for accidents, the greater the breadth we should seek, and the
more we should deploy those technologies that increase breadth without
materially decreasing depth. We may thereby push down the significance of any
particular accident (or set of accidents) by reassessing its probability.
Conversely, we may discover regularities when we can inspect a large set of data
where we saw accidents before (cf. 3.8).
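How greater breadth can reassess the weight of a single occurrence may be sketched as follows. This is a minimal illustration under loud assumptions, not a procedure taken from the corpus studies cited: the function name `relative_frequency` is hypothetical, and the two word lists are toy samples built around example (1) solely to show the arithmetic, where real work would use authentic corpus text.

```python
from collections import Counter


def relative_frequency(tokens, word):
    """Share of the running words in `tokens` that are `word`
    (0.0 for an empty sample)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


# Toy samples (assumed for illustration only): a narrow one and a broader one.
small = "the farmer kills the duckling".split()
large = small + (
    "the farmer feeds the duckling and the farmer sells the duckling".split()
)

for word in ("kills", "farmer"):
    print(word, relative_frequency(small, word), relative_frequency(large, word))
```

In the small sample the single occurrence of "kills" accounts for a fifth of the data; in the larger one it accounts for one sixteenth. Meanwhile "farmer", which also occurred just once in the small sample and might have passed for an accident there, recurs three times in the larger one and begins to look like a regularity.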
2. Recovering the dialectic
2.1. The issues raised in the foregoing section indicate
that mainstream linguistics has not managed to capture the dialectical cycle
displayed back in Fig. 1. In descriptive linguistics, the practices have usually
run well ahead of the theories. Numerous steps and strategies actually applied
in fieldwork research were entirely data-driven, and nowhere accounted for in
the sparse linguistic theories of the times. Even Pike’s (1967 [originals
1945-1964]) monumental programme to situate language within a “unified theory of the structure of human behavior”
was fenced within the confines of behaviorism
and ‘unified science’, which hindered him from expounding a unified theory
of meaning (Beaugrande 1991:107-11). More recently, some significant and
original phenomena discovered and described in fieldwork, as in Longacre’s
(1970, 1990) work on “spoken paragraphs” and “storylines”, or in
Grimes’ (1975) work on the “thread of discourse”, were nowhere accredited
in linguistic theory nor mentioned in conventional linguistics textbooks. Either
new terms were coined, such as “staging” and “collateral”; or else
accredited terms were assigned unconventional meanings, as for “predicate”
and “transformation”.
2.2. In generative linguistics, in sharp contrast, the
theories have run far ahead of the practices — so far indeed that practices
seem to have been left behind altogether (Beaugrande 1998). Descriptive
linguistics was sternly rebuked for not being theoretical enough, and, more
specifically, for trying to construct theory out of practice, namely through the
observation and analysis of data (Chomsky 1957). In respect to fieldwork, the
rebuke was patently unfair: no other method can succeed when the linguist has no
prior or outside information about the organisation of a language. What emerges
is of course a theory about that one particular language, not about the
“universal nature” of all languages. But within its modest scope, the theory
has been vigorously tested by data (1.6), and can be retested whenever the data
undergo a substantive increase.
2.3. In generative linguistics, the construction of
theory became independent of the observation and analysis of data; on the
contrary, these methods were expressly declared incapable of producing a theory
(1.11). They could be bypassed precisely because the linguist as native speaker
had so much prior or outside information about the language (1.10). But where
then should the theory come from? In the event, it mostly came, impressively
recast into more technical terminologies, from traditional grammar-books about
that same native language. Thus, the “universality” of “phrase
markers” was asserted, yet the accompanying diagrams displayed some obviously
English-flavoured grammar-book categories like “definite” and “Article”
(e.g. Chomsky 1965:107ff). Long before, Bloomfield (1933:233, 270) had warned against
“linguists taking for granted the universal nature” of the “categories”
of their own “native language”. Now, real prospects arose of “forcing all
languages into the mould of English, just as in earlier periods they were forced
into that of classical Latin” (Hall 1968:53) (1.10). The relatively rigid
word-order of English engendered the theory of “autonomous syntax”. The
absence of a systematic morphology in English led to morphology being left
homeless in generative theory. And so on.
2.4. The dialectical nature of language and discourse
was now thoroughly obscured. Language was not regarded as a theory which
discourse puts into practice, but as a theory about a theory (a meta-theory
about itself) which is independent of practice and indeed disrupted by practice.
Paradoxically, these linguists discredited the data produced by ordinary native
speakers as “fragmentary and deviant”, yet accredited the data invented by
themselves on the grounds of their own competence as native speakers (cf.
1.9ff). The data were invented precisely out of the theory — just the reverse
of descriptive linguistics. Here, Hjelmslev’s vision seems to come alive: a
“linguistic theory” that “cannot be confirmed or invalidated” (1.15).
2.5. Perhaps the most far-reaching implication of this
approach is that the very term “language” no longer refers to what most
people, including most scientists, consider a language. Instead, it refers to a
construct of linguistic theory so strenuously idealised it might ironically
qualify as a Hjelmslevian “language that has never been realised and will
probably never be realised” (1.14). How and why such a construct should
promote the description of the languages that are being realised all around the
world has never been convincingly expounded. Indeed, we could predict some
compelling obstacles against description.
2.6. One obstacle lies in the terminology. The purely
virtual status of “language” as a non-realised system at the centre of
linguistics spreads out into the more specific terms. Such seems to have
occurred with “syntax” as a formal system of rules which determine the
word-order of all “grammatical sentences” in a language. Because real
speakers put words in order for many motives quite unrelated to formal rules,
this “syntax” does not exist in real language (Beaugrande 2000a). Still less
does a “semantics” exist which assumes a fully stable, deterministic meaning
for each expression of a language, whether based upon “meaning postulates”
or “semantic features” (Beaugrande 1984). The virtual, non-existent status
of these two “levels” or “components” of language makes them unsuitable
in principle for the description of authentic data, whence the unquestioned
substitution of non-authentic invented data (cf. 1.10ff).
2.7. The second obstacle is the peculiar meaning
allotted to the term “description”. When the operation of “assigning a
structural description to a sentence” was equated with “generating the
sentence” (Chomsky 1965:9), the formal analysis of data was equated with the
original production of data, despite Chomsky’s denials of doing so. Yet the
categories of that same analysis are utterly insufficient for production, e.g.,
in taking no account of meaning during the “generative” stage. In effect,
this “description” strips the sentence of most of its operational features
and leaves a mere trace — not even a blueprint for the design, let alone a
record of the implementation of the design.
2.8. Evidently, replacing “language” with a virtual
construct leads to replacing “description” with a virtual operation. Here
too is a motive for preferring non-authentic data: they are most amenable to
just such an operation. A “transformational grammar” needs only the
descriptive categories for converting the sentence into another more essential
and general structure (“kernel”, “deep structure” etc.). This operation
does not even describe the given sentence itself, but analyses it away, and
presents once again a structure which requires no confirmation because the
theory had introduced it in the status of an axiom. So the description is
effectively circular in the manner of a foregone conclusion.
2.9. If linguistics is to reinstate language as an
empirical object of study, we must reassert its descriptive heritage and recover
the dialectical interaction between language as theory and discourse as
practice. These two sides must be seen to constitute a dynamic cycle between two
distinct but closely co-ordinated modes of order. The order of language must be
practice-driven and expressly designed to support the theory-driven order of
discourse without fully predetermining it. So far, a large grey area persists
between these two orders, comprising a host of constraints that are more
specific or local than a language yet more general or global than a discourse
(Beaugrande 2000b) (cf. 4.2).
3. The impact of very large corpora
3.1. For practical reasons, much corpus research based
upon fieldwork in the past has had to be content with relatively small amounts
of data. I can discover in that work no theories stipulating just how large a
corpus ought to be; nor would such a theory be particularly relevant or
interesting as long as the fieldworker may have to confront bizarre, fortuitous
circumstances to get data. In his fieldwork on Cantonese in the 1940s, Halliday
made speech recordings on cumbersome wire spools, and the breaking of wires
would frequently damage or destroy his data.4 Improved technologies
have reduced such mechanical dangers, but not the labours of transcribing and
interpreting the data. Voice recognition by computer, now finally achieved, will
help us only for transcribing data in those languages that have already been
described extensively enough to configure the program; and transcribing data is
just a partial step in analysing or interpreting it.
3.2. Today, corpus research has access to very
large corpora of authentic data for a number of languages, and may confidently
foresee many more in the near future. We now face a daunting decision about whether some established theory
and practice of linguistic description will be reapplied to corpus studies; or
whether the foundations of linguistics will be revised in light of corpus
studies (Tognini Bonelli 1996; Sinclair 1999). As we know from the work on
“scientific revolutions” in the philosophy of science since Kuhn (1970), a
theory is not displaced by data alone but only by another theory which handles
more data and extracts new and important insights from data. My own experiences
in corpus research lead me to predict that linguistics should brace itself for a
major scientific revolution or paradigm shift similar to those ensuing upon the
introduction of such technologies as the telescope in astronomy or the
microscope in biology (Sinclair 1994, 1999). As with other technologies, this one wields
the capacity to produce data far ahead of the capacity of our theories to
account for those data (1.21). To
extend the analogy: we are ‘seeing’ phenomena in language which only become
visible through the technology.
3.3. However, the technology also renders visible some far-reaching problems. These problems do not, as has sometimes been argued (e.g.
Widdowson 1991), arise from weaknesses inherent in corpora. Rather, the problems
have been inherent in language research all along but
would hardly be addressed when data were either restricted by the practices of
fieldwork linguistics or else marginalised by the theories of homework
linguistics. Now, corpus research
confronts us with principled questions like these:
What size should a corpus have in order to represent a language?
What is the ratio between quantity and quality of data?
What is the ratio between breadth and depth of description?
What is the ratio between the uniformity and diversity of data?
What is the ratio between regularities and accidents in data?
What is the ratio between grammar and lexicon in a language?
What is the ratio between manifest and underlying organisation of language?
These questions are so intricately related to
each other that discussing any one of them by itself is an uneasy task. Even so,
corpus research should eventually lead us toward some worthwhile answers through
the aid of technology itself (cf. 4.6).
3.4. So our first question concerns the representative size
of a corpus. The notion of an entire language having a quantifiable size
at all hardly seems to figure in modern linguistics.5 It
would of course be moot if language is defined as an “infinite set of
sentences”; but I have tried to show why this definition is invalid (1.18).
3.5.
Once a language is defined as a finite though very large
set of data, and also a system providing for indefinitely larger sets (1.20),
then our question concerns the ratio between the actual size of a corpus and its
potential size. Actual size has been mainly dominated by practical factors. In
early corpus research on computers, when the technologies of memory and
programming were rather limited, a million words seemed an ambitious size. When
the technology advanced, practical motives were again dominant in bumping up the
size to 20 million and then to 200 million in the Collins Birmingham University
International Database (COBUILD) — familiarly called the “Bank of English”
(BoE) — namely, for compiling a new type of data-driven dictionary that soon
became the market standard. Then the corpora themselves were offered on
commercial markets, such as COBUILD on CD-ROM (5 million) and the British
National Corpus (BNC) from Oxford University Press (100 million).
3.6. This dominance of the
practical side was to be expected. Lexicography has traditionally been a
practical enterprise; and theoretical linguistics has focussed far more upon
grammar than on the lexicon (cf. 3.11; 3.23). Even so, practical advances are
still needed for more friendly technology at the users’ end. Direct access to
a corpus via the Internet is subject to multiple disturbances, such as lines
being overloaded, busy, or periodically cut off in mid-operation. A corpus on a
single CD-ROM (like the COBUILD’s) can only hold a modest data set and do
simple searches and calculations. For larger sizes and more complex searches
like the BNC, users work with several CDs on ponderous operating systems like
UNIX or LINUX, and require technical training in mastering systems like the
“Corpus Data Interchange Format” based on “Standard Generalised Markup
Language” (Aston and Burnard 1998).
3.7. But viewed from inside
linguistics, theory is the side where advances are pressingly called for now.
There, size leads us to the further question of the ratio
between quantity and quality of the data. The null hypothesis
would be that beyond some threshold (say, a million words), increases
in size just multiply out in a mechanical proportionality: an item or pattern
appearing once at 1 million words will appear 20 times at 20 million words and
200 times at 200 million words. But this hypothesis could hold only if a
language were so uniform a system that its output hits a definite information
ceiling and its features go asymptotic. Beyond that, quantity would rise whilst
quality remained constant.
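The mechanical proportionality assumed by this null hypothesis can be written out as a minimal sketch in Python. The function name `expected_count` is introduced here purely for illustration; the corpus sizes are the ones cited above.

```python
def expected_count(count_at_base: int, base_size: int, target_size: int) -> float:
    """Scale a count linearly with corpus size, as the null hypothesis assumes."""
    if base_size <= 0:
        raise ValueError("base_size must be positive")
    return count_at_base * target_size / base_size


# An item seen once at 1 million words, projected to the larger corpus sizes.
for size in (1_000_000, 20_000_000, 200_000_000):
    print(size, expected_count(1, 1_000_000, size))
```

Any observed count that departs markedly from such a projection would count as evidence against the null hypothesis, although how large a departure should be deemed significant is left open here.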
3.8. Corpus research, on the contrary, suggests a dialectical
ratio whereby a major rise in quantity brings a rise in quality; so the
language system must be far more diverse than the null hypothesis stipulates.
New
data can reveal previously undetected constraints upon an
apparently unconstrained regularity. For example, most grammars of English,
including the COBUILD Grammar based on a 20-million-word corpus, present
the pattern of Definite Article plus Adjective for referring to a whole class of
people, and declare it “possible to use almost any Adjective this way”
(Sinclair et al. 1990:21f). But Sinclair (1998:86) recently reported
“attitudinal biases and selectional restrictions” in the corpus at 336
million words: the pattern is mainly reserved for “unfortunate” people, such
as the elderly, the injured, the unemployed, the sick,
the aged, the poor, and the handicapped, as in (4).
Fortunate people occurred mainly by way of contrast with the unfortunate, as in
(5-6).
(4) On services to the mentally ill, the
elderly and the handicapped, Mr Cook pledged that Labour would
appoint a minister for community care. (newspaper)
(5) This is a system in which the rich are
cared for and the poor are left to suffer in silence. (newspaper)
(6)
the appeal, especially in Latin countries, is rather to envy the fortunate
than to pity the unfortunate. (Bertrand Russell)
This “attitudinal bias” might be explained from the effect of
depersonalising by omitting a Noun for the Adjective to modify. Such
explanations may not be foreseen or admissible in established linguistic
theories, but could be helpful for fieldwork research as well as ethnography
(1.7), and also in the teaching of English (4.4).
3.9. New insights are both
reassuring and disturbing. Precisely because linguists are stimulated
when new regularities are discovered (1.7), we are troubled by the prospect of
stopping the advance of theory by freezing the size of a corpus for practical or
technological motives. This fate may befall a corpus when a dictionary or reference work
arrives on the market, and the funding agent terminates support. Linguistics
should therefore provide the public in general and user groups in particular
with enough theoretical and practical knowledge to appreciate the dialectical
ratio between quantity and quality. Only then will commercial markets be
impelled to build larger corpora as grounds to claim better products.
3.10. Our next and closely
related question concerns the ratio between breadth and depth of
description (1.20). Whereas fieldwork research managed a balance by sheer
practical diligence in describing authentic recorded data, homework research
sought to appropriate “infinite” breadth and “universal” depth by sheer
theoretical bootstrapping with handfuls of non-authentic invented data (cf.
1.7f; 1.20). So whereas breadth and depth were slowly achieved by fieldworkers
through an arduous progress of small steps, they were swiftly built right into
the theory of “language” by homeworkers.
3.11. Today, the very large
corpus makes unprecedented breadth accessible but not necessarily achievable.
The computer resembles a long ladder on which we are still learning the skills
for scaling the higher levels in language description. Here too, much depends on
how uniform or diverse a language system might be. For a highly uniform system,
a description would have favourable chances to be both complete (total breadth)
and precise (total depth). The closest approximation in actual language research
is in the companion sciences of phonology and phonetics, sharing theory and
practice in impressive accord. But their uniformity is a straightforward
projection from the human vocal apparatus and the phonetic alphabet. In grammar,
uniformity was brightly postulated in theory but never demonstrated in practice.
And in the lexicon, the undeniable diversity has kept many linguists from
undertaking research at all (cf. 3.23).
3.12. Breadth becomes a
virulent issue when we get access to vast quantities of data. Depth becomes
virulent when we must choose among sources for those data. Most descriptions
produced in modern linguistics have been aimed at an entire language, e.g., at
the “set of grammatical sentences somehow given in advance” (1.20). Data
sources were not acknowledged to constitute a problematic factor, least of all
when the data were invented by the linguists. The same depth of description
would be appropriate everywhere, as would the methods for achieving it. In
corpus research, this optimism soon breaks down. A language itself is by no
means uniformly deep; the Number of Nouns is less deep than Definiteness; Polar
Auxiliaries are less deep than Modal Auxiliaries. Reaching one depth is likely
to open a view of still further depths, as when an analysis of the Agency of
Verbs leads to the discovery of constraints on Pronouns as Subjects or Objects
(cf. 3.32ff; 3.44). And the breadth of a deep description, once achieved, may be
hard to determine, e.g., how many Verbs share constraints on their Agency
(3.34).
3.13. By now we are in the midst of probing the ratio
between uniformity and diversity in a language. Here too, linguistic theory has often
inclined to a sharp dualism. Total uniformity was attributed to language,
witness Chomsky’s “completely homogeneous speech-community” (1.9); yet total
diversity was attributed to discourse, witness Saussure’s
“heterogeneous mass of accidental facts” (1.11). And theory nowhere
explained how so extreme a dualism of order and disorder could inhabit
the same system (1.11).
3.14. No doubt the heavy emphasis upon uniformity was intended to accommodate
commonplace notions of science, but failed to recognise the uniqueness of
language as an object of scientific investigation. There, uniformity and
diversity constitute a dynamic dialectic, parallel though not identical
to the dialectic between language and discourse. Every aspect of uniformity in a
language must be designed to sustain diversity (cf. 3.41). In phonology, the
uniformity of phonemes as shared targets underwrites enormous diversity among
acts of pronunciation due to such factors as the age, gender, and emotional
state of speakers, and their regional or educational background. In grammar, the
functions of uniformity differ in modality, owing to the more complex
needs of expressing multiple modes of meaning. And the lexicon of
English — in contrast to many languages — affords fairly modest and sporadic
uniformity, due to its historical and cultural overlayering of extrinsic or
specialised approaches to word-composition, e.g., borrowing roots from Latin and
Greek.
3.15.
Corpus research is now beginning to reveal the significance of the dialectic
between uniformity and diversity. Language is found to be less uniform, and
discourse less diverse, than linguistic theory is wont to assume. The uniformity
of language is designed to generate diversity on-line; and the diversity of
discourse continually refers back to and renews the uniformity of language (cf.
3.41).
3.16. In terms of corpus practice, uniformity may actually
be a drawback. If we are compiling what Sinclair (1999) calls a ‘generic or
reference corpus’ to cover the English language as broadly as possible, then
we must consider how far the newly arriving data appear uniform or diverse
alongside our already acquired data. The
information value of a corpus would not rise significantly from increasing the
store of uniform data of the same type. This problem applies especially to mass
media, such as the plentiful newspapers conveniently posted on the Internet or
made available by direct electronic transmission, like the Sunday Times.
There, the diversity of the data is restricted in being authored by a relatively
small, well-trained group of writers, and being edited by an even smaller group.
I would also point out the massive ballooning of frequencies such as I found in the
BoE in July 1994 for key-words like violence (19,226), kill (51,746), death
(31,013), murder (18,383), rape (5,890), and assault (4,055),6 reflecting the
morbid, voyeuristic interests of mass media more than the frequencies of
authentic English at large.
3.17. Similar factors bear
upon the ratio between regularities and accidents. Once again, linguistic
theory has been largely dualistic: language constituted by regularities insofar
as it can be an object of science; and discourse littered with accidents and
therefore no fit object of science. Before very large corpora became available,
projects for actually demonstrating regularities by means of statistical
frequencies and probability measures were rare and laborious (e.g. Kučera and Francis 1967). Linguists gave reassurances
that “a linguistic observer can describe the speech habits of the community
without resorting to statistics” because “the forms of language” are “rigidly standardized” (Bloomfield 1933:37); or that when
neither “sentences nor any part of them have
ever occurred in any English discourse” or in “the linguistic experience of a speaker”, they are “statistically” all “equally remote” (Chomsky 1957:17). These two reassurances flatly contradicted each other — data
being all highly probable or highly improbable. But neither could be tested
without powerful technology for measuring the ratio between regular and
accidental (1.22).
3.18. The potential roles for
statistics and probabilities are surely due for reassessment now that we have
very large corpora (Halliday 1991, 1992). Finding and counting manifest items
is most tractable, yet least informative. The frequencies of items in a corpus
may give no reliable indication of their functional load in the language system.
Finding exactly 6000 occurrences for of
in the 5-million-word COBUILD Corpus on CD-ROM is not helpful; we need to know
the proportions for each of its multiple functions in combinations. And
combinations too are subject to the ballooning effects I noted a moment ago in
news media. Among the 20,569 occurrences of sex
returned by the BoE in July 1994, I found Sex
Pistols (at 707), sex appeal (at 762), oral sex (at 203), and sex
discrimination (at 209). Such frequencies are not meaningful unless we
can determine how far the occurrences entail the ‘same’ item at all.
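To make the point concrete, the figures just cited can be set against each other in a short Python sketch. The counts are those reported above for the BoE in July 1994; the helper name `shares` is an illustrative assumption, and the four combinations are only the ones listed here, not an exhaustive breakdown.

```python
# Counts reported in 3.18 for the BoE in July 1994.
total_sex = 20_569
combinations = {
    "Sex Pistols": 707,
    "sex appeal": 762,
    "oral sex": 203,
    "sex discrimination": 209,
}


def shares(total: int, counts: dict[str, int]) -> dict[str, float]:
    """Return each combination's share of all occurrences of the key-word."""
    return {phrase: n / total for phrase, n in counts.items()}


for phrase, share in shares(total_sex, combinations).items():
    print(f"{phrase}: {share:.1%}")

# Occurrences already absorbed by just these four combinations.
print(sum(combinations.values()), "of", total_sex)
```

Such a tally only redistributes a raw frequency; deciding whether the combinations count as the ‘same’ item still calls for a functional description.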
3.19. The frequency of manifest
combinations is thus less tractable, but more informative. Corpus
research has devoted much exploration to the typical lexical
combinations called collocations, and the typical grammatical
combinations called colligations.7
Yet typicality is not readily explained in terms of frequency alone. In my
combined 12-million-word corpora of British and American writers, which I shall
be citing further on, among a total of 339 occurrences of the Verb fled
were only 3 of the collocation fled the country. To my
intuition, this combination seems typical even if its frequency and statistical
probability are negligible. It also occurred just once among 99 uses of fled in
the COBUILD on CD-ROM:
(7)
after the collapse of Tsarist authority, opportunists declared an independent
democracy, then a military junta that fled the country. (book)
But I can draw some confirmation from cases where the Verb fled took country names as Direct
Objects: France, Iraq, Kuwait, Croatia, Germany.
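The gap between frequency and typicality can be illustrated with a minimal Python sketch that counts how often a node word is immediately followed by a given word sequence. The function name `collocation_counts` and the crude tokenising pattern are assumptions made for this illustration, not a description of how the BoE or WordPilot software works; the sample text is datum (7).

```python
import re


def collocation_counts(text: str, node: str, following: str) -> tuple[int, int]:
    """Count occurrences of `node`, and those directly followed by `following`."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokeniser (assumption)
    follow = following.lower().split()
    node_hits = 0
    colloc_hits = 0
    for i, token in enumerate(tokens):
        if token == node:
            node_hits += 1
            if tokens[i + 1:i + 1 + len(follow)] == follow:
                colloc_hits += 1
    return node_hits, colloc_hits


sample = ("after the collapse of Tsarist authority, opportunists declared an "
          "independent democracy, then a military junta that fled the country.")
print(collocation_counts(sample, "fled", "the country"))

# Relative frequencies reported in 3.19: 3 of 339 uses, and 1 of 99 uses.
print(f"{3 / 339:.2%}", f"{1 / 99:.2%}")
```

Both proportions are around one percent, which shows why a count alone would not single out this combination as typical.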
3.20. We now come to a truly
daunting question: the ratio between manifest and underlying organisation of
language. Modern linguistics has been postulating an “underlying”
organisation of language all along (e.g. Saussure 1966[1916]:56;
Sapir 1921:144; Bloomfield 1933:225f; Hjelmslev 1969 [1943]:9f;
Chomsky 1965:4f, 10, 18, 22). Among the grandest prospects was that the
“descriptive grammars of diverse languages” will “some day” enable us to
“read from them the great underlying ground plans” (Sapir 1921:144).
Presumably, such “plans” are the goal of work on “linguistic
universals”, but most of that work lacks a secured base in descriptive
grammars.
3.21. Moreover, linguistics
has remained disturbingly evasive about how we can derive the “underlying”
organisation from the manifest organisation. Thus, Chomsky's
provision that “actual data of linguistic performance” would provide
“evidence for determining the correctness of hypotheses about underlying
structure” conflicted with his insistence that “surface structure” is
“unrevealing” and “irrelevant” and “hides underlying distinctions”
(1965:18, 24). With surprising candour, he conceded that his proposed “grammar
does not, in itself, provide any sensible procedure for finding a deep structure
of a given sentence”; and he evaded the whole issue by operating on the
“simplifying and contrary to fact assumption that the underlying basic string is
the sentence” (1965:141, 18).
3.22. Such
evasions readily follow from the already noted tendencies to attribute to
language highly idealised modes of order and to transpose the concept of
language from the particular instance over to a universal abstraction (cf. 1.8,
13, 16, 20; 2.5). Doing so naturally fosters a readiness to see disorder in
manifest data, and hence a reluctance to exploit them in the search for
underlying order (cf. 1.11; 3.12). Instead, artificial modes of order get
borrowed from sources like formal logic or mathematics, which only intensifies
the idealised and abstract nature of “language” (1.8).
3.23. Here, we can highlight
the ratio between grammar and lexicon. Linguistic theory
has long regarded “grammar” as the epicentre of uniformity and regularity
for an entire language and as a home for linguistic universals (compare Saussure
1966 [1916]:133, 152; Sapir 1921:38; Bloomfield 1933:163; Chomsky 1957:56).
In exchange, linguists have long concurred that the lexicon is a mere “list of
basic irregularities” (Bloomfield 1933:274; cf. Sweet 1913:31; Saussure 1966
[1916]:133;
Chomsky 1965:86f, 142, 214, 216). On a smaller scale, this dichotomy re-enacts
the dichotomy between the order of language and the disorder of discourse
(1.11), and again linguistics has chosen order: much work on grammar, little on
lexicon (3.6). Eventually, a homework linguist can baldly announce that
“linguistics is not about language; it is about grammar” (Smith 1984).
3.24. Here
again, linguistic theory should replace the dichotomy with a dialectical
relation, this one co-ordinating grammar and lexicon and constituting the interactive
lexicogrammar, the “semogenic powerhouse of language”.
The two sides differ not in kind, but in degrees of delicacy: lower toward the grammatical side and higher
toward the lexical side. Perhaps the lexicon could be regarded for some purposes
as “most delicate grammar” (Halliday 1961:256; Hasan 1987:184; Cross
1993:199).8
3.25. The interactions of grammar and lexicon are
readily evident from corpus research on colligations
and collocations
in the sense of 3.19. Since these are
defined as typical combinations,
they continually draw our attention toward plausible motives of speakers or
writers for coordinating multiple selections. For example, the English Verb “brook” meaning
“accept, tolerate” usually requires a Negative element (Sinclair 1994), as
in:
(8)
Johnson could not brook appearing to be worsted in argument (Life)
(9)
Bouille rides, with thoughts that do not brook speech. (French)
(10) his work was of a sort
that would brook no negligence (Lady)
This
Verb is infrequently used, and preferentially in solemn language about some weighty
business, as in Shakespearean drama:
(11) This weighty business will not brook delay (Henry VI)
(12) My business cannot brook this dalliance. (Comedy of Errors)
(13) False king, why hast thou broken faith with me,
Knowing
how hardly I can brook abuse? (Henry VI)
This
second constraint is more delicate than the one requiring a Negative, yet more
difficult to define in terms of manifest lexical choices. The weighty
business might be the assassination of a Duke (11), or just the collection of a
debt (12). The weightiness comes in part simply from using brook rather
than, say, allow or tolerate.
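The first and less delicate constraint, the Negative element, lends itself to a rough automatic check, sketched here in Python. The list of Negative items, the window of four words on either side, and the function name `has_negative_near` are all assumptions made for this illustration; treating hardly as a Negative in (13) is likewise an interpretive choice. The second constraint, weightiness, is not captured by any such word list.

```python
import re

# Assumed inventory of Negative elements; a real study would need a fuller list.
NEGATIVES = {"not", "no", "cannot", "never", "hardly"}


def has_negative_near(line: str, node: str = "brook", window: int = 4) -> bool:
    """Report whether a Negative element occurs within `window` words of `node`."""
    tokens = re.findall(r"[a-z']+", line.lower())
    for i, token in enumerate(tokens):
        if token == node:
            context = tokens[max(0, i - window):i + window + 1]
            if NEGATIVES & set(context):
                return True
    return False


# Data (8) to (13), shortened to the clause containing the Verb.
lines = [
    "Johnson could not brook appearing to be worsted in argument",
    "Bouille rides, with thoughts that do not brook speech.",
    "his work was of a sort that would brook no negligence",
    "This weighty business will not brook delay",
    "My business cannot brook this dalliance.",
    "how hardly I can brook abuse?",
]
for line in lines:
    print(has_negative_near(line), line)
```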
3.26. Such data from the lexicogrammar of English
point us toward the immense task of accounting for multiple parameters of
variation in a language: genre, register, and style. In
terms of theory, these constitute intermediary control systems between
the language and the discourse (Beaugrande 1997eh?). Their design must be such
that when one of them is activated, the activation level is raised for
appropriate options and lowered for inappropriate ones (Kintsch 1988; Rumelhart
et al. 1986). In terms of practice, they obviously affect the selections and
combinations we can expect to find in authentic discourse data; but how to
describe those effects is far from clear at this stage.
3.27. Here, we might pursue a strategy of dialectical
resolution: building sub-corpora where
we predict systematic distinctions in quality; and then using our findings to
test and refine our predictions and to assess the typicality of specified data
inventories as indicators of some genre or style (cf. 4.5f). For a brief
demonstration, I shall draw upon three distinctive sources: (a) two corpora of
literature, one by British authors (e.g., Austen, Dickens, Wilde) and one by American authors (e.g., Hawthorne,
Mark Twain, Willa Cather), dating
roughly between 1750 and 1920 and together totalling 8.7 million words; (b)
two corpora of academic and civic writers, again including British (e.g., Darwin,
Bulwer-Lytton, J.S. Mill) and
Americans (e.g., Thomas Jefferson, Jane Addams, W.E.B. DuBois), together totalling 4.8 million words; and (c)
Collins COBUILD on CD-ROM (5 million words), which represents contemporary
everyday usage. The first two sets of corpora, totalling all together 13.5
million words (see Appendix for list of current texts), I compiled myself to run
on WordPilot©, a resource program
developed by John Milton at the Hong Kong University of Science and Technology
(Milton 1999). My compiling too faced fortuitous practical restrictions: I had
to use texts which are in the public domain and can be downloaded from Internet
sites.
3.28.
In sources (a) and (b), the pattern of Definite Article plus Adjective was found
to be more balanced than in the COBUILD data reported in 3.8. The highest
frequency appeared among academic and civic writers, who are logically prone to
classify people. Alongside the contrasts like those noted by Sinclair, e.g.
(14-15), I found many where the fortunate people occurred alone, although
sometimes with the intriguing ironic twist of not being secure in their good
fortune (16-17).
(14) Smile with the simple and feed with the poor?
[…]; let me smile with the wise, and feed with the rich
(Boswell, quoting Samuel Johnson)
(15) None know the unfortunate, and the fortunate
do not know themselves (Poor Richard)
(16) There is always some levelling circumstance that puts down the
overbearing, the strong, the rich, the fortunate,
substantially on the same ground with all others (Emerson)
(17)
the educated see a menace in his [the black man’s] upward
development (W.E.B. DuBois)
If grammar-books describe the pattern as being more general than is
confirmed by contemporary usage in the COBUILD, then perhaps they do so by intuitively
taking academic discourse to be a model of English usage at large.
3.29. On the face of it, dialectical resolution
might look circular: using the type to identify the features of interest, whilst
using those features to identify the type. But text types cannot in theory be
defined through rigorous proof, since in practice most types are defined through
intuitive heuristics by language users. Besides, types are frequently mixed, as
in:
(18) A wedding is a time for merriment and an apt occasion to showcase
age-old traditions in an age where modernity is eroding important aspects of
yesteryear. This much-privy glimpse of Arabia was a re-enacted wedding ceremony
of the indigenous people, reflecting the timeless beauty and simplicity of
Arabia's life-styles, customs and unique identity until the ’70s oil-boom
brought in dramatic socio-economic development. (Khaleej
Times)
Such discourse briskly mixes the styles of solemnity (merriment,
yesteryear),
social science (modernity,
indigenous, identity, socio-economic development),
and tourism (age-old, timeless beauty and simplicity, life-styles),
along with the occasional solecism (much-privy glimpse). The mix reflects
multiple goals, such as disguising a tourist trap as a cultural site whilst
flattering the readers’ command of an educated variety of English here in the Gulf
States.
3.30. Another
strategy might be for us to create local regions of substantial depth by
describing narrow data sets with some thoroughness. The resulting insights might
then be projected across broader sets and guide our selection of aspects and
features to investigate. For example, the COBUILD data at 20 million words
showed a Verb like elude being used only in the Active (cf. Sinclair et
al. 1990:407), e.g.:
(19)
Newer techniques, such as bone-scanning and ultrasound, have enabled us to find
more of the causes of back-pain, but a large number still elude us
(magazine)
(20)
Sylvie Guillem as Nikiya gave us her faultless technique and musicality,
although the spirituality of the role so far eludes
her (newspaper)
In
my literary and academic corpora I found ‘elude’ in the Passive just six
times, as in:
(21)
My importunities would not now be eluded (Wieland)
(22)
they lessen the consumption; the collection is eluded; and the product to
the treasury is not so great (FedPap)
The
meaning for data like (19-20) is roughly: some knowledge or skill would be
fitting but is not found. The meaning for data like (21-22) is more like: some
people find ways of avoiding something. The Passive does seem to me
intuitively old-fashioned; and Passive versions of these Actives seem utterly
improbable:
(19a)
? ?we are eluded by a large number of the causes of back-pain
(20a)
? ?Sylvie Guillem is eluded so far by
the spirituality of the role
3.31. Now, to increase the depth of our analysis of elude,
we can examine some typical collocations
and colligations. Among the Nouns as Direct Objects, the collocations
noticeably clustered around vigilance, which
occurred in 9 uses, e.g. (23), along with associates like observation
(24), eyes (25), and glance (26).
(23)
Nelson feared the more that this Frenchman might get out and elude his vigilance
(Nelson)
(24)
I had not neglected precautions to secure my personal safety, if I could only elude
observation. (Eyre)
(25)
That I could elude Rima’s keener eyes I doubted (Mansions)
(26) Hare’s fateful glance, impossible to elude
(Desert)
Other
typical collocates included grasp (6 uses), e.g. (27), and pursuit
(4 uses), e.g. (28).
(27) the maiden eluded the grasp of the savage (Last)
(28) I stopped at one or two stands of coaches
to elude pursuit (Wrongs)
The meanings of all these collocations involve two opposing agencies,
one of them seeking to elude the other and the potential consequences.
3.32. Among the colligations, the most striking one by far
was a marked preference for Personal Pronouns as Direct Objects. Of the 17
occurrences in COBUILD data, 13 showed this colligation, as in (19-20). Other
examples included:
(29) he defines his essential position, as a man in permanent search of a God who eludes him