for WORD
ROBERT de BEAUGRANDE
Descriptive linguistics at the millennium:
Corpus data as authentic language
In the best sense of the word, descriptive linguistics must be practical,
[…] designed to handle instances of speech, spoken or written
J.R. Firth
1. Theory and practice in the concept of description
1.1. If we agree to use our terms quite broadly, we
can define a language to be a general theory of human knowledge and
experience, and discourse to be the set of practices for working out the
theory (cf. Sapir 1921; Hartmann 1963; Halliday 1994). Language would be a
theory — or a whole network of criss-crossing ‘theories’ — for
representing our world and ourselves and each other in the world, and for
constructing alternative states of the world or alternative worlds. We
understand each other insofar as our theories of our language are similar in
principle and get more finely tuned during discourse (Beaugrande 1997a).
1.2. The relations between theory and practice would logically constitute a dialectic,
being an interactive cycle wherein two sides guide or control each other. When
the dialectic is working smoothly, the practice is theory-driven, and the theory
is practice-driven; the theory predicts and accounts for the practice; and the
practice specifies and implements the theory. The
real-life practices of discourse are strongly ‘theory-driven’ in obliging
the participants to ‘theorise’ about what words mean, what people intend,
what makes sense, and so on. Indeed, discourse is the most theoretical practice
humans can perform, and also the most efficient and effective in using the least
effort for the most goals. In return, language is the most practical theory
humans can devise, offering the resources to shape and guide almost any of our
practical activities.
1.3. Yet the ‘theoreticalness’ of language is dexterously concealed from the
majority of speakers who practise it. If asked, they would probably describe
discourse as a thoroughly practical matter; they would be surprised if we told
them they possess a ‘theory of their language’ that gives them the status of
‘theoreticians’. No doubt the theory can be practised so efficiently because
many operations function below the level of conscious awareness; in return, the
nature and organisation of the theory are difficult to determine or describe by
means of introspection alone (but cf. 1.8ff; 3.36f; 4.4).
1.4. Moreover, a language is a unique type of theory. It
cannot be conclusively verified or falsified in the conventional manner of a
scientific theory, because we cannot adduce some language-independent testing
grounds, such as a set of free-standing meanings for which the language could be
judged a valid or invalid expression. Instead, language is a theory that
partially creates and constitutes what it postulates, and thus tends to confirm
itself. For practical purposes, we normally take things to be what our language
calls them. When we wish to express them more validly, we can practise our
language more elaborately; we cannot suspend its practices and go to meanings or
things without it. We cannot get outside language to inspect it.
1.5. By the definitions proposed above, a ‘theory of
language’ expounded in modern linguistics would more precisely be termed a meta-theory,
whereas the discourse we produce to expound the theory would manifest our own meta-practices.
“The constructs or schemata of linguistics” could thus be described as
“language turned back on itself” (Firth 1957 [1950]: 190). This convolution
renders linguistics unique among the sciences. We set about formulating an explicit
theory of language whilst we already sustain an implicit theory
as language; and our formulations are instances of practising the latter
theory. Moreover, every explicit theory proposed so far undoubtedly falls far
short of the richness and complexity of the implicit theory, though we may not
be able to demonstrate just how.
1.6. Modern linguistics might in turn be characterised
as a set of projects for rendering explicit the implicit ‘theoreticalness’
of language. Yet linguistics has been signally undecided about deriving its
theories dialectically from the description of the ordinary practices of text
and discourse. The most resolute position has been adopted in fieldwork
linguistics. Providing descriptions of previously undescribed languages is
by necessity practice-driven, since data in and about the language must come
from observing the practices of native speakers. In addition, the fieldworker
must subject every step in the theorising about the language to practical tests
with informants. Achieving a reasonable fluency in the language demonstrates a
practical competence that should plausibly enhance the authority of one’s
theoretical statements.
1.7. Still, fieldwork is theory-driven in its own ways.
The linguist holds a general conception about possible types of language, e.g.,
whether one is “analytic” like Annamite of Vietnam, or “polysynthetic”
like Yana of California (Sapir 1921:142). The type is a high-level meta-theory
directing attention to certain classes of features or patterns, such as
“reduplication” to “indicate such concepts as distribution, plurality,
repetition, customary activity, increase in size” or “intensity” (Sapir
1921:76). But the fieldwork linguist is always stimulated upon discovering some
previously unknown feature or aspect, e.g., when Dyirbal of North Queensland
was found to have a separate Dyalŋuy variety or dialect used only in the hearing
of taboo relatives like a man’s mother-in-law or a woman’s father-in-law
(Dixon 1968). Such discoveries are also of interest to neighbouring disciplines
in the social sciences of sociology, anthropology, and ethnography (cf. 3.8;
3.40).
1.8. The opposite approach commonly goes by the name of
‘theoretical linguistics’ but might, for the present discussion, be more
aptly called homework linguistics.1 It is heavily
theory-driven, and presents invented data from well-described languages, notably
English, of which the linguists are fluent or native speakers from the start.
Instead of deriving the theory of a particular language dialectically by
describing its practices, ‘homeworkers’ derive a theory of language in
general by a theoretical bootstrapping that combines their own intuition and introspection
with conceptions sporadically borrowed from language philosophy, formal logic,
or mathematics (cf. 3.22). The standards of science are to be upheld by
‘theorising’ the more practical and ordinary qualities out of language. The
most scientific
statements should describe ‘language’ in the most abstract and general
sense, and ultimately in terms of ‘linguistic universals’ (cf. 1.16, 20).
1.9. The decisive step in this outlook was to “give priority to
introspective evidence” and “intuition” (Chomsky 1965:20). The homework
linguist was now said to command an “enormous mass of
unquestionable data” merely by virtue of holding the “linguistic
intuition of the native speaker”; and precisely for these “data”, a
“description, and, where possible, an explanation” were to be “constructed”
(1965:20). The linguist would apparently become the representative of the “ideal
speaker-hearer in a completely homogeneous speech-community, who knows its
language perfectly” (Chomsky 1965:4) (1.13). Yet to discredit fieldwork with
informants, homework linguists felt impelled to deny that the “speaker of a
language”, who has “mastered and internalised a generative grammar, is aware
of the rules of the grammar or even” “can become aware of them”; and that
“his statements about his intuitive knowledge are necessarily accurate”,
since “a speaker’s reports and viewpoints about his behaviour and competence
may be in error” (1965:8). These denials should cast serious doubts upon
authorising linguists to act as model “speakers”, unless their academic
training and status grant them super-human powers of introspection (1.12; 3.36).
But then they would be patently untypical and unsuited as models of a
“completely homogeneous speech-community”.
1.10. Such perplexing lines of argument might help to
explain why homework linguists have so often used data from a well-described
language like English, quite apart from their being native speakers. They could presuppose
extensive information about the language and did not have to supply it. They
could exploit their own intuition and introspection to swiftly elevate their
deliberations up beyond the laborious problems of fieldwork in order to address
purely theoretical rather than practical issues: theory becomes meta-theory, or,
in the terms proposed here, meta-meta-theory; and their discourse on language
manifests not just meta-language but meta-meta-language. So the discussion
naturally seeks illustrations in invented data whose status seems so secure as
to camouflage the role of the linguist as inventor, e.g.:
(1) The farmer kills the duckling (Sapir)
(2) John ran away (Bloomfield)
(3) The man hit the ball (Chomsky)
Paradoxically,
such data were invented to seem incontestable, yet they can be empirically
classified as non-authentic insofar as they do not spontaneously occur in
ordinary discourse.2 Nonetheless, these same
data, accompanied by rather cursory descriptions, have often been adduced to
support general statements about the nature of language, e.g., that “word
order is unquestionably an abstract entity” (Saussure) or that “grammar
is autonomous and independent of meaning” (Chomsky). The essential
paradox thus consists of basing a general theory upon special cases by expressly
selecting data devoid of special features (cf. 4.2).
1.11. Moreover, non-authentic data represent an
unannounced compromise between “langue and parole”, or “competence and
performance”, which homework linguistics has separated by a radical dichotomy.
Saussure had roundly asserted that “speech cannot be studied”,
“for we cannot discover its unity”; it is only a “heterogeneous mass” of
“accessory and accidental facts” (1966 [1916]:9, 11) (cf. 1.21f; 3.13;
3.17). In the same vein, Chomsky (1965:4, 201) asserted that the “observed use
of language” “surely cannot constitute the subject-matter of linguistics, if
this is to be a serious discipline”; “from the standpoint of the theory”,
“much of the actual speech observed consists of fragments and deviant
expressions of a variety of sorts”. Such pronouncements suggest that
authentic data do not practise the theory of a language, but seriously disrupt it.
The production of such data would resemble a catastrophic phase transition from
the extreme order of language over to the extreme disorder of discourse. The
speaker takes order, transforms it into disorder and transmits it to the hearer,
who transforms it back into order. Made explicit, this account of the relation
between language and discourse is obviously unsustainable.
1.12. In parallel, homework linguists announced that “the concrete entities of language are not
directly accessible” (Saussure 1966 [1916]:110); and that “knowledge of the
language” is “neither presented for direct observation nor extractable from
data by inductive procedures of any known sort” (Chomsky 1965:18). These
claims too were meant to discredit fieldwork linguistics. But they also imply an
unsustainable
account of native-language learning, namely struggling against the grain of what
a child can “access and observe” — which is “fragmentary and
deviant” anyway. This implication presumably helped to garner
support for the universalist notion of an “innate language acquisition
device” (Beaugrande 1997b, 1998a).
1.13. Once “actual speech” has
been declared “heterogeneous” and “deviant”, the linguist can proceed to
invent non-authentic data which
have been quietly rendered homogeneous and purified of all deviance. Similarly,
if language is represented as an abstract, ideal system, then it is most
expediently exemplified by idealised data. By implication, homework linguists do not represent ordinary speakers
in real life, but rather “ideal” super-speakers who, thanks to their “perfect
knowledge”, can practise
the language with far greater unity and purity (cf. 1.9).
1.14. The perplexities implied for linguistic description
became most virulent in Hjelmslev’s “prolegomena to a theory of language”.3
Though acknowledging that
“the linguist who describes a language” “uses that language in the
description”, he issued a plea to “rise above the level of mere primitive
description to that of a systematic, exact, and generalizing science, in the
theory of which all events (possible combinations of elements) are foreseen”
(1969 [1943]:9, 121). The “theory” would be “applicable even to texts and
languages” that have “never been realised, and some of which will probably
never be realised” (1969:17). This startling project would be the linguists’
equivalent of a theory of everything, or the grand unification theory currently
much sought in physics. “The linguistic theoretician” proceeds to
“discover certain properties present in all those objects that people agree to
call languages, in order then to generalise those properties and establish them
by definition”; by doing so “he decrees to which objects his theory can and
cannot be applied” (1969:18). Such a “linguistic theory” “provides the
tools for describing” “a given text and language”, and “cannot be
verified — confirmed or invalidated — by reference to existing texts and
languages” (1969:18).
1.15. If these methods were literally adopted, the
linguist must examine all the world’s “languages” in the ordinary sense
(that “people agree” about) and construct the theory solely out of those
“properties” that have in fact been “discovered” everywhere. Then, it
would trivially, indeed automatically apply to all languages without requiring
any “decree”, “verification”, or “confirmation”. Yet the set of
properties would undoubtedly be far too small, abstract, and general to
“provide tools for describing a text” (4.5). One could only describe the
features that the text shares with every other text in every language, including
languages that don’t exist and never will — an esoteric exercise, to put it
mildly.
1.16. When Saussure had earlier counselled “the linguist” to “acquaint himself
with the greatest possible number of languages in order to determine what is
universal in them”, he had surmised that “the diversity of idioms hides a profound
unity”, and that “all idioms embody certain fixed principles that the
linguist meets again and again” (1966 [1916]:23, 99). But he had conceded that
“it is very
difficult to command scientifically such different languages”; and wryly
concluded, with immense understatement, that “the ideal, theoretical form of a
science is not always the one imposed upon it by the exigencies of practice”
(1966:99). Not so Hjelmslev, who conjured the ideal whereby “mere primitive
description” would be replaced by “self-consistent and exhaustive
description” (1969:9, 18). To judge from his published work, he never tried to
present such a description of any text, and so did not confront its
impracticability as a method.
1.17. To include all non-existent, merely
“possible” languages, the set of languages to which Hjelmslev’s
“theory” could apply would be infinite; as a corollary, so too would be the
set of “texts” to be “described”. If so, the results of describing a
text or a set of texts would always seem too restricted to claim genuine
significance — just as homework linguists of the generative school would
predict anyway (1.20). Yet, again by implication, the processes of comprehending
a text would be infinite as well, which is blatantly false. Here, we see how far
the requirements placed upon description vastly overreach actual language, even
though, as I suggested, the theory falls far short (cf. ¶ 1.5). In parallel,
the “competence” and “perfect knowledge” of the “ideal
speaker-hearer” (1.9) vastly overreach the performance and knowledge of real
speakers. Both overreachings render homework linguistics empirically vacuous:
striving to describe everything at once and not describing anything.
1.18. I would argue this point just as emphatically
for the definition of “language” as an “infinite set of sentences” (e.g.
Chomsky 1957:13), presumably calculated to suggest that the description of data
was not merely impracticable, but incapable in principle of ever leading to a
theory of language (or to a “grammar”). Yet an “infinite set” would
contain every conceivable sentence, including the most flagrantly improbable
ones offered as counter-examples (like “colourless green ideas sleep
furiously”). The paradoxes of the infinite inhabit imaginative prose, such as
that of Jorge Luis Borges. In his infinite library:
For every line of straightforward statement, there are leagues of senseless
cacophonies, verbal jumbles, and incoherences. […] Homer composed the Odyssey;
if we postulate an infinite period of time and infinite circumstances, the
impossible thing is not to compose the Odyssey (Borges 1964: 53, 114)
Moreover, “performance” would require infinite
search times. And it would be related to “competence” in purely accidental
ways, just as, in the familiar parable, a roomful of chimpanzees with
typewriters would, in infinite time, write the complete works of Shakespeare.
Such is the proper mathematical meaning of the “infinite”, and it cuts a
theory of language off from all practices.
1.19. We can accordingly dismiss the reservation
that descriptive linguistics is “inadequate” because “the corpus of
observed utterances” is “finite” (cf. Chomsky 1957:15; 1965:67). This
reservation holds for every set of observations and every set of data in every
science. Only the finite can be observed; and data are, both by definition and
by etymology, ‘the given’, and can never be other than finite.
1.20. The justified assessment
should be that a language is manifested in a very large but always
finite set of data; and that its system provides for indefinitely larger
sets, which will also be finite at any time. No such set can ever be
completely observed, but due to practical limitations rather than theoretical
principles. Like all scientists who work with such large data sets, linguists
must manage a trade-off between breadth (how much data a theory
can describe) and depth (what degrees of detail and precision the
description can achieve) (3.10ff). Now, if a language were an infinite set, then
its description would entail an infinite breadth that flattens out our depth to
an infinite shallowness, and our description (completed in infinite time, by the
way) would capture only infinitesimal details. In practice, homework linguistics
evaded its own “infinity” postulate by “assuming that the set of
grammatical sentences is somehow given in advance” (e.g. Chomsky 1957:18, 54,
85, 103). Breadth was merely hypothetical, bootstrapped into the theory by
invoking “language universals” “stated only in general linguistic theory
as part of the definition of the notion ‘human language’” (Chomsky 1965:6,
117). Breadth in the practical sense I suggest was left off the agenda, as when “gross coverage of data” was decried because it does not help a
linguist “learn anything about the principles” (Chomsky 1982:82f).
1.21. We can also dismiss the reservation that
“the corpus of observed utterances” is “accidental”. Every science must
confront the accidental in its data; the role of theory is not to leave real
data aside and invent some data that suits it better, but to stipulate how we
can distinguish between accidents and regularities (3.17). And the crucial
requirement for doing so is to collect and collate data sets as large as current
technologies allow. Of course, the state of technology is itself contingent upon
accidents, e.g., whether funds are allotted for super-colliders in physics or
for space telescopes in astronomy. But the capacity of technology to produce
data has usually been well ahead of the capacity of theory to account for those
data — and nowhere more so than in linguistics today (3.2).
1.22. Moreover, science can enlist technologies
precisely for coping with accidents
in our data, most crucially at frontiers where our theories are still struggling
to distinguish the accidents from the regularities (3.17). The more significant
the potential for accidents, the greater the breadth we should seek, and the
more we should deploy those technologies that increase breadth without
materially decreasing depth. We may thereby push down the significance of any
particular accident (or set of accidents) by reassessing its probability.
Conversely, we may discover regularities when we can inspect a large set of data
where we saw accidents before (cf. 3.8).
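The reasoning in 1.22 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two toy "corpora" are invented here purely to show the arithmetic and are in no way authentic data, and the helper name bigram_rates is hypothetical. The sketch measures each adjacent word pair as a share of all pairs, so that a one-off slip and a recurrent pattern can be compared as the data set grows.

```python
from collections import Counter


def bigram_rates(tokens):
    """Return each adjacent word pair's share of all adjacent pairs in `tokens`."""
    pairs = list(zip(tokens, tokens[1:]))
    total = len(pairs)
    return {pair: count / total for pair, count in Counter(pairs).items()}


# Hypothetical toy data, invented purely for illustration (not authentic corpus data).
recurrent = "the data of course show the pattern".split()
one_off = "of coarse".split()  # a single slip, standing in for an 'accident'

small_corpus = recurrent * 5 + one_off
large_corpus = recurrent * 500 + one_off

small_rates = bigram_rates(small_corpus)
large_rates = bigram_rates(large_corpus)

for label, rates in (("small", small_rates), ("large", large_rates)):
    regular = rates[("of", "course")]
    accident = rates[("of", "coarse")]
    print(f"{label}: regular={regular:.4f} accident={accident:.6f}")
```

Under these assumptions, the share of the recurrent pair stays roughly constant as the data set grows, whereas the share of the one-off pair shrinks in proportion to the size of the corpus. Real corpus work would of course need far more careful sampling and statistics than this sketch suggests.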
2. Recovering the dialectic
2.1. The issues raised in the foregoing section indicate
that mainstream linguistics has not managed to capture the dialectical cycle
displayed back in Fig. 1. In descriptive linguistics, the practices have usually
run well ahead of the theories. Numerous steps and strategies actually applied
in fieldwork research were entirely data-driven, and nowhere accounted for in
the sparse linguistic theories of the times. Even Pike’s (1967 [originals
1945-1964]) monumental programme to situate language within a “unified theory of the structure of human behavior”
was fenced within the confines of behaviorism
and ‘unified science’, which hindered him from expounding a unified theory
of meaning (Beaugrande 1991:107-11). More recently, some significant and
original phenomena discovered and described in fieldwork, as in Longacre’s
(1970, 1990) work on “spoken paragraphs” and “storylines”, or in
Grimes’ (1975) work on the “thread of discourse”, were nowhere accredited
in linguistic theory nor mentioned in conventional linguistics textbooks. Either
new terms were coined, such as “staging” and “collateral”; or else
accredited terms were assigned unconventional meanings, as for “predicate”
and “transformation”.
2.2. In generative linguistics, in sharp contrast, the
theories have run far ahead of the practices — so far indeed that practices
seem to have been left behind altogether (Beaugrande 1998). Descriptive
linguistics was sternly rebuked for not being theoretical enough, and, more
specifically, for trying to construct theory out of practice, namely through the
observation and analysis of data (Chomsky 1957). In respect to fieldwork, the
rebuke was patently unfair: no other method can succeed when the linguist has no
prior or outside information about the organisation of a language. What emerges
is of course a theory about that one particular language, not about the
“universal nature” of all languages. But within its modest scope, the theory
has been vigorously tested by data (1.6), and can be retested whenever the data
undergo a substantive increase.
2.3. In generative linguistics, the construction of
theory became independent of the observation and analysis of data; on the
contrary, these methods were expressly declared incapable of producing a theory
(1.11). They could be bypassed precisely because the linguist as native speaker
had so much prior or outside information about the language (1.10). But where
then should the theory come from? In the event, it mostly came, impressively
recast into more technical terminologies, from traditional grammar-books about
that same native language. Thus, the “universality” of “phrase
markers” was asserted, yet the accompanying diagrams displayed some obviously
English-flavoured grammar-book categories like “definite” and “Article”
(e.g. Chomsky 1965:107ff). Long before, Bloomfield (1933:233, 270) had warned against
“linguists taking for granted the universal nature” of the “categories”
of their own “native language”. Now, real prospects arose of “forcing all
languages into the mould of English, just as in earlier periods they were forced
into that of classical Latin” (Hall 1968:53) (1.10). The relatively rigid
word-order of English engendered the theory of “autonomous syntax”. The
absence of a systematic morphology in English led to morphology being left
homeless in generative theory. And so on.
2.4. The dialectical nature of language and discourse
was now thoroughly obscured. Language was not regarded as a theory which
discourse puts into practice, but as a theory about a theory (a meta-theory
about itself) which is independent of practice and indeed disrupted by practice.
Paradoxically, these linguists discredited the data produced by ordinary native
speakers as “fragmentary and deviant”, yet accredited the data invented by
themselves on the grounds of their own competence as native speakers (cf. ¶
1.9ff). The data were invented precisely out of the theory — just the reverse
of descriptive linguistics. Here, Hjelmslev’s vision seems to come alive: a
“linguistic theory” that “cannot be confirmed or invalidated” (1.15).
2.5. Perhaps the most far-reaching implication of this
approach is that the very term “language” no longer refers to what most
people, including most scientists, consider a language. Instead, it refers to a
construct of linguistic theory so strenuously idealised it might ironically
qualify as a Hjelmslevian “language that has never been realised and will
probably never be realised” (1.14). How and why such a construct should
promote the description of the languages that are being realised all around the
world has never been convincingly expounded. Indeed, we could predict some
compelling obstacles against description.
2.6. One obstacle lies in the terminology. The purely
virtual status of “language” as a non-realised system at the centre of
linguistics spreads out into the more specific terms. Such seems to have
occurred with “syntax” as a formal system of rules which determine the
word-order of all “grammatical sentences” in a language. Because real
speakers put words in order for many motives quite unrelated to formal rules,
this “syntax” does not exist in real language (Beaugrande 2000a). Still less
does a “semantics” exist which assumes a fully stable, deterministic meaning
for each expression of a language, whether based upon “meaning postulates”
or “semantic features” (Beaugrande 1984). The virtual, non-existent status
of these two “levels” or “components” of language makes them unsuitable
in principle for the description of authentic data, whence the unquestioned
substitution of non-authentic invented data (cf. 1.10ff).
2.7. The second obstacle is the peculiar meaning
allotted to the term “description”. When the operation of “assigning a
structural description to a sentence” was equated with “generating the
sentence” (Chomsky 1965:9), the formal analysis of data was equated with the
original production of data, despite Chomsky’s denials of doing so. Yet the
categories of that same analysis are utterly insufficient for production, e.g.,
in taking no account of meaning during the “generative” stage. In effect,
this “description” strips the sentence of most of its operational features
and leaves a mere trace — not even a blueprint for the design, let alone a
record of the implementation of the design.
2.8. Evidently, replacing “language” with a virtual
construct leads to replacing “description” with a virtual operation. Here
too is a motive for preferring non-authentic data: they are most amenable to
just such an operation. A “transformational grammar” needs only the
descriptive categories for converting the sentence into another more essential
and general structure (“kernel”, “deep structure” etc.). This operation
does not even describe the given sentence itself, but analyses it away, and
presents once again a structure which requires no confirmation because the
theory had introduced it in the status of an axiom. So the description is
effectively circular in the manner of a foregone conclusion.
2.9. If linguistics is to reinstate language as an
empirical object of study, we must reassert its descriptive heritage and recover
the dialectical interaction between language as theory and discourse as
practice. These two sides must be seen to constitute a dynamic cycle between two
distinct but closely co-ordinated modes of order. The order of language must be
practice-driven and expressly designed to support the theory-driven order of
discourse without fully predetermining it. So far, a large grey area persists
between these two orders, comprising a host of constraints that are more
specific or local than a language yet more general or global than a discourse
(Beaugrande 2000b) (cf. 4.2).
3. The impact of very large corpora
3.1. For practical reasons, much corpus research based
upon fieldwork in the past has had to be content with relatively small amounts
of data. I can discover in that work no theories stipulating just how large a
corpus ought to be; nor would such a theory be particularly relevant or
interesting as long as the fieldworker may have to confront bizarre, fortuitous
circumstances to get data. In his fieldwork on Cantonese in the 1940s, Halliday
made speech recordings on cumbersome wire spools, and the breaking of wires
would frequently damage or destroy his data.4 Improved technologies
have reduced such mechanical dangers, but not the labours of transcribing and
interpreting the data. Voice recognition by computer, now finally achieved, will
help us only for transcribing data in those languages that have already been
described extensively enough to configure the program; and transcribing data is
just a partial step in analysing or interpreting it.
3.2. Today, corpus research has access to very
large corpora of authentic data for a number of languages, and may confidently
foresee many more in the near future. We now face a daunting decision about whether some established theory
and practice of linguistic description will be reapplied to corpus studies; or
whether the foundations of linguistics will be revised in light of corpus
studies (Tognini Bonelli 1996; Sinclair 1999). As we know from the work on
“scientific revolutions” in the philosophy of science since Kuhn (1970), a
theory is not displaced by data alone but only by another theory which handles
more data and extracts new and important insights from data. My own experiences
in corpus research lead me to predict that linguistics should brace itself for a
major scientific revolution or paradigm shift similar to those ensuing upon the
introduction of such technologies as the telescope in astronomy or the
microscope in biology (Sinclair 1994, 1999). As with other technologies, this one wields
the capacity to produce data far ahead of the capacity of our theories to
account for those data (1.21). To
extend the analogy: we are ‘seeing’ phenomena in language which only become
visible through the technology.
3.3. However, the technology also renders visible some far-reaching problems. These problems do not, as has sometimes been argued (e.g.
Widdowson 1991), arise from weaknesses inherent in corpora. Rather, the problems
have been inherent in language research all along but
would hardly be addressed when data were either restricted by the practices of
fieldwork linguistics or else marginalised by the theories of homework
linguistics. Now, corpus research
confronts us with principled questions like these:
What size should a corpus have in order to represent a language?
What is the ratio between quantity and quality of data?
What is the ratio between breadth and depth of description?
What is the ratio between the uniformity and diversity of data?
What is the ratio between regularities and accidents in data?
What is the ratio between grammar and lexicon in a language?
What is the ratio between manifest and underlying organisation of language?
These questions are so intricately related to
each other that discussing any one of them by itself is an uneasy task. Even so,
corpus research should eventually lead us toward some worthwhile answers through
the aid of technology itself (cf. 4.6).
3.4. So our first question concerns the representative size
of a corpus. The notion of an entire language having a quantifiable size
at all hardly seems to figure in modern linguistics.5 It
would of course be moot if language is defined as an “infinite set of
sentences”; but I have tried to show why this definition is invalid (1.18).
3.5.
Once a language is defined as a finite though very large
set of data, and also a system providing for indefinitely larger sets (1.20),
then our question concerns the ratio between the actual size of a corpus and its
potential size. Actual size has been mainly dominated by practical factors. In
early corpus research on computers, when the technologies of memory and
programming were rather limited, a million words seemed an ambitious size. When
the technology advanced, practical motives were again dominant in bumping up the
size to 20 million and then to 200 million in the Collins Birmingham University
International Database (COBUILD) — familiarly called the “Bank of English”
(BoE) — namely, for compiling a new type of data-driven dictionary that soon
became the market standard. Then the corpora themselves were offered on
commercial markets, such as COBUILD on CD-ROM (5 million) and the British
National Corpus (BNC) from Oxford University Press (100 million).
3.6. This dominance of the
practical side was to be expected. Lexicography has traditionally been a
practical enterprise; and theoretical linguistics has focussed far more upon
grammar than on the lexicon (cf. 3.11; 3.23). Even so, practical advances are
still needed for more friendly technology at the users’ end. Direct access to
a corpus via the Internet is subject to multiple disturbances, such as lines
being overloaded, busy, or periodically cut off in mid-operation. A corpus on a
single CD-ROM (like the COBUILD’s) can only hold a modest data set and do
simple searches and calculations. For larger sizes and more complex searches
like the BNC, users work with several CDs on ponderous operating systems like
UNIX or LINUX, and require technical training in mastering systems like the
“Corpus Data Interchange Format” based on “Standard Generalised Markup
Language” (Aston and Burnard 1998).
3.7. But viewed from inside
linguistics, theory is the side where advances are pressingly called for now.
There, size leads us to the further question of the ratio
between quantity and quality of the data. The null hypothesis
would be that beyond some threshold (say, a million words), increases
in size just multiply out in a mechanical proportionality: an item or pattern
appearing once at 1 million words will appear 20 times at 20 million words and
200 times at 200 million words. But this hypothesis could hold only if a
language were so uniform a system that its output hits a definite information
ceiling and its features go asymptotic. Beyond that, quantity would rise whilst
quality remained constant.
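The null hypothesis amounts to simple arithmetic, which can be made explicit in a minimal Python sketch. The function name expected_frequency is an illustrative assumption of mine, not part of any corpus tool; the figures are the ones just cited.

```python
def expected_frequency(observed, observed_size, target_size):
    """Frequency predicted at target_size if counts grow in strict
    proportion to corpus size (the null hypothesis sketched above)."""
    return observed * target_size / observed_size

# One occurrence at 1 million words, projected to the larger corpus sizes.
for size in (1_000_000, 20_000_000, 200_000_000):
    print(size, expected_frequency(1, 1_000_000, size))
```

Any attested frequency that departs markedly from such a projection would count against strict proportionality.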
3.8. Corpus research, on the contrary, suggests a dialectical
ratio whereby a major rise in quantity brings a rise in quality; so the
language system must be far more diverse than the null hypothesis stipulates.
New
data can reveal previously undetected constraints upon an
apparently unconstrained regularity. For example, most grammars of English,
including the COBUILD Grammar based on a 20-million-word corpus, present
the pattern of Definite Article plus Adjective for referring to a whole class of
people, and declare it “possible to use almost any Adjective this way”
(Sinclair et al. 1990:21f). But Sinclair (1998:86) recently reported
“attitudinal biases and selectional restrictions” in the corpus at 336
million words: the pattern is mainly reserved for “unfortunate” people, such
as the elderly, the injured, the unemployed, the sick,
the aged, the poor, and the handicapped, as in (4).
Fortunate people occurred mainly by way of contrast with the unfortunate, as in
(5-6).
(4) On services to the mentally ill, the
elderly and the handicapped, Mr Cook pledged that Labour would
appoint a minister for community care. (newspaper)
(5) This is a system in which the rich are
cared for and the poor are left to suffer in silence. (newspaper)
(6)
the appeal, especially in Latin countries, is rather to envy the fortunate
than to pity the unfortunate. (Bertrand Russell)
This “attitudinal bias” might be explained from the effect of
depersonalising by omitting a Noun for the Adjective to modify. Such
explanations may not be foreseen or admissible in established linguistic
theories, but could be helpful for fieldwork research as well as ethnography
(1.7), and also in the teaching of English (4.4).
3.9. New insights are both
reassuring and disturbing. Precisely because linguists are stimulated
when new regularities are discovered (1.7), we are troubled by the prospect of
stopping the advance of theory by freezing the size of a corpus for practical or
technological motives. This fate may befall a corpus when a dictionary or reference work
arrives on the market, and the funding agent terminates support. Linguistics
should therefore provide the public in general and user groups in particular
with enough theoretical and practical knowledge to appreciate the dialectical
ratio between quantity and quality. Only then will commercial markets be
impelled to build larger corpora as grounds to claim better products.
3.10. Our next and closely
related question concerns the ratio between breadth and depth of
description (1.20). Whereas fieldwork research managed a balance by sheer
practical diligence in describing authentic recorded data, homework research
sought to appropriate “infinite” breadth and “universal” depth by sheer
theoretical bootstrapping with handfuls of non-authentic invented data (cf.
1.7f; 1.20). So whereas breadth and depth were slowly achieved by fieldworkers
through an arduous progress of small steps, they were swiftly built right into
the theory of “language” by homeworkers.
3.11. Today, the very large
corpus makes unprecedented breadth accessible but not necessarily achievable.
The computer resembles a long ladder on which we are still learning the skills
for scaling the higher levels in language description. Here too, much depends on
how uniform or diverse a language system might be. For a highly uniform system,
a description would have favourable chances to be both complete (total breadth)
and precise (total depth). The closest approximation in actual language research
is in the companion sciences of phonology and phonetics, sharing theory and
practice in impressive accord. But their uniformity is a straightforward
projection from the human vocal apparatus and the phonetic alphabet. In grammar,
uniformity was brightly postulated in theory but never demonstrated in practice.
And in the lexicon, the undeniable diversity has kept many linguists from
undertaking research at all (cf. 3.23).
3.12. Breadth becomes a
virulent issue when we get access to vast quantities of data. Depth becomes
virulent when we must choose among sources for those data. Most descriptions
produced in modern linguistics have been aimed at an entire language, e.g., at
the “set of grammatical sentences somehow given in advance” (1.20). Data
sources were not acknowledged to constitute a problematic factor, least of all
when the data were invented by the linguists. The same depth of description
would be appropriate everywhere, as would the methods for achieving it. In
corpus research, this optimism soon breaks down. A language itself is by no
means uniformly deep: the Number of Nouns is less deep than Definiteness; Polar
Auxiliaries are less deep than Modal Auxiliaries. Reaching one depth is likely
to open a view of still further depths, as when an analysis of the Agency of
Verbs leads to the discovery of constraints on Pronouns as Subjects or Objects
(cf. 3.32ff; 3.44). And the breadth of a deep description, once achieved, may be
hard to determine, e.g., how many Verbs share constraints on their Agency
(3.34).
3.13. By now we are in the midst of probing the ratio
between uniformity and diversity in a language. Here too, linguistic theory has often
inclined to a sharp dualism. Total uniformity was attributed to language,
witness Chomsky’s “completely homogeneous speech-community” (1.9); yet total
diversity was attributed to discourse, witness Saussure’s
“heterogeneous mass of accidental facts” (1.11). And theory nowhere
explained how so extreme a dualism of order and disorder could inhabit
the same system (1.11).
3.14. No doubt the heavy emphasis upon uniformity was intended to accommodate
commonplace notions of science, but failed to recognise the uniqueness of
language as an object of scientific investigation. There, uniformity and
diversity constitute a dynamic dialectic, parallel though not identical
to the dialectic between language and discourse. Every aspect of uniformity in a
language must be designed to sustain diversity (cf. 3.41). In phonology, the
uniformity of phonemes as shared targets underwrites enormous diversity among
acts of pronunciation due to such factors as the age, gender, and emotional
state of speakers, and their regional or educational background. In grammar, the
functions of uniformity are different in modality due to their more complex and
multimodal needs for expressing multiple modes of meaning. And the lexicon of
English — in contrast to many languages — affords fairly modest and sporadic
uniformity, due to its historical and cultural overlayering of extrinsic or
specialised approaches to word-composition, e.g., borrowing roots from Latin and
Greek.
3.15.
Corpus research is now beginning to reveal the significance of the dialectic
between uniformity and diversity. Language is found to be less uniform, and
discourse less diverse, than linguistic theory is wont to assume. The uniformity
of language is designed to generate diversity on-line; and the diversity of
discourse continually refers back to and renews the uniformity of language (cf.
3.41).
3.16. In terms of corpus practice, uniformity may actually
be a drawback. If we are compiling what Sinclair (1999) calls a ‘generic or
reference corpus’ to cover the English language as broadly as possible, then
we must consider how far the newly arriving data appear uniform or diverse
alongside our already acquired data. The
information value of a corpus would not rise significantly from increasing the
store of uniform data of the same type. This problem applies especially to mass
media, such as the plentiful newspapers conveniently posted on the Internet or
made available by direct electronic transmission, like the Sunday Times.
There, the diversity of the data is restricted in being authored by a relatively
small, well-trained group of writers, and being edited by an even smaller group.
I would also point out the massive ballooning of frequencies, such as I found in the
BoE in July 1994 for key-words like violence (19,226), kill (51,746), death
(31,013), murder (18,383), rape (5,890), and assault (4,055),6 reflecting the
morbid, voyeuristic interests of mass media more than the frequencies of
authentic English at large.
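One rough way to gauge whether newly arriving data are uniform or diverse alongside the data already acquired is to ask how many of their word types are new to the corpus. The Python sketch below is only a crude proxy under that assumption; the names word_types and novelty are hypothetical, and a serious measure would also have to weigh combinations and functions, not just word forms.

```python
import re

def word_types(text):
    """Set of distinct lower-cased word forms in a text (naive tokeniser)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def novelty(existing_types, new_text):
    """Share of the word types in new_text that the corpus does not yet contain."""
    incoming = word_types(new_text)
    if not incoming:
        return 0.0
    return len(incoming - existing_types) / len(incoming)

# Illustrative strings only, not corpus data.
corpus_types = word_types("the minister pledged support for the elderly")
print(novelty(corpus_types, "the minister pledged support"))  # nothing new
print(novelty(corpus_types, "whale boats gave chase"))        # all new
```

On this measure, a further batch of newspaper text resembling what is already stored would score low, whereas a text from a different source would score high.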
3.17. Similar factors bear
upon the ratio between regularities and accidents. Once again, linguistic
theory has been largely dualistic: language constituted by regularities insofar
as it can be an object of science; and discourse littered with accidents and
therefore no fit object of science. Before very large corpora became available,
projects for actually demonstrating regularities by means of statistical
frequencies and probability measures were rare and laborious (e.g. Kučera and Francis 1967). Linguists gave reassurances
that “a linguistic observer can describe the speech habits of the community
without resorting to statistics” because “the forms of language” are “rigidly standardized” (Bloomfield 1933:37); or that when
neither “sentences nor any part of them have
ever occurred in any English discourse” or in “the linguistic experience of a speaker”, they are “statistically” all “equally remote” (Chomsky 1957:17). These two reassurances flatly contradicted each other — data
being all highly probable or highly improbable. But neither could be tested
without powerful technology for measuring the ratio between regular and
accidental (1.22).
3.18. The potential roles for
statistics and probabilities are surely due for reassessment now that we have
very large corpora (Halliday 1991, 1992). Finding and counting manifest items
is most tractable, yet least informative. The frequencies of items in a corpus
may give no reliable indication of their functional load in the language system.
Finding exactly 6000 occurrences for of
in the 5-million-word COBUILD Corpus on CD-ROM is not helpful; we need to know
the proportions for each of its multiple functions in combinations. And
combinations too are subject to the ballooning effects I noted a moment ago in
news media. Among the 20,569 occurrences of sex
returned by the BoE in July 1994, I found Sex
Pistols (at 707), sex appeal (at 762), oral sex (at 203), and sex
discrimination (at 209). Such frequencies are not meaningful unless we
can determine how far the occurrences entail the ‘same’ item at all.
3.19. The frequency of manifest
combinations is thus less tractable, but more informative. Corpus
research has devoted much exploration to the typical lexical
combinations called collocations, and the typical grammatical
combinations called colligations.7
Yet typicality is not readily explained in terms of frequency alone. In my
combined 12-million-word corpora of British and American writers, which I shall
be citing further on, among a total of 339 occurrences of the Verb fled
were only 3 of the collocation fled the country. To my
intuition, this combination seems typical even if its frequency and statistical
probability are negligible. It also occurred just once among 99 uses of fled in
the COBUILD on CD-ROM:
(7)
after the collapse of Tsarist authority, opportunists declared an independent
democracy, then a military junta that fled the country. (book)
But I can draw
some confirmation where the Verb fled took country names as Direct
Objects: France,
Iraq, Kuwait, Croatia, Germany.
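The proportion at issue here can be obtained by counting how often a node word is immediately followed by a given sequence. The following Python sketch assumes plain text and a deliberately naive tokeniser; the function name collocation_share and the sample sentences are illustrative assumptions, not corpus data.

```python
import re

def collocation_share(text, node, following):
    """Return (collocation hits, node hits) for a node word and the
    sequence of words expected immediately after it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    wanted = [word.lower() for word in following]
    node_hits = 0
    colloc_hits = 0
    for i, token in enumerate(tokens):
        if token == node:
            node_hits += 1
            if tokens[i + 1 : i + 1 + len(wanted)] == wanted:
                colloc_hits += 1
    return colloc_hits, node_hits

# Invented sample, for illustration only.
sample = "The junta fled the country. The crowd fled in panic. They fled the city."
hits, total = collocation_share(sample, "fled", ["the", "country"])
print(hits, total, hits / total)
```

Applied to the figures reported above, 3 out of 339 comes to under 1 per cent, which is why frequency alone cannot account for the intuition of typicality.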
3.20. We now come to a truly
daunting question: the ratio between manifest and underlying organisation of
language. Modern linguistics has been postulating an “underlying”
organisation of language all along (e.g. Saussure 1966[1916]:56;
Sapir 1921:144; Bloomfield 1933:225f; Hjelmslev 1969 [1943]:9f;
Chomsky 1965:4f, 10, 18, 22). Among the grandest prospects was that the
“descriptive grammars of diverse languages” will “some day” enable us to
“read from them the great underlying ground plans” (Sapir 1921:144).
Presumably, such “plans” are the goal of work on “linguistic
universals”, but most of that work lacks a secured base in descriptive
grammars.
3.21. Moreover, linguistics
has remained disturbingly evasive about how we can derive the “underlying”
organisation from the manifest organisation. Thus, Chomsky's
provision that “actual data of linguistic performance” would provide
“evidence for determining the correctness of hypotheses about underlying
structure” conflicted with his insistence that “surface structure” is
“unrevealing” and “irrelevant” and “hides underlying distinctions”
(1965:18, 24). With surprising candour, he conceded that his proposed “grammar
does not, in itself, provide any sensible procedure for finding a deep structure
of a given sentence”; and he evaded the whole issue by operating on the
“simplifying and contrary to fact assumption that the underlying basic string is
the sentence” (1965:141, 18).
3.22. Such
evasions readily follow from the already noted tendencies to attribute to
language highly idealised modes of order and to transpose the concept of
language from the particular instance over to a universal abstraction (cf. 1.8,
13, 16, 20; 2.5). Doing so naturally fosters a readiness to see disorder in
manifest data, and hence a reluctance to exploit them in the search for
underlying order (cf. 1.11; 3.12). Instead, artificial modes of order get
borrowed from sources like formal logic or mathematics, which only intensifies
the idealised and abstract nature of “language” (1.8).
3.23. Here, we can highlight
the ratio between grammar and lexicon. Linguistic theory
has long regarded “grammar” as the epicentre of uniformity and regularity
for an entire language and as a home for linguistic universals (compare Saussure
1966 [1916]:133, 152; Sapir 1921:38; Bloomfield 1933:163; Chomsky 1957:56).
In exchange, linguists have long concurred that the lexicon is a mere “list of
basic irregularities” (Bloomfield 1933:274; cf. Sweet 1913:31; Saussure 1966
[1916]:133;
Chomsky 1965:86f, 142, 214, 216). On a smaller scale, this dichotomy re-enacts
the dichotomy between the order of language and the disorder of discourse
(1.11), and again linguistics has chosen order: much work on grammar, little on
lexicon (3.6). Eventually, a homework linguist can baldly announce that
“linguistics is not about language; it is about grammar” (Smith 1984).
3.24. Here
again, linguistic theory should replace the dichotomy with a dialectical
relation, this one co-ordinating grammar and lexicon and constituting the interactive
lexicogrammar, the “semogenic powerhouse of language” ().
The two sides differ not in kind, but in degrees of delicacy: lower toward the grammatical side and higher
toward the lexical side. Perhaps the lexicon could be regarded for some purposes
as “most delicate grammar” (Halliday 1961:256; Hasan 1987:184; Cross
1993:199).8
3.25. The interactions of grammar and lexicon are
readily evident from corpus research on colligations
and collocations
in the sense of 3.19. Since these are
defined as typical combinations,
they continually draw our attention toward plausible motives of speakers or
writers for coordinating multiple selections. For example, the English Verb “brook” meaning
“accept, tolerate” usually requires a Negative element (Sinclair 1994), as
in:
(8)
Johnson could not brook appearing to be worsted in argument (Life)
(9)
Bouille rides, with thoughts that do not brook speech. (French)
(10) his work was of a sort
that would brook no negligence (Lady)
This
Verb is infrequently used, and preferentially in solemn language about some weighty
business, as in Shakespearean drama:
(11) This weighty business will not brook delay (Henry VI)
(12) My business cannot brook this dalliance. (Comedy of Errors)
(13) False king, why hast thou broken faith with me,
Knowing
how hardly I can brook abuse? (Henry VI)
This
second constraint is more delicate than the one requiring a Negative, yet more
difficult to define in terms of manifest lexical choices. The weighty
business might be the assassination of a Duke (11), or just the collection of a
debt (12). The weightiness comes in part simply from using brook rather
than, say, allow or tolerate.
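The first constraint, the Negative element, lends itself to a simple mechanical check on concordance lines. The Python sketch below assumes a small hand-picked list of Negative items and a window of four words on either side of the Verb; both are assumptions for illustration, as is the decision to count hardly as Negative. The second, more delicate constraint of weightiness has no such obvious test.

```python
import re

# Assumed list of Negative items; "hardly" is included as a judgment call.
NEGATIVES = {"not", "no", "never", "cannot", "hardly", "nor", "neither"}

def has_negative_near(sentence, node="brook", window=4):
    """True if a Negative item occurs within `window` words of the node."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for i, token in enumerate(tokens):
        if token == node:
            context = tokens[max(0, i - window) : i] + tokens[i + 1 : i + 1 + window]
            if any(word in NEGATIVES for word in context):
                return True
    return False

examples = [
    "Johnson could not brook appearing to be worsted in argument",
    "his work was of a sort that would brook no negligence",
    "My business cannot brook this dalliance.",
    "Knowing how hardly I can brook abuse?",
]
for line in examples:
    print(has_negative_near(line), line)
```

A check of this kind will also fire on the Noun brook when a Negative happens to stand nearby, so the hits still need to be read.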
3.26. Such data from the lexicogrammar of English
point us toward the immense task of accounting for multiple parameters of
variation in a language: genre, register, and style. In
terms of theory, these constitute intermediary control systems between
the language and the discourse (Beaugrande 1997eh?). Their design must be such
that when one of them is activated, the activation level is raised for
appropriate options and lowered for inappropriate ones (Kintsch 1988; Rumelhart
et al. 1986). In terms of practice, they obviously affect the selections and
combinations we can expect to find in authentic discourse data; but how to
describe those effects is far from clear at this stage.
3.27. Here, we might pursue a strategy of dialectical
resolution: building sub-corpora where
we predict systematic distinctions in quality; and then using our findings to
test and refine our predictions and to assess the typicality of specified data
inventories as indicators of some genre or style (cf. 4.5f). For a brief
demonstration, I shall draw upon three distinctive sources: (a) two corpora of
literature, one by British authors (e.g., Austen, Dickens, Wilde) and one by American authors (e.g., Hawthorne,
Mark Twain, Willa Cather), dating
roughly between 1750 and 1920 and together totalling 8.7 million words; (b)
two corpora of academic and civic writers, again including British (e.g., Darwin,
Bulwer-Lytton, J.S. Mill) and
Americans (e.g., Thomas Jefferson, Jane Addams, W.E.B. DuBois), together totalling 4.8 million words; and (c)
Collins COBUILD on CD-ROM (5 million words), which represents contemporary
everyday usage. The first two sets of corpora, totalling all together 13.5
million words (see Appendix for list of current texts), I compiled myself to run
on WordPilot©, a resource program
developed by John Milton at the Hong Kong University of Science and Technology
(Milton 1999). My compiling too faced fortuitous practical restrictions: I had
to use texts which are in the public domain and can be downloaded from Internet
sites.
3.28.
In sources (a) and (b), the pattern of Definite Article plus Adjective was found
to be more balanced than in the COBUILD data reported in 3.8. The highest
frequency appeared among academic and civic writers, who are logically prone to
classify people. Alongside the contrasts like those noted by Sinclair, e.g.
(14-15), I found many where the fortunate people occurred alone, although
sometimes with the intriguing ironic twist of not being secure in their good
fortune (16-17).
(14) Smile with the simple and feed with the poor?
[…]; let me smile with the wise, and feed with the rich
(Boswell, quoting Samuel Johnson)
(15) None know the unfortunate, and the fortunate
do not know themselves (Poor Richard)
(16) There is always some levelling circumstance that puts down the
overbearing, the strong, the rich, the fortunate,
substantially on the same ground with all others (Emerson)
(17)
the educated see a menace in his [the black man’s] upward
development (W.E.B. DuBois)
If grammar-books describe the pattern as being more general than is
confirmed by contemporary usage in the COBUILD, then perhaps they do so by intuitively
taking academic discourse to be a model of English usage at large.
3.29. On the face of it, dialectical resolution
might look circular: using the type to identify the features of interest, whilst
using those features to identify the type. But text types cannot in theory be
defined through rigorous proof, since in practice most types are defined through
intuitive heuristics by language users. Besides, types are frequently mixed, as
in:
(18) A wedding is a time for merriment and an apt occasion to showcase
age-old traditions in an age where modernity is eroding important aspects of
yesteryear. This much-privy glimpse of Arabia was a re-enacted wedding ceremony
of the indigenous people, reflecting the timeless beauty and simplicity of
Arabia's life-styles, customs and unique identity until the ’70s oil-boom
brought in dramatic socio-economic development. (Khaleej
Times)
Such discourse briskly mixes the styles of solemnity (merriment,
yesteryear),
social science (modernity,
indigenous, identity, socio-economic development),
and tourism (age-old, timeless beauty and simplicity, life-styles),
along with the occasional solecism (much-privy glimpse). The mix reflects
multiple goals, such as disguising a tourist trap as a cultural site whilst
flattering the readers’ command of an educated variety of English here in the Gulf
States.
3.30. Another
strategy might be for us to create local regions of substantial depth by
describing narrow data sets with some thoroughness. The resulting insights might
then be projected across broader sets and guide our selection of aspects and
features to investigate. For example, the COBUILD data at 20 million words
showed a Verb like elude being used only in the Active (cf. Sinclair et
al. 1990:407), e.g.:
(19)
Newer techniques, such as bone-scanning and ultrasound, have enabled us to find
more of the causes of back-pain, but a large number still elude us
(magazine)
(20)
Sylvie Guillem as Nikiya gave us her faultless technique and musicality,
although the spirituality of the role so far eludes
her (newspaper)
In
my literary and academic corpora I found ‘elude’ in the Passive just six
times, as in:
(21)
My importunities would not now be eluded (Wieland)
(22)
they lessen the consumption; the collection is eluded; and the product to
the treasury is not so great (FedPap)
The
meaning for data like (19-20) is roughly: some knowledge or skill would be
fitting but is not found. The meaning for data like (21-22) is more like: some
people find ways of avoiding something. The Passive does seem to me
intuitively old-fashioned; and Passive versions of these Actives seem utterly
improbable:
(19a)
? ?we are eluded by a large number of the causes of back-pain
(20a)
? ?Sylvie Guillem is eluded so far by
the spirituality of the role
3.31. Now, to increase the depth of our analysis of elude,
we can examine some typical collocations
and colligations. Among the Nouns as Direct Objects, the collocations
noticeably clustered around vigilance, which
occurred in 9 uses, e.g. (23), along with associates like observation
(24), eyes (25), and glance (26).
(23)
Nelson feared the more that this Frenchman might get out and elude his vigilance
(Nelson)
(24)
I had not neglected precautions to secure my personal safety, if I could only elude
observation. (Eyre)
(25)
That I could elude Rima’s keener eyes I doubted (Mansions)
(26) Hare’s fateful glance, impossible to elude
(Desert)
Other
typical collocates included grasp (6 uses), e.g. (27), and pursuit
(4 uses), e.g. (28).
(27) the maiden eluded the grasp of the savage (Last)
(28) I stopped at one or two stands of coaches
to elude pursuit (Wrongs)
The meanings of all these collocations involve two opposing agencies,
one of them seeking to elude the other and the potential consequences.
3.32. Among the colligations, the most striking one by far
was a marked preference for Personal Pronouns as Direct Objects. Of the 17
occurrences in COBUILD data, 13 showed this colligation, as in (19-20). Other
examples included:
(29)
he defines his essential position, as a man in permanent search of a God who eludes
him. (newspaper)
(30)
they were artists of considerable distinction, but man-in-the-street recognition
has eluded them (newspaper)
(31)
‘River Lane,’ said Shields. ‘Clarke, of course!’ That was what had been eluding
him. (book)
Here,
a further meaning concerns the lack of some insight or knowledge; the Active
Transitivity shifts the Agency in this lack from the person over to the
knowledge.
3.33. The proportions among the colligations in my other
corpora were less striking but still suggestive: out of 76 occurrences, 22 with
Personal Pronoun Objects. Alongside an idea (32) or a fact (33),
concrete agents like a person (34) or animal (35) did the eluding.
(32)
He spoke like one who was trying to keep hold of an idea that eluded him.
(Time)
(33)
Something seemed to give way in Jimmy’s brain. The simple fact which had eluded
him till now sprang into his mind. (Damsel)
(34)
Although Sam haunted lobby and stairway and halls half the night, the fugitives eluded
him (Whirl)
(35)
All four boats gave chase again; but the whale eluded them (Moby)
Only
two eluded Agents appeared in Direct Objects as Nouns rather than
Pronouns:
(36) this whale eludes both hunters
and philosophers. (Moby)
(37) often the Captain darted out of the shop
to elude imaginary MacStingers [his landlady] (Domb)
I
am aware of no reference, in the linguistic literature on Pronouns, to classes
of Verbs which colligate with Pronoun Objects, let alone any prospective
theoretical account. Provisionally, we might describe such Verbs as expressions of Agent-Opposing
Processes, which are usually accompanied by some preparatory background identifying the Agents.
In some contexts, both Agents are persons (or animals), the Subject doing
something and the Object eluding it. In other contexts, the Subject is not a
person and hence a Pseudo-Agent, but some knowledge or skill that is lacking,
and the Object is an Agent who does not have the initiative. In either
type of context, the eluded Agent is often clear and can be designated by
a Pronoun.
3.34.
The next and much harder problem would be to
explore how broad this locally detected constraint might be. Since a brute-force
query of Verb + Personal Pronoun Object in a large corpus would be explosive, we can tap our intuition to suggest plausible
candidate Verbs. By this means, my queries brought to light the Verbs rebuke
colligating in the Active with Personal Pronoun Objects in 24 out of 51 occurrences; beseech in 94 out of
126; and thank in 121 out of 185. (Also, thank had a fair quota of
Personal Pronoun Subjects, namely in 84 occurrences.) Similar measures of
Personal Pronoun Objects were found with the Pseudo-Agent Verbs behove
in 14 out of 19 occurrences; and befall in 108
out of 189. The data for befall showed a distinct and ominous attitudinal
bias for choices of the Pseudo-Agent Subjects: the
primary collocates were misfortune (at 26), accident (at 23), calamity
(at 19), and disaster (at 10).
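Queries of this kind can be approximated by checking whether each occurrence of a candidate Verb is immediately followed by a Personal Pronoun in Object form. The Python sketch below is a rough approximation under stated assumptions: the pronoun list, the function name pronoun_object_share, and the naive tokeniser are mine; it does not separate Active from Passive; and her is ambiguous between Object and Possessive, so real counts would need manual checking. The proportions at the end simply restate the figures reported above.

```python
import re

# Assumed list of Personal Pronouns in Object form ("her" may also be Possessive).
OBJECT_PRONOUNS = {"me", "you", "him", "her", "it", "us", "them", "thee"}

def pronoun_object_share(text, verb_forms):
    """Return (hits, total): how many occurrences of the listed verb forms
    are immediately followed by a Personal Pronoun, and how many occur at all."""
    forms = {form.lower() for form in verb_forms}
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = total = 0
    for i, token in enumerate(tokens):
        if token in forms:
            total += 1
            if i + 1 < len(tokens) and tokens[i + 1] in OBJECT_PRONOUNS:
                hits += 1
    return hits, total

# Counts reported above: (with Personal Pronoun Object, all occurrences).
REPORTED = {
    "rebuke": (24, 51),
    "beseech": (94, 126),
    "thank": (121, 185),
    "behove": (14, 19),
    "befall": (108, 189),
}

# Fragments adapted from examples (28), (32), (35) and (36).
sample = ("an idea that eluded him. the whale eluded them. "
          "this whale eludes both hunters and philosophers. "
          "I stopped to elude pursuit.")
print(pronoun_object_share(sample, ["elude", "eludes", "eluded", "eluding"]))

for verb, (hits, total) in REPORTED.items():
    print(f"{verb}: {hits / total:.0%}")
```

Listing the verb forms by hand stands in for lemmatisation, which a real query tool would be expected to supply.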
3.35. Using intuition in this way is far from proclaiming
it to supply the “enormous mass of unquestionable data”
invoked by homework linguists (1.9). Intuitions are always
questionable, and the corpus makes the questioning easy. For example, my
intuition suggested Verbs that the corpora did not display in the colligation
patterns of rebuke in any significant proportions, such as reprimand
(6 out of 27) and rebuff (3 out of 34).
3.36. Corpus research recasts
the linguist: not in the role of the “ideal speaker-hearer in
a completely homogeneous speech-community, who knows its language perfectly”,
but in the role of an ordinary speaker-hearer (and
writer-reader) in a heterogeneous community, who knows its language only
partially and actively seeks access to the knowledge of others. We claim
authority for our statements not from harbouring super-human powers of
introspection (1.9), but from examining large sets of authentic data produced by
a community that puts their implicit theories of the language into a wide range
of practices (cf. 1.3). And our statements are not about “language” as some
“universal” abstraction, but about those data in one language and often
about only one genre, register, or style (3.25). Such statements can easily be
“confirmed or
invalidated” by more or other data — a normal effect of the dialectic of
quantity and quality (3.7) — but either step confirms once again the vitality
of using authentic data.
3.37. Intuition and introspection are
thus largely heuristic and opportunistic. They suggest things to try or watch
for, and they help us determine status and meaning after the fact once authentic
data are put before us (Francis and Sinclair 1994:194). They are not too
reliable as sources of data, and still less as sources of information about the
proportions among selections and combinations of data.
3.38. Allow me to demonstrate this point
with one final data set. In July 1994, I found 515 occurrences of couldn’t help and could not
help in the Bank of English, then at 225 million words. My intuition led me
to predict a fair quantity of data colligating with a Direct Object Noun for
some Target person who could not be given assistance, but I found just four, not
even 1% of the total. Here I encountered another phenomenon pointed out by
Sinclair (1991:493f): the presumably basic stand-alone meaning listed in first
place by conventional dictionaries not being at all the most frequent in corpus
data. The meaning of help as ‘give
assistance to’ is listed first in Webster’s
Seventh Collegiate (p. 387) whereas the meaning of ‘refrain from’ or ‘avoid doing’
is listed in seventh place. The design of such a dictionary would hardly admit a
separate definition for not help or could not help, even though the meaning is demonstrably distinct.
3.39.
The leading
colligations by far in the COBUILD data were with Verbs: either a Present
Participle (e.g. couldn’t help
admiring)
or else with but + Infinitive (e.g. couldn’t help
but
laugh).
This I could have predicted, but not my finding that no Adverb ever came in
between (e.g. couldn’t help deeply
admiring
her)
— a fully grammatical option, but not found (but cf. 3.45). In return, I found
two less grammatical mixed
patterns (couldn’t help
but thinking and couldn’t help from
crying), the second one by the distraught Mary Wells, Tornado Victim.
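The sorting of concordance lines by what follows couldn’t help or could not help can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually applied to the Bank of English: the function name classify_colligation, the category labels, and the crude tests (a word ending in -ing counts as a Present Participle, and any other word standing before a participle counts as an Adverb) are all assumptions made for the example.

```python
import re

# Matches "couldn't help" / "could not help" plus up to two following words.
# Both straight and curly apostrophes are accepted in the contraction.
PATTERN = re.compile(
    r"\bcould(?:n['’]t| not) help\b((?:\s+[A-Za-z]+){0,2})",
    re.IGNORECASE,
)

def classify_colligation(line):
    """Return a rough label for what follows the first hit in a line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    following = match.group(1).lower().split()
    if not following:
        return "nothing follows"
    first = following[0]
    second = following[1] if len(following) > 1 else ""
    if first == "it":
        return "it"
    if first == "but":
        # "but laugh" is the regular pattern, "but thinking" the mixed one.
        return "but + participle (mixed)" if second.endswith("ing") else "but + infinitive"
    if first == "from":
        return "from + participle (mixed)" if second.endswith("ing") else "other"
    if first.endswith("ing"):
        return "participle"
    if second.endswith("ing"):
        # Crude assumption: a non-participle word before a participle is an Adverb.
        return "adverb + participle"
    return "other"

if __name__ == "__main__":
    for sample in ("I couldn't help admiring her",
                   "He couldn't help but laugh",
                   "She could not help frequently glancing her eye at Mr. Darcy"):
        print(classify_colligation(sample), "<-", sample)
```

A heuristic of this kind would still need manual checking, since a Noun such as interruption or a non-standard spelling such as feelin’ falls into the residual category.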
3.40.
Still less
could my intuition have predicted the
proportions among the collocations. Almost half of the total (at 234)
collocated with one out of a set of just four Verbs; could you predict which
ones? They were feel (at 68), notice (at 58), think (at
59), and wonder (at 49). Still, if I could not predict, I might
‘retrodict’ after the fact by noting that these Verbs represent Processes
which might well be judged not properly subject to conscious control: they might
lead into emotions, perceptions, and thoughts where it seems fitting to remark
that someone couldn’t help it. The
pattern might therefore be termed a Face-Saving
Auxiliary: an expression which attenuates the Agency of Process Verbs in
order to save face after some Action that might be interpreted as hasty or
inappropriate. Such an explanation may again not be foreseen or admissible in
the theories of mainstream linguistics, but might be useful for ethnographers
(3.8).
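The proportions just cited can be recomputed from the raw counts. The short Python sketch below only restates the figures given above; the helper name shares is an assumption introduced for the illustration.

```python
# Counts reported above for the four leading Verbs after couldn't help / could not help.
TOTAL_OCCURRENCES = 515
verb_counts = {"feel": 68, "notice": 58, "think": 59, "wonder": 49}

def shares(counts, total):
    """Map each item to its percentage share of the total, rounded to one decimal."""
    return {item: round(100 * n / total, 1) for item, n in counts.items()}

combined = sum(verb_counts.values())  # 234, the figure cited in the text
combined_share = 100 * combined / TOTAL_OCCURRENCES

for verb, pct in shares(verb_counts, TOTAL_OCCURRENCES).items():
    print(f"{verb:<10}{verb_counts[verb]:>4}{pct:>7.1f}%")
print(f"{'all four':<10}{combined:>4}{combined_share:>7.1f}%")
```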
3.41.
Moreover,
these same frequent Verbs could also provide useful Headwords for most of
the more delicate collocations, indicating one important way that uniformity is
designed to support a diversity (cf. 3.14). The top-ranked feeling could
be the Headword for attested collocations with crying, laughing/chuckling,
smiling/grinning, blushing, fearing, liking, loving, marvelling, sympathising,
wincing, worrying, plus nearly all the delicate collocates in colligation
with being or be:
touched, charmed,
impressed, moved, emotionally
involved, fascinated, struck, carried away, swept along, amused, jealous,
puzzled, nervous, frightened, surprised, shocked, offended.
Emotions might plausibly render you self-conscious, whether pleasant or
unpleasant, witness also the list of Direct Objects or Modifiers collocating
with the Verb feel in the data: the
pleasant ones enthusiasm, passion,
thrill, pleased, impressed, vindicated, and the unpleasant ones envy,
guilty, ashamed, sorry, miffed, apprehensive, alarmed.
3.42.
The slightly
less frequent noticing could provide a Headword for seeing, looking
at, glancing, hearing, overhearing, remembering, being consciously aware. Thinking
could be the Headword for knowing, considering, reflecting, imagining,
and could subsume the frequent wondering, where uncertainty rather than
emotion might be making you self-conscious.
3.43.
One group of
collocates formed a cluster with no frequent Headword: speaking,
saying, telling, commenting, pointing out, remarking, declaring, suggesting,
responding, agreeing, objecting, reminding, congratulating, blurting out.
Here we might pick the Headword by its generality rather than its frequency: speaking
being involved in all the others but not vice-versa (proverbially, one can speak
without saying anything).
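The grouping of delicate collocates under Headwords described in 3.41 to 3.43 can be pictured as a simple lookup structure. The Python sketch below is an illustration under stated assumptions: it holds only a selection of the attested collocates listed above, and the names HEADWORDS, headword_for, and tally_by_headword are invented for the example rather than taken from any existing tool.

```python
# Headwords and a selection of the attested collocates grouped under them above.
HEADWORDS = {
    "feeling": ["crying", "laughing", "smiling", "blushing", "fearing", "worrying"],
    "noticing": ["seeing", "glancing", "hearing", "overhearing", "remembering"],
    "thinking": ["knowing", "considering", "reflecting", "imagining", "wondering"],
    "speaking": ["saying", "telling", "commenting", "remarking", "blurting out"],
}

# Invert the grouping so that each collocate points to its Headword.
COLLOCATE_TO_HEADWORD = {
    collocate: head for head, group in HEADWORDS.items() for collocate in group
}
for head in HEADWORDS:
    COLLOCATE_TO_HEADWORD[head] = head  # a Headword heads its own group

def headword_for(collocate):
    """Return the Headword for a collocate, or None if it is not grouped."""
    return COLLOCATE_TO_HEADWORD.get(collocate.lower())

def tally_by_headword(collocates):
    """Count how many of the given collocates fall under each Headword."""
    tally = {head: 0 for head in HEADWORDS}
    for collocate in collocates:
        head = headword_for(collocate)
        if head is not None:
            tally[head] += 1
    return tally

if __name__ == "__main__":
    print(tally_by_headword(["crying", "saying", "telling", "wondering"]))
```

Such a table records the analyst's grouping after the fact; it does not discover the groups, which still depend on the kind of retrodiction described in 3.40.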
3.44.
The
colligating Subjects were evenly divided between Nouns and Pronouns. Yet the
proportions among the Pronouns were dramatically uneven. I logged in far ahead at 150 occurrences, followed after a large gap by she (48) and he
(45), and then after another gap by you (15), we
(7), and they (6), plus the
Impersonal one (11) — for a
total of 282 Pronoun Subjects (55% of the total data). Here we may have evidence
for constraints upon what we could call Multi-Process Agency, such that
the identity of the Agent is established for one (or more than one) Process
before saying that Agent couldn’t help
it.
3.45. The data in my two literary corpora gave a more
delicate picture of these constraints. There, I registered 147 occurrences of couldn’t
help and 320 with could not help, for a total of 467. Also, those 320
constituted 86% of the 370 occurrences of not help. The frequency is
highly significant if we consider that these corpora, at a total of just 8.7
million words, are about 25 times smaller than the COBUILD at 225 million, which
returned 515. The most plausible explanation I can find — again not a
“linguistic” one in any established sense — is the useful function for
framing Events in literary discourse so as to communicate to the reader a
character’s own perspective, such as what someone was feeling or thinking,
perhaps with no manifest Action, as in:
(38)
Connie stuck to him passionately. But she could not help feeling how
little connexion he really had with people. (Chatter)
(39)
Mrs Tulliver’s imagination was not easily acted on, but she could not help
thinking that her case was a hard one (Floss)
The
literary style might account for the attestation of inserted Adverbs, which
never appeared in COBUILD data (3.39), such as:
(40)
She could not help frequently glancing her eye at Mr. Darcy (Pride)
(41)
she could not help secretly advising her father not to let her go.
(Pride)
(42)
Florence could not help sometimes comparing the bright house with the
faded dreary place (Domb)
In
some such data, there is no other reasonable place to put the Adverb.
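The frequency comparison at the start of this section becomes more transparent when both raw counts are normalised to a common base. The Python sketch below does this per million words, a conventional choice that is assumed here rather than taken from the text; the function name per_million is likewise an assumption, and the literary word count is the total given in Appendix 2.

```python
def per_million(hits, corpus_words):
    """Normalise a raw hit count to occurrences per million words."""
    return hits / corpus_words * 1_000_000

# Figures from the text: 467 hits in the literary corpora, 515 in the Bank of English.
literary = per_million(467, 8_694_588)
cobuild = per_million(515, 225_000_000)
ratio = literary / cobuild

print(f"literary corpora: {literary:.1f} per million words")
print(f"Bank of English:  {cobuild:.1f} per million words")
print(f"ratio:            {ratio:.1f}")
```

On these figures the literary corpora yield roughly 54 occurrences per million words against a little over 2 in the Bank of English.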
3.46. The
personal and internal quality might also help explain the tremendous
frequencies, similar to those noted in COBUILD data, of First and Third Person
Singular Pronouns as Subjects: I
(151), he (75), and she (85), for a total of 311 (67% of all my
data). The Plurals were rare, with we (6) and they (5), probably
because a feeling or a thought normally belongs to just one Agent. The Second
Person Pronoun you was rare too (4), doubtless because of the low
probability of telling somebody else to their face what they couldn’t
help.
3.47.
At still
greater delicacy, I found that choice of the Contraction couldn’t made
a difference here. Whereas she and he were about half as frequent
as for could not, I was more than twice as frequent:
Subject    couldn’t help (total of 147)    could not help (total of 320)
I          73 (49%)                        78 (24%)
she        17 (11.5%)                      68 (21%)
he         19 (13%)                        78 (24%)
I
checked all the data to see if the Contraction was preferred for spoken
discourse. And in fact, only 14 out of 73 uses with couldn’t
did not occur in direct speech like (43), but in the narrator’s voice of
first-person narratives like The Adventures of
Huckleberry
Finn, e.g. (44); this last work alone contributed 7 uses, but then Huck never says could
not in any context. Conversely, only 4 out of 78 uses with could
not appeared in direct speech like (45); all the rest were in the
narrator’s voice, like (46).
(43)
‘she took all the grit out o’ him. I couldn't help feelin’ sorry for him
sometimes’. (Fauntle)
(44)
I had to skip around a bit, and jump up and crack my heels a few times — I
couldn't help it (Finn)
(45)
‘He was a very good man, sir; I could not help liking him’. (Eyre)
(46)
For my part, I could not help thinking this lawyer was not such an invalid as he
pretended to be. (Clink)
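One way to ask whether the link between the Contraction and the Subject I is more than chance is a chi-square test on a two-by-two table. This test is not part of the analysis reported above; it is added here only as a hedged sketch, using the counts for I from the table and treating all other Subjects together. The function name chi_square_2x2 is an assumption, and since many of the occurrences come from a few texts, the independence that the test presupposes is doubtful.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Rows: couldn't help (147), could not help (320).
# Columns: Subject I, any other Subject.
contracted_i, contracted_other = 73, 147 - 73
full_i, full_other = 78, 320 - 78

chi2 = chi_square_2x2(contracted_i, contracted_other, full_i, full_other)
CRITICAL_5_PERCENT_DF1 = 3.841  # standard critical value for one degree of freedom

print(f"chi-square = {chi2:.2f}")
print("association at the 5% level:", chi2 > CRITICAL_5_PERCENT_DF1)
```

The result should be read as a rough indication only, for the reason just given.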
3.48.
Related
constraints applied to the occurrences of the Pronoun it
as Direct Object. The data with the Contraction logged 70 instances (47%), the
data with could not a mere 19 (6%). Here also, the context tends
to establish Identities: not for Agents and Targets, as for the Subject (3.44),
but for Actions and States. The tiny frequencies of the Third Person Pronouns her
(1), him (1), and them (2) as Direct Objects again document the
rarity of the sense of help as ‘give assistance to’ (3.38). The few
Nouns as Direct Objects were also expressions for Actions, not Agents, as in:
(47)
Connie could not help a sudden snort of astonished laughter (Chatter)
(48)
‘I couldn't help the interruption, but I made up for it afterward by
working until two’ (Carrie)
I accordingly found a modest scatter of pairs with
the same Action as Noun or as Verb, as in:
(49)
With such a possibility impending he could not help watchfulness.
(Caster)
(50)
Catherine, though not allowing herself to suspect her friend, could not help watching
her closely (Abbey)
The colligation with a Verb in the Present Participle
was quite conspicuous with could not: 256 out of 320 (exactly 80%). For couldn’t
help, this colligation logged in at 61 out of 147 (41%), having to compete
there with it at 70. Some authors used couldn’t help exclusively
with it, such as Mark Twain, Harriet Beecher Stowe, and Theodore Dreiser.
3.49. The matter of authors’ preferences as compared to linguistic regularities is a puzzling one in corpus research. We might contend that my corpora are far too small, which is doubtless perfectly true, the more so given the sheer size of some single texts, such as Joyce’s Ulysses at over 266,000 words. However, differences in size among sample texts are an important empirical given, especially when the public is expected to read the whole text. Besides, we cannot determine in advance how far an author or a text might be internally consistent enough to skew our measurements in one direction; Ulysses surely is not. The colligation depend upon it (meaning ‘you may be sure’) appears 55 times in my corpora, of which 28 come from Jane Austen; yet her usage was typical of the whole sample, where fully 46 are Imperatives and 8 more colligate as you may depend upon it in the same meaning. The typicality was confirmed by data in my corpora of British and American academic and civic writers. There, depend upon it appears 23 times, again as Imperative or with you may. Of these, 14 were uttered by Dr. Johnson in Boswell’s Life, whose following item Sir in 12 occurrences can be safely charged to a personal idiosyncrasy.
3.50. At least as puzzling is the matter of translators’ preferences as compared to linguistic regularities of multiple languages. The English colligation couldn’t help plus Verb (51-52) does not show regular correlates in the German (51a-52a) or Spanish (51b-52b) versions of Alice in Wonderland, whereas French makes do with ne pouvoir s'empêcher (51c-52c). But the colligation couldn’t help it has a separate correlate in all three versions (53-53c).
(51)
Alice was very
nearly getting up and saying, ‘Thank you, sir, for your interesting story’,
but she could not help thinking there must
be more to come
(51a) Alice war nahe daran, aufzustehen und zu sagen: ‘Besten
Dank für deine wirklich interessante Lebensgeschichte’,
aber dann sagte sie sich, daß doch noch einfach etwas kommen mußte
(51b)
Alicia estaba dispuesta a levantarse y decir: ‘Gracias, señora, por su interesante historia’,
pero no pudo dejar de pensar que algo más iba a decir la Tortuga
(51c) Alice fut sur le point de se lever en disant: ‘Je
vous remercie, madame, de votre intéressante histoire’, mais elle ne put s'empêcher de penser
qu'il devait sûrement y avoir une suite
(52) it would twist itself round
and look up in her face, with such a puzzled expression that she could
not help bursting out laughing
(52a) hatte das Tier eine Art, sich umzudrehen und ihr mit einem so
verwunderten Ausdruck ins Gesicht zu sehen, daß sie laut herauslachen
mußte
(52b)
el ave de pronto se giraba, mirándole a la cara con tan perpleja expresión
que Alicia no podía contener la risa
(52c) le flamant ne manquait pas de se retourner et de la regarder bien en
face d'un air si intrigué qu’elle ne pouvait s'empêcher de rire
(53)
‘Look out now, Five! Don’t go splashing paint over me like that!’ ‘I
couldn't help it’, said Five
(53a)
‘Paß doch auf, Fünf. Du spritzt mich ja überall voll mit deiner Farbe!’
‘Dafür kann ich nichts’, sagte Fünf
(53b)
‘¡Ten cuidado, Cinco! ¡Me estás salpicando todo de pintura!’ ‘Fue
sin querer’- dijo Cinco
(53c) ‘Fais donc attention, Cinq! ne m’éclabousse
pas de peinture comme ça!’ -‘Je ne l'ai pas fait exprès’, répondit
l'autre
Here
looms a vast field of research for translation studies with parallel text
corpora (cf. King and Woolls 1996). Correlated expressions that collocate and
colligate the same way in two or more languages will probably prove to be rare
indeed.
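The kind of query a parallel concordancer supports can be sketched with a handful of aligned segments. The Python example below is a toy illustration, not the concordancer described by King and Woolls: the segments are shortened from examples (51) to (53), the German versions are left out to keep the sketch short, and the names ALIGNED and parallel_concordance are assumptions made for the example.

```python
import re

# Each entry aligns one English segment with its Spanish and French counterparts.
ALIGNED = [
    {"en": "she could not help thinking there must be more to come",
     "es": "no pudo dejar de pensar que algo más iba a decir la Tortuga",
     "fr": "elle ne put s'empêcher de penser qu'il devait sûrement y avoir une suite"},
    {"en": "she could not help bursting out laughing",
     "es": "Alicia no podía contener la risa",
     "fr": "elle ne pouvait s'empêcher de rire"},
    {"en": "I couldn't help it",
     "es": "Fue sin querer",
     "fr": "Je ne l'ai pas fait exprès"},
]

def parallel_concordance(pattern, segments, source="en"):
    """Return every aligned entry whose source segment matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [entry for entry in segments if regex.search(entry[source])]

hits = parallel_concordance(r"could(?:n't| not) help", ALIGNED)
for entry in hits:
    print(entry["en"], "|", entry["es"], "|", entry["fr"])

# Count how often a candidate correlate turns up among the aligned hits.
french_matches = [e for e in hits if re.search(r"s'empêcher", e["fr"])]
print(len(french_matches), "of", len(hits), "French segments use s'empêcher")
```

With real parallel corpora the alignment itself is the hard part; the sketch simply assumes that it has already been done.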
4.
Into the millennium
4.1.
I would hope that the present discussion may have etched some scratches upon the
surface of the changing picture of language and discourse under the impact of
large corpus data. The impact seems
sufficiently radical that a major scientific revolution or paradigm shift could
be predicted. In the past, linguistics has tended to cultivate a large supply of
abstract theories whilst postponing and marginalizing description
of practices. Today we confront a far larger supply of concrete practices, which
must be described before we can even define what a “language” is. I do not
advocate that theory-building should be shelved, even temporarily; but rather
that theory-building should finally and definitively cease to run so far ahead
of practice, and cease to devise arguments why theory cannot be derived or
tested from practice.
4.2.
As a corollary, unquestioned
scientific priority
would no longer be allotted to abstract
and general statements. These may prove the hardest to demonstrate with
authentic data. And we may incur the paradox of trying to base
a general theory upon special cases by selecting data devoid of special features
(cf. 1.10). How general or specific a description deserves to be should be
decided by our data and by the purposes of our research. Concrete and specific statements may prove
more realistic, and for some purposes, such as language teaching, more useful.
Moreover, data-driven descriptions are by nature specific in
the incipient stages, and gradually gain generality as our picture improves of
what to examine. A substantial range of constraints should turn out to be more
specific than a discourse yet less general than the whole language (2.9).
4.3.
As a further corollary, we should no longer displace real data with invented
data, or convert data into formal representations. Instead, we should work to
get as far as we can using real data to represent themselves. Even our
description of the underlying organisation of data should be as data-driven as
possible, rather than expressed in some purely theory-driven “deep
structure” comprising “universal categories”, which I hold least suitable
to “provide tools for describing a text” (cf.
1.14-15). To
judge from past experience,
‘universals’ tend to be indirectly extrapolated from particular languages
after all, especially English (2.3). The latter’s dominance in linguistic
theory can only be effectively transcended by much resolute work on large
corpora in as many languages as possible, each treated on its own terms.
4.4.
Meanwhile, the well-described languages like
English could be used by corpus researchers not to hasten above and beyond the
data (as homework linguists did, 1.10), but to present data to wide audiences of
specialists and non-specialists to test and discuss. By broadening our audience
base, we can most safely offset personal biases in our own intuition and
introspection. And the chances for productive applications will improve, such as
language teaching.
4.5. My own prediction would
be that progress will evolve out of the process I have called dialectical resolution (3.27): the corpora that confront us with
problems will provide vital support in solving those problems. If authentic data
confront us with diversity, then we should keep building sub-corpora
until each of them displays signally enhanced internal uniformity. Then we can
compare these sub-corpora to identify and investigate which parameters and
constraints are more general or more specific. My own work on text types
indicates that types are often untidy and fuzzily defined, due especially to
differences between insiders and outsiders, e.g., between academic journals and
learner textbooks (Beaugrande 2001). Much academic writing is strenuously and
gratuitously technical and actually impedes communication; but effective
strategies to improve efficiency require corpus data for describing current
practices.
4.6.
Again by dialectical
resolution, a large corpus can increase breadth without flattening depth if the technology
itself is enlisted in the operations of description. Doing so requires
sophisticated software for ‘tagging’ and ‘parsing’ the data; the
description of ‘open text’ with no such preparation is still not genuinely
operational (Sinclair 1999). The
more secure categories like “Article”, “Preposition”, or “Auxiliary
Verb” are by no means delicate enough. The more innovative ones, like “staging” and “collateral” in fieldwork
(2.1), or “Agent-Opposing
Process” and “Face-Saving Auxiliary” proposed here (3.33; 3.40), are not
secure. At this stage, the categories of our description
can only be heuristic,
not formalised. Certainly, we have no sound reason to junk our established terms, nor to reintroduce
them in technical guises; instead, corpus data should enable us to render them
more applicable and precise as tools of description. We could for example retain
the terms
“Noun” and “Verb” whilst exploiting corpus data to make their meanings
more delicate, e.g., by determining whether the “Nominal” or the
“Verbal” formation from the same stem can be regarded as more basic; or
whether the two might have evolved apart into quite distinct ranges of
colligation and collocation.
4.7.
If the dialectic of language and discourse can be restored to the centre of
linguistic description, then the prospects for dialectical resolution
should be favourable in the long run. For the present, the imperative would be
to sustain a spirit of renewal and openness for new phenomena, new methods, and
new discoveries stretching out into a new millennium.
ENDNOTES
1 Fillmore’s (1992) term of ‘armchair
linguist’ is not quite accurate when a computer terminal is in front of the
chair.
2 For the
record, none of these three had appeared in the Bank of English, the world’s
largest data corpus, as of July 1994.
3 The noted Danish
linguist Jacob Mey (personal communication) tells me that the term “prolegomena” with its Kantian echoes in
the English title was entirely on the initiative of Hjelmslev’s translator,
the late Francis Whitfield. Hjelmslev “was never a Kantian, not even a Neo-
one”; Mey “would in hindsight characterise him as a purebred Neopositivist”.
4 Reported to me by Halliday in conversation in
Beijing in July 1995.
5 Except by Halliday, who told me in
1994 that he had written a paper entitled “How big is a language?”, but it
was not published. Some of its core ideas were taken over into Halliday (1996).
6
These totals include inflected forms too.
7 The terms were introduced
by J.R. Firth (1957 [1934-51], 1968 [1952-59]), but gained little substance
until corpus data arrived.
8 Actually, the term used in this work is not
“lexicon” but “lexis”, defined to cover “the resources of the
vocabulary” and the “process of lexical choice” (Cross 1993:196). But we
might retain the old term in this newer meaning.
REFERENCES
Aston,
G. and L. Burnard. 1998. The BNC Handbook: Exploring the British National
Corpus with SARA. Edinburgh: Edinburgh University Press.
Beaugrande,
R. de. 1984. “Linguistics as discourse: A case study from semantics.”
WORD 35:15-57.
Beaugrande,
R. de. 1991. Linguistic theory: The
discourse of fundamental works. London: Longman.
Beaugrande, R. de. 1997a. New foundations for a science of text and discourse. Stamford, CT:
Ablex.
Beaugrande, R. de.
1997b. “Theory and practice in applied linguistics: Disconnection,
conflict, or dialectic?”
Applied Linguistics 18/3:279-313.
Beaugrande,
R. de. 1998. “Performative
speech acts in linguistic theory: The rationality of Noam Chomsky.”
Journal of Pragmatics 29:1-39.
Beaugrande, R. de. 2000a. “There
is no such thing as syntax — And it’s a good thing too!”
Festschrift in honour of Jan Firbas.
Ed. Josef Hladky et al. Amsterdam: Benjamins.
Beaugrande, R. de. 2000b. “Text linguistics at the millennium: Corpus data and missing links.”
Text 21.
Beaugrande, R. de. 2001. “Cognition
and technology in education: Knowledge and information — language and
discourse.” Cognition and Technology 1/2.
Bloomfield, L. 1933. Language.
New York: Holt.
Borges,
J.L. 1964. Labyrinths. New York: New Directions.
Chomsky, N. 1957. Syntactic
structures. The Hague: Mouton.
Chomsky, N. 1965. Aspects
of the theory of syntax. Cambridge: MIT.
Chomsky, N. 1982. The
generative enterprise. Dordrecht: Foris.
Cross,
M. 1993. “Collocation in computer modelling of lexis
as most delicate grammar.” Register Analysis. Ed. M. Ghadessy. London:
Pinter. Pp 196-220.
Dixon,
R.M.W. 1968. The Dyirbal language of North Queensland. London: University
of London PhD Thesis.
Firth, J.R. 1957. Papers
in linguistics 1934-1951. London: Oxford University Press.
Firth, J.R. 1968. Selected
papers of J.R. Firth 1952-1959. Ed. F.R. Palmer. London: Longman.
Francis, G. and J.McH. Sinclair. 1994. “I
bet he drinks Carling Black Label: A riposte to Owen on corpus grammar.” Applied
Linguistics 15:190-200.
Grimes,
J. 1975. The thread of discourse. The
Hague: Mouton.
Hall,
R.A. Jr. 1968. An essay on language. Philadelphia: Chilton Books.
Halliday, M.A.K. 1961. “Categories of a theory
of grammar.” WORD 17:241-292.
Halliday, M.A.K. 1991. “Corpus studies and
probabilistic grammar.” English Corpus
Linguistics. Ed.
K. Aijmer and B. Altenberg. London: Longman. Pp. 30-43.
Halliday, M.A.K. 1992. “Language
as system and language as instance: The corpus as a theoretical construct.”
Directions in corpus linguistics. Ed.
J. Svartvik. Berlin: Mouton de Gruyter. Pp. 61-77.
Halliday,
M.A.K. 1994. Language in a changing world.
Sydney: Australian Association of Applied Linguistics.
Halliday, M.A.K. 1996. “Grammar
and grammatics.” Functional descriptions. Ed. R. Hasan, C. Cloran,
and D.G Butt.
Amsterdam: Benjamins. Pp. 1-13.
Halliday, M.A.K. 1997. “Linguistics
as metaphor.” Reconnecting language. Ed. A.M Simon-Vandenbergen,
K. Davidse, and D. Noël. Amsterdam: Benjamins. Pp. 3-27.
Hartmann, P. 1963. Theorie
der Sprachwissenschaft. Assen: van Gorcum.
Hasan, R. 1987. “The grammarian’s
dream: Lexis as most delicate grammar.” New
developments in systemic linguistics. Ed. M.A.K. Halliday and R. Fawcett.
London: Pinter. Pp. 184-211.
Hjelmslev, L. 1969 [orig. 1943]. Prolegomena to a
theory of language. Madison: University of Wisconsin Press.
King, P. and D. Woolls.
1996. “Creating
and using a multilingual parallel concordancer.” Translation and Meaning
4:459-66.
Kintsch,
W. 1988. “The role of knowledge in discourse comprehension: A
‘construction-integration model’.”
Psychological Review 95/2:163-82.
Kučera, H. and W.N.
Francis. 1967. Computational analysis of present-day American English.
Providence: Brown University Press.
Kuhn,
T. 1970. The structure of scientific
revolutions. Chicago: University of Chicago Press.
Longacre,
R. 1970. Discourse, paragraph, and
sentence structures in selected Philippine languages. Santa Ana: Summer
Institute of Linguistics.
Longacre,
R. et al. 1990. Storyline concerns and
word order typology in East and West Africa. Studies in African Linguistics,
Supplement 10. Los Angeles: UCLA Dept. of Linguistics.
Milton, J. 1999. “Lexical
thickets and electronic gateways: Making text accessible by novice writers.”
Writing: Texts, processes and practices. Ed. C. Candlin and K. Hyland.
London: Longman. Pp. 221-243.
Pike, K.L. 1967 [orig.
1945-64]. Language in relation to a
unified theory of the structure of human behavior. The Hague: Mouton.
Rumelhart, D., et al. 1986. Parallel distributed processing: Explorations in the microstructure of
cognition. Cambridge, MA: MIT Press.
Sapir, E. 1921. Language. New York:
Harcourt, Brace, & World.
Saussure, F. de. 1966
[orig. 1916]. Course in general
linguistics. Transl. Wade Baskin. New York: McGraw-Hill.
Sinclair,
J.McH. 1991. “Shared
knowledge.” Georgetown
University Round Table on languages and linguistics 1991. Ed. J. Alatis.
Washington, DC: Georgetown University Press. Pp. 489-500.
Sinclair, J.McH. 1994. Large corpora are here to
stay. Lecture at the University of Vienna, June 1994.
Sinclair, J.McH. 1998.
“Large corpus research and foreign language teaching.” Language Policy and Language Education in Emerging Nations:
Focus on Slovenia and Croatia. Ed. R. de Beaugrande,
M. Grosman, and B. Seidlhofer. Stamford, CT: Ablex. Pp. 79-86.
Sinclair,
J.McH. 1999. “New roles for
language centres: The mayonnaise problem.” Language centres:
Innovation through integration. Ed.
D. Bickerton and M. Gotti. Plymouth: CercleS. Pp. 31-50.
Sinclair, J.McH. et al. 1990. Collins COBUILD English grammar. London: Harper Collins.
Smith, N.V. 1983. Speculative linguistics: An
inaugural lecture. London University College.
Sweet, H. 1913
[1875-76]. “Word,
logic, and grammar.” Collected papers of Henry Sweet. Oxford:
Clarendon. Pp. 1-33.
Tognini Bonelli, E. 1996. Corpus
theory and practice. Pescia: Tuscan Word Centre.
Widdowson, H.G. 1991. “The description and
prescription of language.” Georgetown
University Round Table on Languages and Linguistics. Ed. J. Alatis.
Washington, D.C.: Georgetown University Press. Pp. 11-24.
Appendix 1. KEY TO ABBREVIATIONS
Abbey: Jane Austen Northanger
Abbey
Carrie:
Theodore Dreiser Sister Carrie
Caster:
Thomas Hardy Mayor of Casterbridge
Chatter: D.H. Lawrence Lady Chatterley’s Lover
Clink:
Tobias Smollett The Expedition of Humphry Clinker
Damsel:
Pelham Grenville Wodehouse A Damsel in Distress
Desert:
Zane Grey The Heritage of the Desert
Domb: Charles
Dickens Dombey and Son
Eyre: Charlotte
Brontë Jane Eyre
Fauntle:
Frances Hodgson Burnett Little
Lord Fauntleroy
FedPap:
Alexander Hamilton, John Jay, and James Madison The Federalist Papers
Finn:
Mark Twain, The Adventures of Huckleberry Finn
Floss:
George Eliot Mill on the Floss
French: Thomas Carlyle The French Revolution
Lady: Henry James The Portrait of
a Lady
Last:
James Fenimore Cooper The
Last of the Mohicans
Life: James Boswell The
Life of Samuel Johnson, LL.D.
Mansions: W. H. Hudson Green Mansions
Moby: Herman Melville Moby Dick
Nelson:
Robert Southey The Life of
Horatio Lord Nelson
Pride: Jane
Austen Pride and Prejudice
Time:
H.G. Wells The Time Machine
Whirl: O.
Henry Whirligigs
Wieland: Charles
Brockden Brown, Wieland
Wrongs: Mary
Wollstonecraft, Maria, or The Wrongs of Woman
Appendix
2. TEXTS CURRENTLY IN THE WRITERS’ CORPORA
AMERICAN AND BRITISH LITERATURE
Ambrose Bierce, The Devil’s Dictionary
Anne Brontë, The
Tenant of Wildfell Hall
Anthony Trollope, Barchester
Towers
Ayn Rand, Anthem
Booth Tarkington, Penrod
Charles
and Mary Lamb,
Tales from Shakespeare
Charles Brockden Brown, Wieland;
Or the Transformation
Charles
Dickens, Dombey and Son
Charles
Dickens, Pickwick Papers
Charlotte
Brontë, Jane Eyre
D.H.
Lawrence, Lady Chatterley’s Lover
Daniel Defoe, The Life and
Adventures of Robinson Crusoe
E. Nesbit, The Wouldbegoods
Edgar Allan Poe, The
Narrative of Arthur Gordon Pym
Edgar Allan Poe, The
Fall of the House of Usher
Edward George Bulwer-Lytton, The
Last Days of Pompeii
Emily
Brontë, Wuthering Heights
F. Scott Fitzgerald, This
Side of Paradise
Frances Hodgson Burnett,
Little Lord Fauntleroy
G. K.
Chesterton, The Innocence of Father Brown
George
Eliot, Mill on the Floss
H. H.
Munro (Saki), Beasts and Super-Beasts
H.G. Wells, The Time Machine
Harriet Beecher Stowe, Uncle
Tom’s Cabin
Henry Fielding, The History of
Tom Jones, a Foundling
Henry James, The Portrait
of a Lady
Henry James, The Turn of the
Screw
Herman Melville, Moby Dick
Horace Walpole, The Castle of
Otranto
Horatio Alger, Jr., The
Cash Boy
Hugh
Lofting, The Story of Doctor Dolittle
James Fenimore Cooper, The
Last of the Mohicans
James Joyce, Portrait of the
Artist as a Young Man
James Joyce, Dubliners
James Joyce, Ulysses
Jane Austen, Sense
and Sensibility
Jane
Austen, Emma
Jane
Austen, Northanger Abbey
Jane
Austen, Pride and Prejudice
John Bunyan, The
Pilgrim’s Progress
John Masefield,
Martin Hyde - The Duke’s Messenger
John
Ruskin, Sesame and Lilies
Jonathan
Swift, A Modest Proposal
Joseph
Conrad, Heart of Darkness
Joseph
Conrad, Lord Jim
Kate Chopin, The Awakening and
Selected Short Stories
Katherine
Mansfield, In a German Pension
Kenneth Grahame, The Wind in
the Willows
Laurence
Sterne, A Sentimental Journey
Laurence Sterne, The Life and
Opinions of Tristram Shandy
Lewis Carroll, Alice in
Wonderland/Through the Looking Glass
Louisa May Alcott, Little
Women
Lucy
Maud Montgomery, Anne of
Green Gables
Mark Twain, The Adventures of
Huckleberry Finn
Mark Twain, Tom Sawyer/Tom Sawyer, Detective/ Tom Sawyer Abroad
Mary
Shelley, Frankenstein
Mary Wollstonecraft, Maria,
or The Wrongs of Woman
Nathaniel Hawthorne, Tanglewood
Tales
Nathaniel Hawthorne, The
Scarlet Letter
O. Henry, Options/
Voice of the City /Whirligigs
Oscar
Wilde, The Picture of Dorian Gray
Pelham Grenville Wodehouse, A
Damsel in Distress and Piccadilly Jim
Ralph Waldo Emerson, Essays
Robert Louis Stevenson, The Strange Case of Dr
Jekyll and Mr Hyde
Robert
Louis Stevenson, Treasure Island
Rudyard
Kipling, The Jungle Book
Sarah Orne Jewett, The Country
of the Pointed Firs
Sherwood Anderson, Winesburg,
Ohio
Sinclair Lewis, Babbitt
Stephen Crane, The Red
Badge of Courage
Susanna Rowson, Charlotte
Temple
Theodore Dreiser, Sister
Carrie
Thomas
Hardy, Mayor of Casterbridge
Thomas Hardy, Far from the
Madding Crowd
Tobias Smollett, The
Expedition of Humphry Clinker
Upton Sinclair, The Jungle
W. H. Hudson, Green Mansions
W. Somerset Maugham, The
Moon and Sixpence
Walter Scott,
Ivanhoe
Washington Irving, The
Legend of Sleepy Hollow
Wilkie
Collins, The Moonstone
Willa Cather, My Antonia
William Dean Howells, The Man
of Letters as a Man of Business
William
Makepeace Thackeray, Vanity Fair
Zane Grey, The Heritage of the
Desert
Total
Word-Count
8,694,588
CIVIC AMERICANS
Alexander
Hamilton, John Jay, and James Madison, The Federalist Papers
Benjamin
Franklin, Autobiography
Benjamin
Franklin, Poor Richard’s Almanack
Booker T. Washington, Up
From Slavery
Franklin Delano Roosevelt, Inaugural
Speech 1933
Frederick Douglass, My
Bondage and My Freedom
H. L. Mencken, In Defense
of Women
Henry
David Thoreau, Walden/Civil Disobedience
James Russell Lowell, Abraham
Lincoln
Jane
Addams, Twenty Years at Hull House
John Dewey, Democracy and
Education
John
Muir, Steep Trails
Martin Luther King, Jr., I
have a Dream
Oliver Wendell Holmes, The
Autocrat of the Breakfast-Table
Ralph
Waldo Emerson, Essays
Thomas
Jefferson, Autobiography
Thomas
Paine, Common Sense
Thorstein Veblen, The
Theory of Business Enterprise/ The Theory of the Leisure Class
W. E. B. DuBois, The Souls
of Black Folk
William James, The
Varieties of Religious Experience
Total
Word-Count 1,793,554
BRITISH ACADEMICS
Adam
Smith, An Inquiry into the Nature and Causes of the Wealth of Nations
Bertrand Russell, Proposed
Roads To Freedom
Charles Babbage, Reflections
on the Decline of Science in England
Charles Darwin, The Origin
of Species/ The Voyage of the Beagle
David Hume, An Enquiry
Concerning Human Understanding.
David Ricardo, On the
Principles of Political Economy and Taxation
Francis Bacon, Essays
Frederick Engels, The
Origin of the Family, Private Property, and the State
Herbert Spencer, The Man
Versus the State
James Boswell, The Life of Samuel Johnson,
LL.D.
Jeremy
Bentham,
Defence of Usury
John Locke, An Essay
Concerning Human Understanding
John Maynard Keynes, The
Economic Consequences of the Peace
John Milton, Areopagitica
John
Stuart Mill, The Principles of Political
Economy
Karl Marx, The Poverty of Philosophy
Lytton
Strachey, Queen Victoria
Philip
Sidney, Defence of Poesie
Robert Southey, The Life of
Horatio Lord Nelson
Thomas Carlyle, The French
Revolution
Thomas Henry Huxley, On the
Reception of the ‘Origin of Species’
Thomas Hobbes, Leviathan
Thomas More, Utopia
William Godwin, Thoughts on
Man
Total 3,026,566; with
American Civics 4,820,120
Total
Words in All Corpora 13,514,708