Syntacticus provides easy access to around a million morphosyntactically annotated sentences from a range of early Indo-European languages.
Syntacticus is an umbrella project for the PROIEL Treebank, the TOROT Treebank and the ISWOC Treebank, which all use the same annotation system and share similar linguistic priorities. In total, Syntacticus contains 91,723 sentences or 1,054,796 tokens in 10 languages.
We are constantly adding new material to Syntacticus. The ultimate goal is to have a representative sample of different text types from each branch of early Indo-European. We maintain lists of the texts we are currently working on, which you can find on the PROIEL Treebank and TOROT Treebank pages. Annotation is extremely time-consuming work, so please be patient!
The current focus for Syntacticus is to consolidate and edit our documentation so that it is easier to approach. We are very aware that the documentation as it stands is inadequate! New features and better integration with our development toolchain are also planned for the near future.
|Language|Size|
|Old English|29,406 tokens|
|Old French|2,340 tokens|
|Classical Armenian|23,513 tokens|
|Ancient Greek|250,455 tokens|
|Old Church Slavonic|140,276 tokens|
|Old Russian|235,275 tokens|
In Syntacticus each text has been split into words, and then each word has been
- lemmatised (i.e. linked to its dictionary entry),
- assigned a part of speech (i.e. classified as noun, verb etc.),
- assigned morphological features (e.g. tagged with its case form or its tense), and
- given a syntactic function and linked to one or more other words (e.g. the subject of a verb has been labelled a subject and linked to the verb).
You can use this information in a number of ways. For example, if you know Latin but need help understanding the structure of a complex sentence, you can look up the specialist's analysis of that sentence.
The lemmatisation, parts of speech and morphology broadly speaking follow the same principles as standard reference grammars of Indo-European languages. In some situations we have adopted a different approach, which is more in line with modern formal linguistic thinking. This is the case in particular for various function words (such as subordinators, subjunctions, particles and interjections), which reference grammars tend to disagree on.
The syntactic annotation is based on the principles of dependency grammar. Each word is assigned a function, called a relation, and then linked to its head. For the English sentence "John loves Mary", for example, "John" would have the relation subject and its head would be the verb "loves" because it is the subject of that verb. "Mary" would be object and its head would also be "loves".
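As a minimal sketch of this idea, the John loves Mary example can be written down as a list of tokens, each carrying a relation and a pointer to its head. The `Token` class, its field names and the label `root` for the verb are assumptions made for illustration only; they are not the labels or the data model actually used in the treebanks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One word in a dependency analysis (hypothetical structure)."""
    form: str
    relation: str
    head: Optional[int]  # index of the head token in the sentence, None if it has no head


# "John loves Mary": both nouns are linked to the verb.
# The label "root" for the verb is an assumption of this sketch.
sentence = [
    Token("John", "subject", 1),
    Token("loves", "root", None),
    Token("Mary", "object", 1),
]


def dependents(tokens: List[Token], head_index: int) -> List[Token]:
    """Return all tokens whose head is the token at head_index."""
    return [t for t in tokens if t.head == head_index]


def describe(tokens: List[Token]) -> List[str]:
    """Render each token as 'form --relation--> head form'."""
    lines = []
    for t in tokens:
        head_form = tokens[t.head].form if t.head is not None else "(none)"
        lines.append(f"{t.form} --{t.relation}--> {head_form}")
    return lines


if __name__ == "__main__":
    for line in describe(sentence):
        print(line)
```

Storing the head as an index rather than a nested object keeps the sentence a flat list, which makes it easy to look up everything that depends on a given word with `dependents`.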
Our version of dependency grammar is heavily influenced by Lexical-Functional Grammar. This concerns in particular the granularity of argument and non-argument relations and how to distinguish between them, but we have also imported principles for annotating more complex linguistic structures such as raising and control.
The annotation system is documented in our handbook. (Note that the present handbook is a compilation of several individual documents, some of which were written quite some time ago. We are in the process of editing and updating these documents, but for now they will have to do!)
The New Testament texts in Syntacticus have been aligned with the Ancient Greek original. This means that you can browse them side-by-side and see how each word in a translation relates to the Ancient Greek original. This feature is not yet fully implemented on syntacticus.org; if you cannot wait for the complete implementation, you should consult our raw data releases.
All treebank data and other linguistic resources available from Syntacticus have been made available to you by the copyright holders under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 or Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license. In practice, this means you are free to use the data in a non-commercial setting as long as you provide complete attribution. You may also extract a subset of the data or derive a new data set by processing data from Syntacticus, but you must then make it freely available under the same license.
You can also link directly to texts, sentences, dictionaries and lemmas. To do this, click on the yellow Details button and copy the permanent link to the page. This link includes information about the version of the data that you have accessed.
The linguistic data you find here is the product of many people's work. Some of it has been supported by funding bodies; other parts are the product of volunteer efforts by specialists. You can find detailed information about contributors and copyright holders for each linguistic resource by clicking on the yellow Details button on text and dictionary pages. The same page also explains the provenance of the electronic text that the resource builds on and any restrictions associated with it.
Raw data and developer resources
Raw data can be downloaded from the pages of the constituent treebanks, and some of the data has also been converted to Universal Dependencies 2.0. We also provide a toolchain and libraries for reading and manipulating raw treebank data. Some of this is documented in our handbook, and the code is found in our GitHub repositories https://github.com/proiel and https://github.com/mlj. (If you're curious, the code for the Syntacticus website is also available.)
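For the Universal Dependencies conversions, a rough reading sketch is possible with the Python standard library alone, since CoNLL-U (the UD interchange format) is plain text with ten tab-separated columns per token and a blank line between sentences. This is not the project's own toolchain: the function name `read_conllu` and the `SAMPLE` sentence are made up for illustration, and the sample annotation is not taken from any of the treebanks.

```python
from typing import Dict, Iterator, List

# The ten columns defined by the CoNLL-U format.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence in CoNLL-U text as a list of token dicts.

    All values are kept as strings, so multiword-token ranges such as
    "1-2" in the id column pass through unchanged.
    """
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment lines hold sentence-level metadata; ignored in this sketch.
            continue
        else:
            fields = line.split("\t")
            if len(fields) != len(COLUMNS):
                raise ValueError(
                    f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}"
                )
            sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


# Invented sample data, not from the treebanks.
SAMPLE = (
    "# text = John loves Mary\n"
    "1\tJohn\tJohn\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tloves\tlove\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tMary\tMary\tPROPN\t_\t_\t2\tobj\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        for tok in sent:
            print(tok["form"], tok["deprel"], tok["head"])
```

To use this on a downloaded file, read its contents as UTF-8 text and pass the string to `read_conllu`. For anything beyond quick inspection, the project's own libraries or an established CoNLL-U parser are likely a better choice.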