Welcome to Syntacticus! Syntacticus is still in development and not quite ready for general use yet. Please be patient while we iron out the last issues! (All errors are logged and we will try to address them as soon as possible.)

About Syntacticus

Syntacticus provides easy access to around a million morphosyntactically annotated sentences from a range of early Indo-European languages.

Syntacticus is an umbrella project for the PROIEL Treebank, the TOROT Treebank and the ISWOC Treebank, which all use the same annotation system and share similar linguistic priorities. In total, Syntacticus contains 80,138 sentences (936,874 tokens) in 10 languages.

We are constantly adding new material to Syntacticus. The ultimate goal is to have a representative sample of different text types from each branch of early Indo-European. We maintain lists of the texts we are currently working on, which you can find on the PROIEL Treebank and the TOROT Treebank pages. This is extremely time-consuming work, so please be patient!

The current focus for Syntacticus is to consolidate and edit our documentation so that it is easier to approach. We are very aware that the current documentation is inadequate! New features and better integration with our development toolchain are also planned for the near future.

Language               Size
Ancient Greek          250,449 tokens
Latin                  202,140 tokens
Classical Armenian      23,513 tokens
Gothic                  57,211 tokens
Portuguese              36,595 tokens
Spanish                 54,661 tokens
Old English             29,406 tokens
Old French               2,340 tokens
Old Russian            209,334 tokens
Old Church Slavonic     71,225 tokens
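
As a quick sanity check, the per-language figures can be added up and compared with the total quoted earlier. The sketch below copies the counts from the table; the variable and function names are only illustrative.

```python
# Token counts per language, copied from the table above.
token_counts = {
    "Ancient Greek": 250_449,
    "Latin": 202_140,
    "Classical Armenian": 23_513,
    "Gothic": 57_211,
    "Portuguese": 36_595,
    "Spanish": 54_661,
    "Old English": 29_406,
    "Old French": 2_340,
    "Old Russian": 209_334,
    "Old Church Slavonic": 71_225,
}

total = sum(token_counts.values())


def share(language: str) -> float:
    """Return a language's share of all tokens, as a percentage."""
    return 100 * token_counts[language] / total


print(len(token_counts), "languages,", f"{total:,} tokens")
# List the languages from largest to smallest with their share of the total.
for language, count in sorted(token_counts.items(), key=lambda kv: -kv[1]):
    print(f"{language:<20} {count:>8,} {share(language):5.1f}%")
```

The first line printed is `10 languages, 936,874 tokens`, which agrees with the total given in the introduction.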

Annotation principles

In Syntacticus each text has been split into words, and then each word has been

  1. lemmatised (i.e. linked to its dictionary entry),
  2. assigned a part of speech (i.e. classified as noun, verb etc.),
  3. assigned morphological features (e.g. tagged with its case form or its tense), and
  4. given a syntactic function and linked to one or more other words (e.g. the subject of a verb has been labelled a subject and linked to the verb).
This has all been done manually by a language specialist and then verified by another specialist.
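
To make the four annotation layers concrete, here is a minimal sketch of how one annotated sentence could be held in memory. The class, the field names and the feature labels are hypothetical and chosen for readability; they are not the tag set or data model that the treebanks actually use, and the Latin sentence is a made-up example rather than one taken from the data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    form: str                 # the word as it appears in the text
    lemma: str                # 1. dictionary entry
    part_of_speech: str       # 2. part of speech
    morphology: dict = field(default_factory=dict)  # 3. morphological features
    relation: str = ""        # 4. syntactic function ...
    head: Optional[int] = None  # ... and index of the word it is linked to


# "puella rosam amat" ("the girl loves the rose"), annotated by hand.
sentence = [
    Token("puella", "puella", "noun",
          {"case": "nominative", "number": "singular"}, "subject", 2),
    Token("rosam", "rosa", "noun",
          {"case": "accusative", "number": "singular"}, "object", 2),
    Token("amat", "amo", "verb",
          {"tense": "present", "person": "3", "number": "singular"},
          "predicate", None),
]

for token in sentence:
    head = sentence[token.head].form if token.head is not None else "(none)"
    print(f"{token.form}: lemma={token.lemma}, pos={token.part_of_speech}, "
          f"relation={token.relation}, head={head}")
```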

You can use this information in a number of ways. For example, if you know Latin but need help understanding the structure of a complex sentence, you can look up the specialist's analysis of that sentence.

The lemmatisation, parts of speech and morphology broadly speaking follow the same principles as standard reference grammars of Indo-European languages. In some situations we have adopted a different approach, which is more in line with modern formal linguistic thinking. This is the case in particular for various function words (such as subordinators, subjunctions, particles and interjections), which reference grammars tend to disagree on.

The syntactic annotation is based on the principles of dependency grammar. Each word is assigned a function, called a relation, and then linked to its head. For the English sentence John loves Mary, for example, John would have the relation subject and its head would be the verb loves because it is the subject of that verb. Mary would be object and its head would also be loves.
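
The John loves Mary example can be written down as a small data structure. This is only a sketch: the tuple layout and the function name are invented for illustration, and the label given to the verb itself ("predicate") is an assumption, since the text above only names the relations for John and Mary.

```python
# Each word is (form, relation, index of its head); the main verb has no head.
words = [
    ("John", "subject", 1),
    ("loves", "predicate", None),
    ("Mary", "object", 1),
]


def dependents(head_index):
    """Return (relation, form) pairs for every word linked to the given head."""
    return [(relation, form) for form, relation, head in words
            if head == head_index]


# Everything that depends on "loves" (index 1).
print(dependents(1))  # [('subject', 'John'), ('object', 'Mary')]
```

Because every word points to at most one head, looking up the dependents of a word is enough to walk the whole sentence structure from the verb downwards.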

Our version of dependency grammar is heavily influenced by Lexical-Functional Grammar. This concerns in particular the granularity of argument and non-argument relations and how to distinguish between them, but we have also imported principles for annotating more complex linguistic structures such as raising and control.

The annotation system is documented in our handbook. (Note that the present handbook is a compilation of several individual documents, some of which were written quite some time ago. We are in the process of editing and updating these documents, but for now they will have to do!)

Some of the texts also have information-structure annotation. It is not yet possible to browse or query this from syntacticus.org, but the annotation is available in our raw data releases.

The New Testament texts in Syntacticus have been aligned with the Ancient Greek original. This means that you can browse them side-by-side and see how each word in a translation relates to the Ancient Greek original. This feature is not yet fully implemented on syntacticus.org; if you cannot wait for the complete implementation, you should consult our raw data releases.
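
Word-level alignment of this kind can be pictured as a mapping from each word in the translation to a word in the original. The sketch below uses the opening words of John 1:1 in Greek and in the Latin Vulgate purely as an illustration; the links were written by hand for this example and are not taken from the treebank data, and the function names are hypothetical.

```python
greek = ["Ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"]
latin = ["In", "principio", "erat", "Verbum"]

# Hand-made alignment: index in the translation -> index in the Greek.
alignment = {0: 0, 1: 1, 2: 2, 3: 4}


def side_by_side(translation, original, links):
    """Pair each translated word with its aligned original word (or None)."""
    rows = []
    for i, word in enumerate(translation):
        j = links.get(i)
        rows.append((word, original[j] if j is not None else None))
    return rows


def unaligned(original, links):
    """Return the words of the original that no translated word points to."""
    linked = set(links.values())
    return [word for j, word in enumerate(original) if j not in linked]


for latin_word, greek_word in side_by_side(latin, greek, alignment):
    print(f"{latin_word:<10} {greek_word}")
print("no counterpart:", unaligned(greek, alignment))
```

In this example the Greek article ὁ is left without a counterpart, because Latin has no article; a side-by-side view needs to handle such gaps in either direction.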

Licensing

All treebank data and other linguistic resources available from Syntacticus have been made available to you by the copyright holders under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 or Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license. In practice, this means you are free to use the data in a non-commercial setting as long as you provide complete attribution. You may also extract a subset of the data or derive a new data set by processing data from Syntacticus, but you must then make it freely available under the same license.

If you use this data in academic work, we ask that you cite the publication that the treebank editor has listed on their website. Please see the pages for the PROIEL Treebank, the TOROT Treebank and the ISWOC Treebank for this information.

You can also link directly to texts, sentences, dictionaries and lemmas. To do this, click on the yellow Details button and copy the permanent link to the page. This link includes information about the version of the data that you have accessed.

The linguistic data you find here is the product of many people's work. Some of it has been supported by funding bodies; other parts are the product of volunteer efforts by specialists. You can find detailed information about contributors and copyright holders for each linguistic resource by clicking on the yellow Details button on text and dictionary pages. This also explains the provenance of the electronic text that the resource builds on and any restrictions associated with it.

Raw data and developer resources

Raw data can be downloaded from the pages of the constituent treebanks, and some of the data has also been converted to Universal Dependencies 2.0. We also provide a toolchain and libraries for reading and manipulating raw treebank data. Some of this is documented in our handbook, and the code is found in our GitHub repositories https://github.com/proiel and https://github.com/mlj. (If you're curious, the code for the Syntacticus website is also available.)
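
If you only need the Universal Dependencies conversions, the CoNLL-U format they are distributed in is simple enough to read without any special tooling. The following is a minimal sketch using only the Python standard library, not our toolchain; the sample sentence is made up for illustration, and the reader does nothing special with multiword tokens or empty nodes.

```python
# The ten tab-separated columns of a CoNLL-U token line.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# A made-up sample; real files are downloaded from the treebank pages.
SAMPLE = "\n".join([
    "# text = John loves Mary",
    "\t".join(["1", "John", "John", "PROPN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "loves", "love", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "Mary", "Mary", "PROPN", "_", "_", "2", "obj", "_", "_"]),
    "",
])


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):    # comment and metadata lines are skipped
            continue
        else:
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:                       # last sentence without trailing blank line
        sentences.append(current)
    return sentences


for token in read_conllu(SAMPLE)[0]:
    print(token["form"], token["deprel"], "-> head", token["head"])
```

To read a downloaded file, pass its contents to `read_conllu`, for example `read_conllu(open(path, encoding="utf-8").read())`, where `path` is wherever you saved the file.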

Learn more!

The definitive reference manual for Syntacticus is our handbook. If you have questions, you can talk to us on Gitter and we will try to reply as soon as possible.