We will discuss an approach to storing and processing information and share some thoughts on creating a development platform in this new paradigm. What for? To develop faster and in shorter iterations: sketch your project, make sure it is what you had in mind, refine it, and then keep refining the result.

A quintet has four properties: type, value, parent, and order among its peers. Together with the identifier, that makes five components. This is the simplest universal form for recording information, a new standard that could potentially fit any programming demand. Quintets are stored in a unified structure in the file system, as a continuous, homogeneous, indexed bulk of data. The quintet data model is a data model that describes any data structure as a single interconnected list of basic types and the terms based on them (metadata), together with the instances of objects stored according to this metadata (data).

Half a minute of lyrics

A quintet is not only information; it can also represent executable code. But above all, it is the data that you want to record, store, and retrieve. Since in our case quintets are directly addressable, interconnected and indexed, we will store them in a kind of database.

Why Quintet instead of Byte?

Not a bit or an electronic impulse that orients a magnetic spin.

We are accustomed to measuring data in bytes, whether it is the size of a document or photo, an internet traffic limit, or the available space on a mobile device. We propose another measure, the quintet, which does not have a fixed size as a byte does but represents an atomic amount of data that is of some value to the user.

For example, you can say that your database occupies 119 megabytes of storage, or you can state that it stores 1.37 mega-quintets. You do not care much what a byte is in this context, but you understand that this database contains 1.37 million of your term descriptions, objects, their attributes, links, events, queries with their details, and so on. Possessing 1.37 million valuable pieces of data sounds sexier than having 119 megabytes of stuff.

Thus, this is not to replace the way the information is stored on the data medium, but to shift to another level of abstraction.


Quintet structure

The main idea of this article is to replace machine types with human terms and to replace variables with objects. Not the kind of objects that need a constructor, destructor, interfaces, and a garbage collector, but crystal-clear units of information that a customer handles. That is, if the customer says «Client», saving the essence of this statement on the medium should not require the expertise of a programmer.

It makes sense to focus the user’s attention only on the value of the object, while its type, parent, order (among equals in subordination) and identifier should be obvious from the context or simply hidden. This means that the user does not need to know anything about quintets at all: they simply state a task, make sure that it has been accepted correctly, and then start its execution.

Basic concepts

There is a set of data types everyone understands: string, number, file, text, date, and so on. Such a simple set is quite enough to sketch the solution, and to «program» it along with the terms necessary for its implementation. The basic types represented by quintets may look like this:

In this case, some of the components of the quintet are not used, while the quintet itself is used as the basic type. This makes the system kernel easier to navigate when collecting metadata.
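As a rough illustration of basic types written down as quintets, here is a minimal Python sketch. The `Quintet` class, its field names and the IDs 1 to 5 are assumptions made for this example; the article does not prescribe them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quintet:
    """One record: an identifier plus type, value, parent and order among peers."""
    id: int
    type: Optional[int]    # ID of the quintet that describes this one's type
    value: Optional[str]   # the payload the user actually cares about
    parent: Optional[int]  # ID of the owning quintet, None at the top level
    order: Optional[int]   # position among siblings under the same parent


# Hypothetical basic types: the IDs are made up for this example.
# Type, parent and order stay unused; the quintet itself acts as the type.
BASIC_TYPES = [
    Quintet(id=1, type=None, value="string", parent=None, order=None),
    Quintet(id=2, type=None, value="number", parent=None, order=None),
    Quintet(id=3, type=None, value="date", parent=None, order=None),
    Quintet(id=4, type=None, value="text", parent=None, order=None),
    Quintet(id=5, type=None, value="file", parent=None, order=None),
]

# A lookup the kernel could use while collecting metadata.
basic_type_names = {q.id: q.value for q in BASIC_TYPES}
```

Leaving the unused components empty keeps every record in the same five-field shape, which is what lets one list hold both types and data.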

The background

Because of the analytic gap between the user and the programmer, concepts are significantly deformed at the stage of outlining a project. Understatement, incomprehensibility and unsolicited initiative often turn a simple and reasonable idea of the customer into a logically impossible mess when evaluated from the user’s point of view.

Knowledge transfer should occur without loss and distortion. Furthermore, when organizing the storage of this knowledge, you had better get rid of the restrictions imposed by the chosen data management system.

How we store the data now

Typically, there are many databases on a server; each of them contains a description of a data schema with a specific set of details, that is, logically interconnected data. They are stored on the data medium in a specific order, ideally one that minimizes retrieval effort.

The proposed information storage system is a compromise between various well-known methods: column-oriented, relational and NoSQL. It is designed to solve the tasks usually performed by one of these approaches.

For example, the theory of column-oriented DBMS looks beautiful: we read only the desired column, but not all the rows of records as a whole. However, in practice, it is unlikely that data will be placed on the media so that it is convenient to retrieve dozens of different analytic dimensions. Note that attributes and analytical metrics can be added and removed, sometimes faster than we can rebuild our columnar storage. Not to mention that the data in the database can be amended, which will also violate the beauty of the storage schema due to inevitable fragmentation.

Metadata

We introduce a concept, the term, to describe any object that we operate with: entity, property, request, file, and so on. We will define all the terms that we use in our business area. With their help, we will describe all entities that have details, including the form of the relationships between entities. For example, an attribute can be a link to a status dictionary entry. A term is written as a quintet of data.

A set of term descriptions is metadata, just like the structure of tables and fields in a regular database. For example, consider the following data structure: a service request made on some date that has content (the request description) and a status, and to which the participants of a production process add dated comments. In a traditional database constructor it will look something like this:

Since we decided to hide all non-essential details, such as binding IDs, from the user, the scheme is somewhat simplified: the mentions of IDs are removed, and the names of entities are combined with their key values.


The user «draws» the task: a request from today’s date which has a state (reference value) and to which you can add comments indicating the date:


Now we see 6 different data fields instead of 9, and the whole scheme asks us to read and comprehend 7 words instead of 13. This is not the main thing, of course.


The following are the quintets generated by the quintet-processing kernel to describe this structure:


For clarity, explanations are shown, highlighted in gray, in place of quintet values. These fields are not filled in, because all the necessary information is unambiguously determined by the remaining components.

See how quintets are related

What we have here:


  • the attributes with IDs 80, 81 and 83 have the same parent — Request

  • quintet #82 is the attribute of Comment, which is in turn an attribute of Request

  • attribute #74 is a reference to the type described by quintet #73 and is used as attribute #81 of Request
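The relations listed above can be written down as plain data. The sketch below is one possible reading, not the kernel's actual output: the quintet IDs 73, 74 and 80 to 83 come from the text, while the tuple layout, the basic-type IDs and the IDs chosen for Request, Comment, Content and Date are assumptions made for illustration.

```python
from collections import namedtuple

Quintet = namedtuple("Quintet", "id type value parent order")

# Hypothetical IDs, chosen only for this sketch.
STRING, TEXT, DATE = 3, 4, 5      # basic types
CONTENT, COMMENT_DATE = 76, 77    # helper terms
REQUEST, COMMENT = 78, 79         # the two entities

metadata = [
    Quintet(73, STRING, "Status", None, None),     # status dictionary term
    Quintet(74, 73, None, None, None),             # reference to the type described by #73
    Quintet(CONTENT, TEXT, "Content", None, None),
    Quintet(COMMENT_DATE, DATE, "Date", None, None),
    Quintet(REQUEST, DATE, "Request", None, None),
    Quintet(COMMENT, TEXT, "Comment", None, None),
    Quintet(80, CONTENT, None, REQUEST, 1),        # attribute of Request
    Quintet(81, 74, None, REQUEST, 2),             # Status reference, attribute of Request
    Quintet(82, COMMENT_DATE, None, COMMENT, 1),   # attribute of Comment
    Quintet(83, COMMENT, None, REQUEST, 3),        # Comment, itself an attribute of Request
]
by_id = {q.id: q for q in metadata}


def attributes_of(term_id):
    """Return the attribute quintets of a term, in their declared order."""
    return sorted((q for q in metadata if q.parent == term_id),
                  key=lambda q: q.order)


def label(attr):
    """Name an attribute by following its type, plus one more hop for references."""
    target = by_id[attr.type]
    if target.value is None:      # a reference quintet such as #74
        target = by_id[target.type]
    return target.value


request_fields = [label(a) for a in attributes_of(REQUEST)]
```

Note that the attribute quintets carry no value of their own: their names are recovered from the type they point to, which is the sense in which the empty fields are "determined by the remaining components".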


This might look slightly complicated for humans, but the good news is — a human will never see this. The kernel will represent the metadata as comprehensible diagrams and the data as simple flat tables.

User data

Let me show how we store such a data set for the above task:


The data itself is stored in quintets according to the metadata. We can visualize them the same way we did above:

We see a familiar hierarchical structure written down using something like the Adjacency List method.
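To make the adjacency-list idea concrete, here is a small Python sketch that stores a request and its details as a flat list and walks it back into a hierarchy. The rows, the type IDs and the helper names are invented for this example.

```python
from collections import defaultdict, namedtuple

Quintet = namedtuple("Quintet", "id type value parent order")

# Hypothetical type IDs standing in for terms defined in the metadata.
REQUEST, CONTENT, STATUS_REF, COMMENT, COMMENT_DATE = 78, 76, 74, 79, 77

# Made-up sample rows: each one points at its parent, as in an adjacency list.
rows = [
    Quintet(1001, REQUEST, "2020-03-01", None, None),
    Quintet(1002, CONTENT, "Printer is out of toner", 1001, 1),
    Quintet(1003, STATUS_REF, "Open", 1001, 2),
    Quintet(1004, COMMENT, "Ordered a cartridge", 1001, 3),
    Quintet(1005, COMMENT_DATE, "2020-03-02", 1004, 1),
]

by_id = {q.id: q for q in rows}
children = defaultdict(list)
for q in rows:
    children[q.parent].append(q)
for siblings in children.values():
    siblings.sort(key=lambda q: q.order or 0)


def flatten(root_id):
    """Walk the adjacency list depth-first, returning (depth, value) pairs."""
    out = []
    stack = [(by_id[root_id], 0)]
    while stack:
        q, depth = stack.pop()
        out.append((depth, q.value))
        # Push children in reverse so the lowest order is visited first.
        for child in reversed(children[q.id]):
            stack.append((child, depth + 1))
    return out


tree = flatten(1001)
```

Printing each value indented by its depth gives the hierarchical view of the request, its attributes and its comments.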


Physical storage

The data is written to memory as a sequence of quintets, each item stored as bytes of data. To search by index, the kernel interprets those bytes according to the data type that the basic types define for them.

That’s it: a huge list of five-item data records.


The storage principles are not much different from those of an RDBMS, which lets us build SQL queries against the data for retrieval, JOINs, aggregate functions and the other things we like in relational databases.


To test the prototype of a development platform based on the quintet storage system, we use a relational database.
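As a hedged sketch of what quintets in a relational database might look like, the following uses Python's built-in `sqlite3` module. The table name, the column names (`val` and `ord`), the type IDs and the sample rows are assumptions for illustration, not the prototype's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE quintet (
        id     INTEGER PRIMARY KEY,
        type   INTEGER,
        val    TEXT,
        parent INTEGER,
        ord    INTEGER
    )
""")

REQUEST, CONTENT, COMMENT = 78, 76, 79  # hypothetical type IDs
conn.executemany("INSERT INTO quintet VALUES (?, ?, ?, ?, ?)", [
    (1001, REQUEST, "2020-03-01", None, None),
    (1002, CONTENT, "Printer is out of toner", 1001, 1),
    (1003, COMMENT, "Ordered a cartridge", 1001, 2),
    (1004, COMMENT, "Cartridge replaced", 1001, 3),
])

# One self-JOIN per attribute turns the quintet list back into a flat table;
# the aggregate counts the comments attached to each request.
report = conn.execute("""
    SELECT r.val AS request_date, c.val AS content, COUNT(m.id) AS comments
    FROM quintet r
    JOIN quintet c ON c.parent = r.id AND c.type = ?
    LEFT JOIN quintet m ON m.parent = r.id AND m.type = ?
    WHERE r.type = ?
    GROUP BY r.id, c.id
""", (CONTENT, COMMENT, REQUEST)).fetchall()
```

The type ID plays the role of both table name and column name here: filtering on `r.type` selects the "table", and each joined alias filtered by type selects a "field".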

Performance

The above example is very simple, but what happens when the structure is a thousand times more complex and there are gigabytes of data?


What we need:


  1. The discussed hierarchical structure — 1 pc.

  2. B-tree for searching by ID, parent and type — 3 pcs.


Thus, all records in our database will be indexed, including both data and metadata. Such indexing is necessary to get the benefits of a relational database — the simplest and most popular tool. The parent index is actually composite (parent ID + type). The index by type is also composite (type + value) for quick search of objects of a given type.

Metadata allows us to get rid of recursion: for example, to find all the details of a given object, we use the index by parent ID. If you need to search for objects of a certain type, then we use the index by type ID. Type is an analog of a table name and a field in a relational DBMS.

In any case, we do not scan the entire data set, and even with a large number of values of any type, the desired value can be found in a small number of steps.
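The three indexes described above could be declared as follows in SQLite, used here only as a convenient stand-in for whatever engine the kernel runs on. The table layout, index names and sample rows are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quintet (
        id     INTEGER PRIMARY KEY,   -- the table itself is a B-tree keyed by ID
        type   INTEGER,
        val    TEXT,
        parent INTEGER,
        ord    INTEGER
    );
    -- Composite parent index: all details of a given object, by type.
    CREATE INDEX idx_parent_type ON quintet (parent, type);
    -- Composite type index: objects of a given type, by value.
    CREATE INDEX idx_type_val ON quintet (type, val);
""")

REQUEST, COMMENT = 78, 79  # hypothetical type IDs
conn.executemany("INSERT INTO quintet VALUES (?, ?, ?, ?, ?)", [
    (1001, REQUEST, "2020-03-01", None, None),
    (1002, COMMENT, "Ordered a cartridge", 1001, 1),
    (1003, COMMENT, "Cartridge replaced", 1001, 2),
    (1004, REQUEST, "2020-03-05", None, None),
])

# Details of one object: served by the (parent, type) index.
details = conn.execute(
    "SELECT val FROM quintet WHERE parent = ? AND type = ? ORDER BY ord",
    (1001, COMMENT),
).fetchall()

# Objects of a type with a given value: served by the (type, val) index.
found = conn.execute(
    "SELECT id FROM quintet WHERE type = ? AND val = ?",
    (REQUEST, "2020-03-01"),
).fetchall()

# Ask the planner how it would run the second query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM quintet WHERE type = ? AND val = ?",
    (REQUEST, "2020-03-01"),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
```

Inspecting `plan_text` shows whether the query is answered through an index search rather than a full scan of the quintet list.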


The basis for the development platform

By itself, such a database is not sufficient for application programming and is not, as they say, Turing complete. However, we are talking here not only about the database but about covering all aspects: objects are, among other things, arbitrary control algorithms that can be launched, and they will work.

As a result, instead of complex database structures and separately stored source code of control algorithms, we get a uniform information field, limited by the volume of the storage space and governed by metadata. The data itself is presented to the user in a form they can understand: the structure of the subject area and the corresponding entries in it. The user can change the structure and data at will, including bulk operations on them.

We did not invent anything new: all the data is already stored in the file system, and searching it is carried out using B-trees, either in the file system or in the database. We just reorganized the presentation of the data so that it is easier and clearer to work with.

To work with this data representation, you need a very compact piece of kernel software. Our database engine is smaller than a computer BIOS, so it can be made, if not in hardware, then at least as fast and bug-free as possible. For security reasons, it could also be read-only.

When adding a new class to an assembly in my favorite .NET, we can observe the loss of 200-300 MB of RAM on the definition of this class alone. These megabytes will not fit into the cache of the proper level, causing the system to swap to disk with all the consequent overhead. The situation with Java is similar. The description of the same class with quintets will take tens or hundreds of bytes, since the class uses only primitive operations for working with data that the kernel already knows.

You might think that this approach has already been implemented many times in various applications, but that is not true.

We searched thoroughly both the internet and intellectual property (patent) databases, and no one claims to offer exactly the same solution for breaking the performance limit of constructors and other EAV-based systems. Nevertheless, we put hundreds of gigabytes into such a quintet application and found it working quite well. If you want to see the evidence or create and test your own instance, feel free to visit our GitHub account.

The prototype of the platform we built has four components:


  1. Visual type editor to define the metadata

  2. Data navigation tool like a simple SQL navigator

  3. Visual report designer to build SQL queries to the data

  4. A template processor to combine templates with data retrieved by queries


As intended, no user working with the prototype would think there are quintets inside: it looks just like an ordinary constructor.


You may test a working prototype implementation by the link in the to this article.

How to deal with different formats: RDBMS, NoSQL, column bases

Thus, for a columnar DB, we can significantly save the space occupied by quintets: use only one or two of its components to store useful data instead of five, and also use the index only to indicate the beginning of data chains. In many cases, only the index will be used for sampling from our analogue of a columnar base, without the need to access the data of the quintet list itself.

It should be noted that the idea is not intended to collect all the advanced developments from these three types of databases. On the contrary, the engine of the new system will be reduced as much as possible, embodying only the necessary minimum of functions — everything that covers DDL and DML requests in the concept described here.

Programming paradigm

The described approach is not limited only to the usage of quintets, but promotes a different paradigm than the one that programmers are used to. Instead of an imperative, declarative, or object language, we propose the query language as more familiar to humans and allowing us to set the task directly to the computer, bypassing programmers and the impenetrable layer of existing development environments.

Of course, a translator from a layman user's language to a language of clear requirements will still be necessary in most cases.

This topic will be described in more detail in separate articles with examples and existing developments.


In short, it works as follows:


  1. We once described primitive data types using quintets: string, number, file, text and others, and also trained the kernel to work with them. Training means the correct presentation of data and the implementation of simple operations with them.

  2. Now we describe user terms (data types) — in the form of metadata. The description is just specifying a primitive data type for each user type and determining the relations.

  3. We enter data quintets according to the structure specified by metadata. Each quintet of data contains a link to its type and parent, which allows you to quickly find it in the data storage.

  4. The kernel tasks come down to fetching data and performing simple operations with them to implement arbitrarily complex algorithms defined by the user.

  5. The user manages data and algorithms using a visual interface that presents both of them.


The Turing completeness of the entire system is ensured by the embodiment of the basic requirements: the kernel can do sequential operations, conditionally branch, process the data and stop work when a certain result is achieved.


For a person, the benefit is simplicity of perception. For example, instead of declaring a loop involving variables


for (i = 0; i 

a more understandable form is used, like


with every A, that match a condition, do something
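The contrast between the two forms can be sketched in Python. The `with_every` helper and the sample data are hypothetical and only mirror the phrasing above; they are not part of any described kernel.

```python
# Made-up sample data for the illustration.
requests = [
    {"id": 1, "status": "Open"},
    {"id": 2, "status": "Closed"},
    {"id": 3, "status": "Open"},
]

# Imperative form: the index variable and the bounds are the reader's problem.
open_ids_loop = []
i = 0
while i < len(requests):
    if requests[i]["status"] == "Open":
        open_ids_loop.append(requests[i]["id"])
    i += 1


def with_every(items, matches, do):
    """Apply `do` to every item that satisfies `matches`; return the results."""
    return [do(item) for item in items if matches(item)]


# Query-like form: say what to select and what to do with it.
open_ids_query = with_every(
    requests,
    matches=lambda r: r["status"] == "Open",
    do=lambda r: r["id"],
)
```

Both variants produce the same list; the second simply leaves the iteration mechanics to the helper.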

We dream of abstracting from the low-level subtleties of information systems: loops, constructors, functions, manifests, libraries — all these take up too much space in the brain of a programmer, leaving little room for creative work and development.


Scalability

An application is often useless without a means of scaling: an unlimited ability to expand the load capacity of an information system is required. In the described approach, given the extreme simplicity of the data organization, scaling is no more complicated to organize than in existing architectures.

In the above example with the service requests, you can partition them, for example, by ID, generating IDs with fixed high-order bits for each server. That is, when 32 bits are used to store an ID, the leftmost two, three, four or more bits, as needed, indicate the server on which these requests are stored. Thus, each server has its own pool of IDs.

The kernel of a single server can function independently of other servers, without knowing anything about them. When an object is created, priority is given to the server with the fewest IDs in use, to ensure even load distribution.
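A minimal sketch of such an ID scheme, assuming 32-bit IDs with the top 4 bits reserved for the server number. The bit split and all function names are choices made for this example; the text leaves the number of server bits open.

```python
ID_BITS = 32
SERVER_BITS = 4                      # assumption: could be two, three or more
LOCAL_BITS = ID_BITS - SERVER_BITS   # bits left for the per-server counter


def make_id(server, local):
    """Combine a server number and a local counter into one 32-bit ID."""
    if not 0 <= server < (1 << SERVER_BITS):
        raise ValueError("server number does not fit in the reserved bits")
    if not 0 <= local < (1 << LOCAL_BITS):
        raise ValueError("local counter exhausted this server's ID pool")
    return (server << LOCAL_BITS) | local


def server_of(quintet_id):
    """Recover the server number from the high-order bits of an ID."""
    return quintet_id >> LOCAL_BITS


def pick_server(ids_used):
    """Choose the server with the fewest IDs in use (first one wins a tie)."""
    return min(ids_used, key=ids_used.get)


ids_used = {0: 120, 1: 45, 2: 300}   # made-up usage counters per server
target = pick_server(ids_used)
new_id = make_id(target, ids_used[target])
```

Because the server number is encoded in the ID itself, any component can route a lookup by ID without consulting a directory.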

Given a limited set of possible variations of requests and responses in such data organization, you will need a fairly compact dispatcher that distributes requests across servers and aggregates their results.
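The dispatcher's job can be illustrated with a toy scatter-gather sketch. The `Server` and `Dispatcher` classes, the 28-bit local ID part and the sample rows are all assumptions made for this example rather than a description of an existing component.

```python
LOCAL_BITS = 28  # assumption: 32-bit IDs with the top 4 bits naming the server


class Server:
    """Stand-in for one independent kernel holding its own pool of IDs."""

    def __init__(self, rows):
        self.rows = {row[0]: row for row in rows}   # id -> (id, type, value)

    def get(self, quintet_id):
        return self.rows.get(quintet_id)

    def find_by_type(self, type_id):
        return [row for row in self.rows.values() if row[1] == type_id]


class Dispatcher:
    """Routes requests to servers and aggregates their answers."""

    def __init__(self, servers):
        self.servers = servers

    def get(self, quintet_id):
        # The high bits of the ID name the server, so only one server is asked.
        return self.servers[quintet_id >> LOCAL_BITS].get(quintet_id)

    def find_by_type(self, type_id):
        # No single server knows the full answer: ask all and merge the results.
        found = []
        for server in self.servers:
            found.extend(server.find_by_type(type_id))
        return sorted(found)


REQUEST = 78  # hypothetical type ID
server0 = Server([((0 << LOCAL_BITS) | 1, REQUEST, "2020-03-01")])
server1 = Server([((1 << LOCAL_BITS) | 1, REQUEST, "2020-03-05")])
dispatcher = Dispatcher([server0, server1])

one = dispatcher.get((1 << LOCAL_BITS) | 1)
everything = dispatcher.find_by_type(REQUEST)
```

Lookups by ID touch a single server, while searches by type fan out to all servers and are merged by the dispatcher.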





