There is a set of variables (tensors) attached to modules (nodes), each of which computes an output from its variables and the input data. Modules can be combined into small containers (graphs), either in parallel or in sequence. This restriction is partly due to limited knowledge of which architectures are effective and sufficient. The other reason is the need to keep full control over training when working with exotic architectures, including training schemes that involve several networks and no trainer.
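The module and container idea can be illustrated with a minimal sketch in plain Python. All class names here (`Linear`, `ReLU`, `Sequential`, `Parallel`) are hypothetical and chosen for illustration only. Scalars stand in for tensors, and the parallel container is assumed to sum the outputs of its branches, which is only one of several possible ways to merge them.

```python
# Hypothetical sketch: scalars stand in for tensors, names are illustrative.

class Linear:
    """Module (node) that owns two variables and computes weight * x + bias."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def forward(self, x):
        return self.weight * x + self.bias


class ReLU:
    """Module without variables: clips negative inputs to zero."""

    def forward(self, x):
        return max(0.0, x)


class Sequential:
    """Container that feeds the output of each module into the next one."""

    def __init__(self, *modules):
        self.modules = modules

    def forward(self, x):
        for module in self.modules:
            x = module.forward(x)
        return x


class Parallel:
    """Container that gives every branch the same input.

    Assumption: branch outputs are merged by summation.
    """

    def __init__(self, *modules):
        self.modules = modules

    def forward(self, x):
        return sum(module.forward(x) for module in self.modules)


# Two linear branches in parallel, followed by a nonlinearity in sequence.
net = Sequential(
    Parallel(Linear(2.0, 1.0), Linear(-1.0, 0.5)),
    ReLU(),
)
out = net.forward(3.0)  # (2*3 + 1) + (-1*3 + 0.5) = 4.5, unchanged by ReLU
```

Because a container exposes the same `forward` interface as a module, containers can be nested inside one another, which is what makes the small graphs composable.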
The standard method of operation generally requires:
1. Data (input and expected output)
2. A model (network)
3. A training criterion
If these elements are present, an optimizer is selected. It always implements gradient descent; optimizers differ only in how much additional memory can be spared for information used to correct the gradient. This information is based on gradient statistics, for example, rather than on higher-order derivatives, because higher-order optimization methods are not very practical in Deep Learning: they do not offer anything fundamentally different, are not friendly to hardware acceleration, and can be replaced with powerful modules and regularization practices that speed up learning.
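The three ingredients and the optimizer can be put together in a minimal sketch, again in plain Python. The toy data set, the linear model, the `MomentumSGD` class, and its hyperparameters are all assumptions made for illustration. Momentum is used here as one example of a gradient statistic: the optimizer keeps one extra number per variable (a running sum of past gradients) and uses it to correct the raw gradient step.

```python
# Hypothetical sketch: data, model, criterion, and optimizer are illustrative.

# 1. Data: inputs and expected outputs sampled from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

# 2. Model: a single linear unit with two variables.
params = {"w": 0.0, "b": 0.0}


def model(x, p):
    return p["w"] * x + p["b"]


# 3. Criterion: mean squared error, returned together with its gradients.
def mse_and_gradients(p, batch):
    n = len(batch)
    loss, grad_w, grad_b = 0.0, 0.0, 0.0
    for x, y in batch:
        err = model(x, p) - y
        loss += err * err / n
        grad_w += 2.0 * err * x / n
        grad_b += 2.0 * err / n
    return loss, {"w": grad_w, "b": grad_b}


class MomentumSGD:
    """Gradient descent with one extra stored value per variable.

    The velocity buffer is the additional memory: a decaying sum of past
    gradients. With momentum=0.0 this reduces to plain gradient descent.
    """

    def __init__(self, lr=0.05, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = {}

    def step(self, p, grads):
        for name, grad in grads.items():
            v = self.momentum * self.velocity.get(name, 0.0) + grad
            self.velocity[name] = v
            p[name] -= self.lr * v


optimizer = MomentumSGD()
for _ in range(500):
    loss, grads = mse_and_gradients(params, data)
    optimizer.step(params, grads)

final_loss, _ = mse_and_gradients(params, data)
```

After training, `params["w"]` and `params["b"]` should be close to 2 and 1. Optimizers that keep more statistics, such as a running average of squared gradients, follow the same pattern with more stored values per variable.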