Advancements in machine learning techniques have promoted the use of deep neural networks (DNNs) for supervised speech enhancement. However, the benefits of DNNs, namely the absence of explicit noise-statistics assumptions and their nonlinear modelling capacity, come at the expense of increased computational complexity during training and inference, which is an issue for applications with real-time constraints, such as hearing aids. In contrast to the conventional approach, which separately models feature extraction and temporal dependencies through a sequence of convolutional layers followed by a fully-connected recurrent layer, this work promotes the use of convolutional recurrent network layers for single-channel speech enhancement. Thereby, temporal correlations among the inherently extracted spectral feature vectors are exploited, while the parameter set to be estimated is further reduced relative to the conventional method. The proposed method is compared to a recent low-algorithmic-delay architecture. The models were trained in a speaker-independent fashion on the NSDTSEA data set, which comprises different environmental noises. While objective speech quality and intelligibility measures of the two architectures are similar, the number of network parameters in the suggested enhancement method could be reduced by 50-60%. This reduction is highly beneficial for storage- and computation-constrained applications.
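The parameter saving can be made concrete with a back-of-the-envelope count. The sketch below is purely illustrative and not the paper's actual configuration: it compares a fully-connected LSTM layer (as used after convolutional feature extraction in the conventional approach) against a 1-D convolutional LSTM layer of the kind promoted here, using assumed, hypothetical layer sizes.

```python
def lstm_params(input_dim: int, hidden_dim: int) -> int:
    """Parameters of a fully-connected LSTM layer.

    Four gates, each with an input weight matrix, a recurrent
    weight matrix, and a bias vector.
    """
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)


def convlstm1d_params(in_channels: int, hidden_channels: int, kernel_size: int) -> int:
    """Parameters of a 1-D convolutional LSTM layer.

    Four gates, each convolving the concatenation of input and
    hidden feature maps with a shared-kernel filter plus a bias.
    """
    return 4 * (kernel_size * (in_channels + hidden_channels) * hidden_channels
                + hidden_channels)


# Assumed sizes for illustration only (not from the paper):
# a flattened 1024-dim feature vector feeding a 1024-unit LSTM,
# versus a ConvLSTM with 64 input and 64 hidden channels, kernel 3.
fc = lstm_params(1024, 1024)
conv = convlstm1d_params(64, 64, 3)
print(f"fully-connected LSTM: {fc:,} parameters")
print(f"1-D ConvLSTM:         {conv:,} parameters")
```

Because the convolutional recurrence shares small kernels across frequency instead of learning dense input and recurrent matrices, its parameter count scales with channel counts and kernel size rather than with the full feature dimension, which is where the reported reduction originates.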