Recovering 3D voxelized shapes with fine details from single-view 2D images is an extremely challenging and ill-conditioned problem. Most of the existing methods learn the 3D reconstruction process by encoding the 3D shapes and the 2D images into the same low-dimensional latent vector, which lacks the capacity to capture detailed features in the surface of the 3D object shapes. To address this issue, we propose to explore rich intermediate representation for 3D shape reconstruction by using a newly designed network architecture. We first use a two-steam network to infer the depth map and the topology-specific mean shape from the given 2D image, which forms the intermediate representation prediction branch. The intermediate representations capture the global spatial structure and the visible surface geometric structure, which are important for reconstructing high-quality 3D shapes. Based on the obtained intermediate representation, a novel shape transformation network is then proposed to reconstruct the fine details of the whole 3D object shapes. The experimental results on the challenging ShapeNet and Pix3D datasets show that our approach outperforms the existing state-of-the-art methods.