Face parsing aims to assign pixel-wise semantic labels to different facial components (e.g., hair, brows, and lips) in given face images. However, directly predicting pixel-level labels for each facial component over the whole face image yields limited accuracy, especially for tiny facial components. To address this problem, some recent works propose to first crop small patches from the whole face image and then predict a mask for each facial component. However, such a crop-and-segment strategy consists of two independent stages that cannot be jointly optimized. Moreover, context cues, a valuable source of information for parsing the highly structured facial components, are not fully exploited by existing works. To address these issues, we propose a component-level refinement network (CLRNet) for precisely segmenting each facial component. Specifically, we introduce an attention mechanism that bridges the two independent stages and forms an end-to-end trainable pipeline for face parsing. Furthermore, we incorporate global context information into the refinement process for each cropped facial component patch, providing informative cues for accurate parsing. Extensive experiments are carried out on two benchmark datasets, LFW-PL and HELEN. The results demonstrate the superiority of the proposed CLRNet over other state-of-the-art methods, especially for tiny facial components.
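The two ideas in the abstract, a differentiable attention-based "soft crop" that replaces hard cropping and the injection of global context into per-component refinement, can be illustrated with a minimal NumPy sketch. All function names, shapes, and the specific pooling choice below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax2d(logits):
    # Spatial softmax over an (H, W) attention logit map.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def soft_crop(features, attn_logits):
    # features: (C, H, W) feature map; attn_logits: (H, W).
    # A hard crop is non-differentiable; weighting the features by a
    # spatial attention map instead keeps gradients flowing end to end
    # (illustrative stand-in for the attention bridging described above).
    attn = softmax2d(attn_logits)          # (H, W), sums to 1
    return features * attn[None, :, :]      # (C, H, W), component-focused

def refine_with_context(local_feat, features):
    # Assumed form of "global context": spatially average-pooled features,
    # broadcast back and concatenated with the attended component features.
    ctx = features.mean(axis=(1, 2), keepdims=True)          # (C, 1, 1)
    ctx_map = np.broadcast_to(ctx, local_feat.shape)          # (C, H, W)
    return np.concatenate([local_feat, ctx_map], axis=0)      # (2C, H, W)

# Toy usage: one 8-channel feature map, attention peaked on one region.
feats = np.random.randn(8, 16, 16)
logits = np.zeros((16, 16)); logits[4:8, 4:8] = 5.0
component = soft_crop(feats, logits)
refined_input = refine_with_context(component, feats)
print(refined_input.shape)  # (16, 16, 16): local + context channels
```

Because the attention map is produced by a softmax rather than a hard binary crop, the localization stage and the segmentation stage share gradients, which is what makes the pipeline trainable as a single network.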