Introduction: N-back tasks are commonly used in functional neuroimaging studies to identify the neural mechanisms supporting working memory (WM). Despite widespread use, the clinical utility of these tasks is not well specified. This study compared N-back performance during functional magnetic resonance imaging (fMRI) with task data acquired outside of the scanner as a measure of reliability across environments. N-back task validity was examined in relation to performance and rater-based measures used clinically to assess working memory.

Method: Forty-three healthy adults completed verbal and object N-back tasks during fMRI scanning and outside the scanner. Task difficulty was varied parametrically (0-, 1-, and 2-back conditions). Order of N-back task completion was stratified by modality (verbal/object) and environment. Participants completed the Digit Span (DS) and provided self-ratings using the Behavior Rating Inventory of Executive Function (BRIEF-WM).

Results: Mean verbal and object N-back accuracy was above 95% across load conditions; task difficulty was effectively manipulated across load conditions. Performance accuracy did not significantly differ by environment. N-back reaction time was slower during fMRI (F = 6.52, p = .01, ηp² = .13); participants were faster when initially completing tasks outside the scanner (ηp² = .10-.15). Verbal 2-back accuracy was significantly related to DS performance (r = .36, p = .02). N-back performance was not related to BRIEF-WM.

Conclusions: Our results provide evidence for reliability of N-back accuracy during fMRI scanning; however, reliability of reaction time data is affected by order of task presentation. Data regarding construct validity are inconsistent and emphasize the need to consider clinical utility of behavioral measures in the design and interpretation of functional neuroimaging studies.