E supposedly spoken by people in a scene and asked to use the words to construct a sentence that could have been used in the pictured situation), and Figurative Language (participants are asked to tell in their own words what a person meant when saying an expression in a given situation; participants then choose which of four expressions was closest in meaning to the conversational statement). Raw scores for each subtest were calculated according to the procedures in the test manual and were used for all statistical analyses because the age range for standardized scores for this measure extends only to 18 years and a number of the adult participants had chronological ages above that level. All four subtests were combined into a sum-of-subtests raw score for each participant. Of note, raw scores from subtests 2 (Making Inferences) and 4 (Figurative Language) are most closely related to the inference-making constructs examined by the PIT, and were included in subsequent analyses.

Assessment of Reliability–Inter-rater reliability was calculated for the PIT to ensure that all testers objectively scored test responses in the same way. Approximately 10% of tests (n = 17) were randomly sampled and scored by two experienced examiners. The observed agreement between the two raters was nearly perfect (Cohen’s kappa = .99). All of the TLC-E and ToM protocols were rescored by a second tester and any scoring or calculation errors were corrected.

Results

Analytical Approach

Because of the growing backlash against using null hypothesis significance tests (NHSTs) as a vehicle for statistical inference (Anderson 1997; Cumming 2012; Cumming 2014; Kirk 2003; Wagenmakers 2007), we do not report these tests or the p values associated with them. Rather, we report Bayes factors (BFs; see Hoijtink et al. 2008; Jeffreys 1961; Kass & Raftery 1995) to state evidence in favor of or against statistical models, an approach that has been advocated repeatedly (Berger & Berry 1988; Edwards et al. 1963; Gallistel 2009; Kass 1993; Morey et al. 2014; Myung & Pitt 1997; Raftery 1995; Rouder et al. 2009; Wagenmakers 2007). This approach differs from traditional NHSTs because Bayes factors permit a method of model comparison in which models including main effects and interactions are pitted against models that systematically exclude them. Bayesian analysis was therefore chosen because it can simultaneously address our hypotheses and allow evidence to be considered continuously rather than dichotomously.

J Autism Dev Disord. Author manuscript; available in PMC 2016 September 01. Bodner et al.

In the sections that examine PIT outcome measures (overall weighted total scores, physical scores, other-ToM scores, and emotion-ToM scores), we use the general linear model in which main effects and interactions are assessed (Table 2 contains descriptive statistics for all PIT outcome measures).
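To make the logic of comparing models with a Bayes factor concrete, the following is a minimal sketch and not the computation used in this study. It assumes the BIC approximation to the Bayes factor described by Wagenmakers (2007) and compares only the null model with a group-only model. The function names and the scores are hypothetical.

```python
import math


def residuals_from_means(groups):
    """Residuals after fitting one mean per group (one group = intercept-only model)."""
    residuals = []
    for scores in groups:
        mean = sum(scores) / len(scores)
        residuals.extend(score - mean for score in scores)
    return residuals


def bic_gaussian(residuals, n_params):
    """BIC of a Gaussian linear model, up to a constant shared by all models.

    Assumes the residual sum of squares is greater than zero.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + n_params * math.log(n)


def approx_bf10(group_a, group_b):
    """Approximate Bayes factor for the group-only model over the null model."""
    pooled = list(group_a) + list(group_b)
    bic_null = bic_gaussian(residuals_from_means([pooled]), n_params=1)
    bic_group = bic_gaussian(residuals_from_means([group_a, group_b]), n_params=2)
    # BIC approximation: BF10 is roughly exp((BIC_null - BIC_group) / 2)
    return math.exp((bic_null - bic_group) / 2.0)


# Hypothetical raw scores for two diagnostic groups (illustration only)
scores_group_a = [10, 12, 11, 13, 12, 11]
scores_group_b = [15, 17, 16, 18, 17, 16]
print(approx_bf10(scores_group_a, scores_group_b))
```

Values above 1 favor the group model and values below 1 favor the null model, so the evidence is read on a continuous scale rather than against a fixed cutoff.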
Nineteen models were assessed for each PIT outcome: the null model in which there are no effects; a model including group diagnosis only; a model including Verbal IQ only; a model including age only; three additive models in which two of the three main effects only are included; an additive model in which only the three main effects are included; ten models including all possible combinations of the selective presence or absence of the 2-way interactions (with the constraint that the terms that comprise an interaction term also appear a.
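The model set described above can be checked by enumeration. The sketch below is an illustration under stated assumptions, not the authors' code: the predictor labels are placeholders, and only main effects and 2-way interactions are generated, with an interaction allowed only when both of its component terms are in the model.

```python
from itertools import combinations

# Placeholder labels for the three predictors named in the text
MAIN_EFFECTS = ("group", "verbal_iq", "age")


def hierarchical_models(main_effects):
    """List (main effects, 2-way interactions) pairs that respect the constraint."""
    all_pairs = list(combinations(main_effects, 2))
    models = []
    for n_main in range(len(main_effects) + 1):
        for mains in combinations(main_effects, n_main):
            # An interaction is allowed only if both of its terms are present
            allowed = [p for p in all_pairs if p[0] in mains and p[1] in mains]
            for n_inter in range(len(allowed) + 1):
                for interactions in combinations(allowed, n_inter):
                    models.append((mains, interactions))
    return models


models = hierarchical_models(MAIN_EFFECTS)
without_interactions = [m for m in models if not m[1]]
with_interactions = [m for m in models if m[1]]

print(len(without_interactions), len(with_interactions), len(models))
```

The enumeration yields 8 models without interactions (the null model, three single-predictor models, three two-predictor additive models, and the three-predictor additive model) and 10 models with at least one 2-way interaction, which agrees with the counts given in the text. That accounts for 18 of the nineteen models; the passage breaks off before the remaining model is described.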