Advanced tutorials ================== Bootstrapping uncertainties/confidence limits --------------------------------------------- To create confidence limits on binned cumulative, statistical measures and parameters, pyWitness uses the bootstrap method. This method takes :math:`N` random participants from the original data *with replacement*. pyWitness can then proceed to compute any quantity (ROC, CAC, pAUC, fit parameters). This is repeated :math:`M` times and the distribution of the computed quantity used to calculate a confidence interval with a user definable range. .. tabs:: .. code-tab:: python import pyWitness dr = pyWitness.DataRaw("test1.csv") dp = dr.process() dp.calculateConfidenceBootstrap(nBootstraps=200, cl=95) dp.printRates() .. code-tab:: R pyw <- import("pyWitness") dr <- pyw$DataRaw("./test1.csv") dr$collapseContinuousData(column = "confidence",bins = c(-1,60,80,100),labels=py_none()) dp <- dr$process() dp$calculateConfidenceBootstrap(nBootstraps=as.integer(200), cl=95) dp$printRates() After calling ``calculateConfidenceBootstrap`` the rates table is populated with the 95% confidence limit data .. code-block :: console confidence confidence 1 2 3 variable type cac central 0.861702 0.955614 0.969432 high 0.898148 0.969998 0.982672 low 0.819782 0.934552 0.946858 confidence central 45.873016 74.866469 95.630252 high 47.962794 75.440947 96.258406 low 44.128214 74.301501 94.949277 dprime central 1.975221 1.940776 1.585873 high 2.120760 2.101158 1.834337 low 1.815447 1.753813 1.328038 rf 0.315436 0.428412 0.256152 targetAbsent fillerId 0.284424 0.108352 0.031603 fillerId_high 0.321175 0.134259 0.048269 fillerId_low 0.240872 0.082314 0.016915 rejectId 0.715576 0.521445 0.241535 rejectId_high 0.752949 0.560668 0.277414 rejectId_low 0.672739 0.472804 0.202311 suspectId 0.047404 0.018059 0.005267 suspectId_high 0.053529 0.022377 0.008045 suspectId_low 0.040145 0.013719 0.002819 targetPresent fillerId 0.093960 0.046980 0.013423 fillerId_high 0.118805 0.067776 0.024778 fillerId_low 0.067489 0.025640 0.002278 rejectId 0.286353 0.176734 0.082774 rejectId_high 0.329399 0.211019 0.106384 rejectId_low 0.238167 0.140321 0.055782 suspectId 0.619687 0.438479 0.165548 suspectId_high 0.668216 0.493503 0.196507 suspectId_low 0.567298 0.392393 0.128108 zL central -1.670562 -2.095603 -2.557781 high -1.611559 -2.006968 -2.406877 low -1.749027 -2.205230 -2.768544 zT central 0.304658 -0.154827 -0.971908 high 0.434992 -0.016287 -0.854162 low 0.169499 -0.273088 -1.135387 If a plot function (``plotROC``, ``plotCAC``) is callled after calling ``calculateConfidenceBootstrap`` then the confidence interval is drawn as error bars, as shown in the ROC plot and CAC plot, respectively, below. .. tabs:: .. code-tab:: python import pyWitness dr = pyWitness.DataRaw("test1.csv") dp = dr.process() dp.calculateConfidenceBootstrap(nBootstraps=200, cl=95) dp.plotROC() .. code-tab:: R pyw <- import("pyWitness") dr <- pyw$DataRaw("./test1.csv") dr$collapseContinuousData(column = "confidence",bins = c(-1,60,80,100),labels=py_none()) dp <- dr$process() dp$calculateConfidenceBootstrap(nBootstraps=as.integer(200), cl=95) dp$plotROC() .. figure:: images/test1ROCbinErr.png :alt: ROC for test1.csv with error bars .. tabs:: .. code-tab:: python import pyWitness dr = pyWitness.DataRaw("test1.csv") dp = dr.process() dp.calculateConfidenceBootstrap(nBootstraps=200, cl=95) dp.plotCAC() .. code-tab:: R pyw <- import("pyWitness") dr <- pyw$DataRaw("./test1.csv") dr$collapseContinuousData(column = "confidence",bins = c(-1,60,80,100),labels=py_none()) dp <- dr$process() dp$calculateConfidenceBootstrap(nBootstraps=as.integer(200), cl=95) dp$plotCAC() .. figure:: images/test1CACbinErr.png :alt: CAC for test1.csv with error bars Loading raw data excel format ----------------------------- If the file is in ``excel`` format you will need to specify which sheet the raw data is stored in .. tabs:: .. code-tab:: python import pyWitness dr = pyWitness.DataRaw("test2.xlsx",excelSheet = "raw data") .. code-tab:: R pyw <- import("pyWitness") dr <- pyw$DataRaw("./test2.xlsx",excelSheet = "raw data") Transforming data into common format ------------------------------------ The raw experimental data does not have to be in the internal format used by pyWitness. As the data is loaded is it possible to replace the name of the data columns and the values stored. .. tabs:: .. code-tab:: python import pyWitness dr = pyWitness.DataRaw("test2.csv", dataMapping = {"lineupSize":"lineup_size", "targetLineup":"culprit_present", "targetPresent":"present", "targetAbsent":"absent", "responseType":"id_type", "suspectId":"suspect", "fillerId":"filler", "rejectId":"reject", "confidence":"conf_level"})) .. code-tab:: R pyw <- import("pyWitness") dr <- pyw$DataRaw("./test2.csv", dataMapping = list("lineupSize"="lineup_size", "targetLineup"="culprit_present", "targetPresent"="present", "targetAbsent"="absent", "responseType"="id_type", "suspectId"="suspect", "fillerId"="filler", "rejectId"="reject", "confidence"="conf_level")) Processing data for two conditions -------------------------------------- A single data file might have different experimental condtions. Imagine your data file has a column labelled ``Condition`` and the values for each participant is either ``Control`` or ``Verbal``. To proccess only the ``Control`` participants the following options are required for DataRaw.process() .. tabs:: .. code-tab:: python :linenos: :emphasize-lines: 4 import pyWitness dr = pyWitness.DataRaw("test2.csv") dr.cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl = dr.process(column="group", condition="Control") .. code-tab:: R :linenos: :emphasize-lines: 4 pyw <- import("pyWitness") dr <- pyw$DataRaw("./test2.csv") dr$cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl = dr$process(column="group", condition="Control") If you have a file with multiple conditions it is straightforward to make multiple ``DataProcessed`` for each condition, as in the following .. tabs:: .. code-tab:: python :linenos: :emphasize-lines: 5 import pyWitness dr = pyWitness.DataRaw("test2.csv") dr.cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl = dr.process(column="group", condition="Control") dpVerbal = dr.process(column="group", condition="Verbal") .. code-tab:: R :linenos: :emphasize-lines: 5 pyw <- import("pyWitness") dr <- pyw$DataRaw("./test2.csv") dr$cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl <- dr$process(column="group", condition="Control") dpVerbal <- dr$process(column="group", condition="Verbal") Statistical (pAUC) comparision between two conditions ----------------------------------------------------- One way to compare pAUC values of two conditions is use the following code on the test2 data. You can check out the script we wrote called pAUCexample.py. .. tabs:: .. code-tab:: python import pyWitness dr = pyWitness.DataRaw("test2.csv") dr.cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl = dr.process(column="group", condition="Control") dpVerbal = dr.process(column="group", condition="Verbal") .. code-tab:: R pyw <- import("pyWitness") dr <- pyw$DataRaw("./test2.csv") dr$cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl <- dr$process(column="group", condition="Control") dpVerbal <- dr$process(column="group", condition="Verbal") To find the lowest false ID rate from both conditions, .. tabs:: .. code-tab:: python :linenos: :emphasize-lines: 6 import pyWitness dr = pyWitness.DataRaw("test2.csv") dr.cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl = dr.process(column="group", condition="Control") dpVerbal = dr.process(column="group", condition="Verbal") minRate = min(dpControl.liberalTargetAbsentSuspectId,dpVerbal.liberalTargetAbsentSuspectId) .. code-tab:: R :linenos: :emphasize-lines: 6 pyw <- import("pyWitness") dr <- pyw$DataRaw("./test2.csv") dr$cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl <- dr$process(column="group", condition="Control") dpVerbal <- dr$process(column="group", condition="Verbal") minRate <- min(dpControl$liberalTargetAbsentSuspectId,dpVerbal$liberalTargetAbsentSuspectId) You have to process the data again, with this ``minRate`` .. tabs:: .. code-tab:: python :linenos: :emphasize-lines: 7-11 import pyWitness dr = pyWitness.DataRaw("test2.csv") dr.cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl = dr.process(column="group", condition="Control") dpVerbal = dr.process(column="group", condition="Verbal") minRate = min(dpControl.liberalTargetAbsentSuspectId,dpVerbal.liberalTargetAbsentSuspectId) dpControl = dr.process("group","Control",pAUCLiberal=minRate) dpControl.calculateConfidenceBootstrap(nBootstraps=200) dpVerbal = dr.process("group","Verbal",pAUCLiberal=minRate) dpVerbal.calculateConfidenceBootstrap(nBootstraps=200) dpControl.comparePAUC(dpVerbal) .. code-tab:: R :linenos: :emphasize-lines: 7-11 pyw <- import("pyWitness") dr <- pyw$DataRaw("./test2.csv") dr$cutData(column="previouslyViewedVideo",value=1,option="keep") dpControl = dr$process(column="group", condition="Control") dpVerbal = dr$process(column="group", condition="Verbal") minRate = min(dpControl$liberalTargetAbsentSuspectId,dpVerbal$liberalTargetAbsentSuspectId) dpControl = dr$process("group","Control",pAUCLiberal=minRate) dpControl$calculateConfidenceBootstrap(nBootstraps=as.integer(200)) dpVerbal = dr$process("group","Verbal",pAUCLiberal=minRate) dpVerbal$calculateConfidenceBootstrap(nBootstraps=as.integer(200)) dpControl$comparePAUC(dpVerbal) To plot the ROC curves, use ``DataProcess.plotROC`` .. tabs:: .. code-tab:: python dpControl.plotROC(label = "Control data", relativeFrequencyScale=400) dpVerbal.plotROC(label = "Verbal data", relativeFrequencyScale=400) .. code-tab:: R dpControl$plotROC(label = "Control data", relativeFrequencyScale=400) dpVerbal$plotROC(label = "Verbal data", relativeFrequencyScale=400) .. note:: The symbol size is the relative frequency and can be changed by setting ``dp.plotROC(relativeFrequencyScale = 400)`` And your plot will look like this one: .. figure:: images/test2ROCs.png The shaded regions are the pAUCs that were compared. You can see that they both used the same minimum false ID rate. The error bars are 95% confidence intervals. The dashed black line represents chance performance. .. note:: The uncertainities can be changed by setting them to .68, for example ``dpControl.calculateConfidenceBootstrap(nBootstraps=200,cl=68)`` and ``dpVerbal.calculateConfidenceBootstrap(nBootstraps=200,cl=68)`` Loading processed data ---------------------- You might already have processed the raw data, or you only have a table of data. It is possible to load a file to perform model fits etc. The processed data need to be in the following CSV format. This is basically the same format as the pivot table stored in ``DataProcessed``. .. list-table:: Processed data columns and allowed values :widths: 35 15 15 15 15 15 15 15 15 15 15 15 :header-rows: 0 * - confidence - 0 - 10 - 20 - 30 - 40 - 50 - 60 - 70 - 80 - 90 - 100 * - targetAbsent fillerId - 2 - 7 - 5 - 8 - 10 - 20 - 26 - 20 - 14 - 8 - 6 * - targetAbsent rejectId - 2 - 5 - 5 - 6 - 9 - 24 - 35 - 56 - 68 - 43 - 64 * - targetPresent fillerId - 0 - 0 - 2 - 3 - 5 - 6 - 5 - 10 - 5 - 4 - 2 * - targetPresent rejectId - 3 - 1 - 0 - 6 - 10 - 20 - 9 - 19 - 23 - 16 - 21 * - targetPresent suspectId - 2 - 1 - 4 - 4 - 10 - 18 - 43 - 68 - 54 - 33 - 41 .. note :: If the ``targetAbsent suspectId`` row is not present it is estimated by ``(targetAbsent fillerId)/lineupSize`` The data are stored in ``data/tutorials/test1_processed.csv`` .. tabs:: .. code-tab:: python :linenos: :emphasize-lines: 2 import pyWitness dp = pyWitness.DataProcessed("test1_processed.csv", lineupSize = 6) .. code-tab:: R :linenos: :emphasize-lines: 2 pyw <- import("pyWitness") dp = pyw$DataProcessed("./test1_processed.csv", lineupSize = 6) Using instances of raw data, processed data and model fits ---------------------------------------------------------- Using an object orientated approach allows multiple instances (objects) to be created and manipulated. This allows many different data file variations on the processed data and model fits to be manipulated simultanuously in a single Python session. A good example is collapsing data, one might want to check the effect of rebinning the data. In the following example, the ``test1.csv`` is processed twice, once with the original binning (``dr1`` and ``dp1``) and one with 3 confidence bins (``dr2`` and ``dp2``) .. tabs:: .. code-tab:: python import pyWitness dr1 = pyWitness.DataRaw("test1.csv") dr2 = pyWitness.DataRaw("test1.csv") dr2.collapseContinuousData(column = "confidence",bins = [-1,60,80,100],labels=None) dp1 = dr1.process() dp2 = dr2.process() dp1.plotCAC() dp2.plotCAC() .. code-tab:: R pyw <- import("pyWitness") dr1 <- pyw$DataRaw("./test1.csv") dr2 <- pyw$DataRaw("./test1.csv") dr2$collapseContinuousData(column = "confidence",bins = c(-1,60,80,100),labels=py_none()) dp1 <- dr1$process() dp2 <- dr2$process() dp1$plotCAC() dp2$plotCAC() Overlaying plots ---------------- In general, each ``plotXXX`` function does not create a canvas, so to overlay plots the functions need to be called sequentially in order. To make a legend the plots need to be given a label. So this example is the same as the .. tabs:: .. code-tab:: python :linenos: :emphasize-lines: 10-14 import pyWitness dr1 = pyWitness.DataRaw("test1.csv") dr2 = pyWitness.DataRaw("test1.csv") dr2.collapseContinuousData(column = "confidence",bins = [-1,60,80,100],labels=None) dp1 = dr1.process() dp2 = dr2.process() dp1.plotCAC(label = "11 bins") dp2.plotCAC(label = "3 bins") import matplotlib.pyplot as _plt _plt.legend() .. code-tab:: R pyw <- import("pyWitness") dr1 <- pyw$DataRaw("./test1.csv") dr2 <- pyw$DataRaw("./test1.csv") dr2$collapseContinuousData(column = "confidence",bins = c(-1,60,80,100),labels=py_none()) dp1 <- dr1$process() dp2 <- dr2$process() dp1$plotCAC(label="11 bins") dp2$plotCAC(label = "3 bins") mpl$pyplot$legend() invisible(mpl$pyplot$ylim(0.50,1.00)) After overlaying plots it maybe important to change the plot axis ranges this can be done with ``xlim`` and ``ylim`` .. tabs:: .. code-tab:: python xlim(0,100) ylim(0.50,1.00) .. code-tab:: R invisible(mpl$pyplot$xlim(0,100)) invisible(mpl$pyplot$ylim(0.50,1.0)) .. figure:: images/test1Overlay.png :alt: CAC for test1.csv with two different binning Generating data from signal detection model ------------------------------------------- Raw and processed data can be generated simply from a signal detection model. .. tabs:: .. code-tab:: python :linenos: :emphasize-lines: 8 import pyWitness dr = pyWitness.DataRaw("test1.csv") dr.collapseContinuousData(column = "confidence",bins = [-1,60,80,100],labels=None) dp = dr.process() mf = pyWitness.ModelFitIndependentObservation(dp, debug=True) mf.setEqualVariance() mf.fit() dr1 = mf.generateRawData(nGenParticipants=10000) .. code-tab:: R :linenos: :emphasize-lines: 8 pyw <- import("pyWitness") dr <- pyw$DataRaw("./test1.csv") dr$collapseContinuousData(column = "confidence",bins = c(-1,60,80,100),labels=py_none()) dp <- dr1$process() mf <- pyw$ModelFitIndependentObservation(dp, debug=TRUE) mf$setEqualVariance() mf$fit() dr1 = mf$generateRawData(nGenParticipants=10000) ``dr1`` is a ``DataRaw`` object and is simulated data for 10,000 participants. ``dr1`` can be used for any pyWitness analysis so ROC, CAC, pAUC, etc. The raw data can also be written to disk to either preserve and/or share with colleagues. .. tabs:: .. code-tab:: python :linenos: :emphasize-lines: 1-2 dr1.writeCsv("fileName.csv") dr1.writeExcel("fileName.xlsx") .. code-tab:: R :linenos: :emphasize-lines: 1-2 dr1$writeCsv("./fileName.csv") dr1$writeExcel("./fileName.xlsx") Having performed a fit on ``dr`` and generated ``dr1`` a synthetic dataset .. tabs:: .. code-tab:: python # Need to process the synthetic data dp1 = dr1.process() # calculate uncertainties using bootstrap dp.calculateConfidenceBootstrap() dp1.calculateConfidenceBootstrap() # plot ROCs dp.plotROC(label="Experimental data") dp1.plotROC(label="Simulated data") mf.plotROC(label="Model fit") import matplotlib.pyplot as _plt _plt.legend() .. code-tab:: R # Need to process the synthetic data dp1 <- dr1$process() # calculate uncertainties using bootstrap dp$calculateConfidenceBootstrap() dp1$calculateConfidenceBootstrap() # plot ROCs dp$plotROC(label="Experimental data") dp1$plotROC(label="Simulated data") mf$plotROC(label="Model fit") mpl$pyplot$legend() mpl$pyplot$show() .. figure:: images/test1GenEx.png :alt: Generated data comparision example Power analysis -------------- By having the ability to generate data from a model it is possible to vary the number of generated participants. This is not too dissimilar to bootstrapping. Instead of generating new samples (with replacement) from the data, new samples with variable numbers of participants is possible. For each sample all the analysis can be performed and dependence on sample size can be explored. .. tabs:: .. code-tab:: python for nGen in numpy.linspace(500, 5000, 9+1) : drSimulated = mf.generateRawData(nGenParticipants = nGen) dpSimulated = drSimulated.process() dpSimulated.calculateConfidenceBootstrap(nBootstraps=2000) print(nGen, dpSimulated.liberalTargetAbsentSuspectId,dpSimulated.pAUC, dpSimulated.pAUC_low, dpSimulated.pAUC_high) .. code-tab:: R for (nGen in list(500,1000,1500)) { drSimulated <- mf$generateRawData(nGenParticipants = nGen) dpSimulated <- drSimulated$process() dpSimulated$calculateConfidenceBootstrap(nBootstraps=as.integer(2000)) print(nGen, dpSimulated$liberalTargetAbsentSuspectId,dpSimulated$pAUC, dpSimulated$pAUC_low, dpSimulated$pAUC_high) }