class: center, middle, inverse, title-slide

# How to see in 100 dimensions
## Transforming your data to better understand it
### Max Turgeon
### University of Manitoba
### 19 June 2020

---
class: middle, center, inverse

# Every day tons of data are being collected and analysed.

???

Changes in technology make this faster, easier, and cheaper.

Every time you visit a website, go to the hospital.

More complex data, e.g. pictures, videos, text.

---

# Improving Health Care

<img src="figures/medical_encounter.png" width="625px" height="460px"/>

???

Examples of problems we can try to solve

Registration asks questions, physician assessment in ER, tests being done, admission to internal medicine, diagnoses.

Use this information to improve health care:

- How can we reduce wait times?
- Are we doing too many tests?
- Where should we put more beds?
- How likely is a patient to come back?

---

# Sports Analytics

![](figures/baseball_shift.jpg)

.small[WSJ, March 2014]

???

Collect data on type/speed of pitch, strike/ball, type of hit, where does the ball fall.

Use this information to improve performance:

- What type of pitch works best under different circumstances?
- Who should try to steal a base and when?
- Where should we put defensive players?

---
class: center

# Take-home message

### Data science for .strong[complex data]

???

DS = Emerging discipline

Complex = unstructured, high-volume, etc.

--

### Data science to gain .strong[insight] and .strong[solve problems]

--

### .strong[Transforming data] can help with understanding

???

Dimension reduction solves real-life problems, and we'll see examples below.

---

## Data Science

- Combines **statistics**, **computer science** and **subject matter**

.center[![](figures/venn_diag.png)]

- Look for evidence of relationships between variables

???

DS = Emerging discipline

- Statistics = understanding properties of sampling and data
- Computer Science = extract and manipulate data
- Subject matter = know which questions to ask

---

## Dimension reduction

.center[![](figures/red-arrow.png)]

.pull-left[![](figures/penguin_pre.png)]

???

Dataset on penguins, different species with body measurements

Dimension reduction = Finding "hidden" structures in data

Rotation, reflection, rescaling, and projection onto a lower-dimensional plane

Can help find relationships between 100s of variables

OR

After transformation, relationships may be easier to see

--

.pull-right[![](figures/penguin_post.png)]

---

# Two examples

#### 1. Population structure

- Genetic epidemiology

--

#### 2. Canadian politics and Twitter

- Natural language processing
- *Work by Joshua Hamilton (Joint CS and Statistics)*

---

# Genotype data

.center[![](figures/red-arrow.png)]

.pull-left[![](figures/DNA_nucleotide.jpg)]

???

DNA sequence is made of nucleotides (ATCG)

Technology "measures" or "estimates" what the nucleotide is at genomic locations

Millions or billions of locations, on thousands of people

--

.pull-right[
```
##      Pos1 Pos2 Pos3 Pos4 Pos5 Pos6 Pos7
## Ind1    0    1    0    2    2    2    1
## Ind2    1    1    1    2    2    0    0
## Ind3    1    1    1    0    0    1    2
## Ind4    1    0    1    1    0    0    2
```
]

???

Can we find some hidden relationship in the data?

---

### **Estimating population structure**

.center[![](figures/europe_pca.jpg)]

.small[Novembre *et al.* (2008). "Genes mirror geography within Europe."]

???

DNA is transmitted from parents to offspring

DNA contains information about ancestry

Ancestry is correlated with geography

---

# Why?

#### 1. Both disease and genetic information vary geographically.

--

#### 2. When looking for the genetic source of diseases, this can lead to spurious results.

--

.center[**Possible solution**: Account for population structure.]

---

# Unstructured text data

.center[![](figures/red-arrow.png)]

.pull-left[![](figures/Wordplot.png)]

???

Natural languages are very hard to analyse because of all the implicit rules and context that humans rely on

We can still extract information by looking at which words appear in a given tweet

Bag of words

--

.pull-right[
```
##        people ahead scheer trudeau taxes
## Tweet1      0     0      0       0     1
## Tweet2      0     1      1       0     1
## Tweet3      0     1      1       1     0
## Tweet4      0     0      1       1     0
## Tweet5      0     0      1       1     0
## Tweet6      1     0      1       0     0
```
]

???

For each word, we keep track of whether it appeared in a given tweet

---

### **Estimating semantic structure**

.center[![](figures/PCA Scatter All Issues.png)]

???

Choice of words alone captures information about political stance

---

# Why?

#### Computers are not very good at understanding natural languages

- Unlike humans!

--

#### It can be difficult to analyse large volumes of text data

--

.center[**Possible solution**: Transform text data into numerical values.]

???

With the example above, we can look at whether there are more or fewer differences when looking at specific topics

---

# Conclusion

#### We are surrounded by complex data

- Also: images, videos, networks, etc.

#### Data science provides us with tools to tackle these challenges

#### Dimension reduction is one way to find hidden structure in data

---
class: title-slide-final, middle, center

# Slides can be found at

### [maxturgeon.ca/talks](https://maxturgeon.ca/talks)

# Questions? Go to

### [www.sli.do](https://www.sli.do) (Event #2370)
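
---

# Appendix: from words to coordinates

The two steps used in the text example (a bag-of-words matrix, then dimension reduction) can be sketched in a few lines. This is only an illustration under stated assumptions: the tweets are made up, the helper names `bag_of_words` and `pca_scores` are hypothetical, and principal component analysis via a singular value decomposition stands in for whichever method produced the figures in the talk.

```python
import numpy as np

# Hypothetical toy tweets (NOT data from the talk), already lower-cased.
tweets = [
    "taxes are too high",
    "scheer is ahead on taxes",
    "trudeau and scheer ahead",
    "scheer trudeau debate",
    "people like scheer",
]
vocabulary = ["people", "ahead", "scheer", "trudeau", "taxes"]


def bag_of_words(docs, vocab):
    """Binary tweets-by-words matrix: 1 if the word appears in the tweet."""
    rows = []
    for doc in docs:
        words = set(doc.split())
        rows.append([1 if word in words else 0 for word in vocab])
    return np.array(rows, dtype=float)


def pca_scores(X, n_components=2):
    """Project the rows of X onto the first principal components."""
    # Centre each column so the components describe variation around the mean
    centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ Vt[:n_components].T


X = bag_of_words(tweets, vocabulary)
scores = pca_scores(X, n_components=2)
print(X.shape)       # one row per tweet, one column per word
print(scores.shape)  # one row per tweet, two coordinates each
```

Each tweet ends up as a point in the plane, so tweets that use similar words can be compared on a scatter plot.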