class: center, middle, inverse, title-slide

# How to see in 100 dimensions
## Transforming your data to better understand it
### Max Turgeon
### University of Manitoba
### 19 June 2020

---
class: middle, center, inverse

# Every day tons of data are being collected and analysed.

???

Changes in technology make this faster, easier, and cheaper.

Every time you visit a website, go to the hospital.

More complex data, e.g. pictures, videos, text.

---

# Improving Health Care

<img src="figures/medical_encounter.png" width="625px" height="460px"/>

???

Examples of problems we can try to solve

Registration asks questions, physician assessment in ER, tests being done, admission to internal medicine, diagnoses.

Use this information to improve health care:

- How can we reduce wait times?
- Are we doing too many tests?
- Where should we put more beds?
- How likely is a patient to come back?

---

# Sports Analytics

data:image/s3,"s3://crabby-images/8d5e8/8d5e86c93ae88847bfaa45135da2515339837e71" alt=""

.small[WSJ, March 2014]

???

Collect data on type/speed of pitch, strike/ball, type of hit, where does the ball fall.

Use this information to improve performance:

- What type of pitch works best under different circumstances?
- Who should try to steal a base and when?
- Where should we put defensive players?

---
class: center

# Take-home message

### Data science for .strong[complex data]

???

DS = Emerging discipline

Complex = unstructured, high-volume, etc.

--

### Data science to gain .strong[insight] and .strong[solve problems]

--

### .strong[Transforming data] can help with understanding

???

Dimension reduction solves real-life problems, and we'll see examples below.

---

## Data Science

- Combines **statistics**, **computer science** and **subject matter**

.center[data:image/s3,"s3://crabby-images/2b3cc/2b3cc96e3024f8663c929c736de0b8bcd92effad" alt=""]

- Look for evidence of relationships between variables

???
DS = Emerging discipline

- Statistics = understanding properties of sampling and data
- Computer Science = extract and manipulate data
- Subject matter = know which questions to ask

---

## Dimension reduction

.center[data:image/s3,"s3://crabby-images/418e1/418e1d81ad2be738e580e48b0df0bd3144da0763" alt=""]

.pull-left[data:image/s3,"s3://crabby-images/73fd8/73fd80de8e954e69977ee034a3abf742ccb0f3f4" alt=""]

???

Dataset on penguins, different species with body measurements

Dimension reduction = Finding "hidden" structures in data

Rotation, reflection, rescaling, and projection onto a lower-dimensional plane

Can help find relationships between 100s of variables

OR

After transformation, relationships may be easier to see

--

.pull-right[data:image/s3,"s3://crabby-images/afbc7/afbc70b4c916b44e7487420a196d9c5b9dc2f244" alt=""]

---

# Two examples

#### 1. Population structure

- Genetic epidemiology

--

#### 2. Canadian politics and Twitter

- Natural language processing
- *Work by Joshua Hamilton (Joint CS and Statistics)*

---

# Genotype data

.center[data:image/s3,"s3://crabby-images/418e1/418e1d81ad2be738e580e48b0df0bd3144da0763" alt=""]

.pull-left[data:image/s3,"s3://crabby-images/f9eb6/f9eb6b460f396119b51bddc3fe7eb927e39c2f2b" alt=""]

???

DNA sequence is made of nucleotides (ATCG)

Technology "measures" or "estimates" what the nucleotide is at genomic locations

Millions or billions of locations, on thousands of people

--

.pull-right[
```
##      Pos1 Pos2 Pos3 Pos4 Pos5 Pos6 Pos7
## Ind1    0    1    0    2    2    2    1
## Ind2    1    1    1    2    2    0    0
## Ind3    1    1    1    0    0    1    2
## Ind4    1    0    1    1    0    0    2
```
]

???

Can we find some hidden relationship in the data?

---

### **Estimating population structure**

.center[data:image/s3,"s3://crabby-images/175cc/175cc57246ddf8a735bf45c2260f262fe4a3ade4" alt=""]

.small[Novembre *et al.* (2008). "Genes mirror geography within Europe."]

???

DNA is transmitted from parents to offspring

DNA contains information about ancestry

Ancestry is correlated with geography

---

# Why?

#### 1. Both disease and genetic information vary geographically.

--

#### 2. When looking for the genetic source of diseases, this can lead to spurious results.

--

.center[**Possible solution**: Account for population structure.]

---

# Unstructured text data

.center[data:image/s3,"s3://crabby-images/418e1/418e1d81ad2be738e580e48b0df0bd3144da0763" alt=""]

.pull-left[data:image/s3,"s3://crabby-images/e82dc/e82dc040c23d45c758a9ffdb84cfe1564d8164e4" alt=""]

???

Natural languages are very hard to analyse because of all the implicit rules and context that humans have

We can still extract information by looking at which words appear in a given tweet

Bag of words

--

.pull-right[
```
##        people ahead scheer trudeau taxes
## Tweet1      0     0      0       0     1
## Tweet2      0     1      1       0     1
## Tweet3      0     1      1       1     0
## Tweet4      0     0      1       1     0
## Tweet5      0     0      1       1     0
## Tweet6      1     0      1       0     0
```
]

???

For each word, we keep track of whether it appeared in a given tweet

---

### **Estimating semantic structure**

.center[data:image/s3,"s3://crabby-images/46e94/46e94755373bd6a3223c34e103db1d6952f2ac45" alt=""]

???

Choice of words alone captures information about political stance

---

# Why?

#### Computers are not very good at understanding natural languages

- Unlike humans!

--

#### It can be difficult to analyse large volumes of text data

--

.center[**Possible solution**: Transform text data into numerical values.]

???

With the example above, we can look at whether there are more or fewer differences when looking at specific topics

---

# Conclusion

#### We are surrounded by complex data

- Also: images, videos, networks, etc.

#### Data science provides us with tools to tackle these challenges

#### Dimension reduction is one way to find hidden structure in data

---
class: title-slide-final, middle, center

# Slides can be found at

### [maxturgeon.ca/talks](https://maxturgeon.ca/talks)

# Questions? Go to

### [www.sli.do](https://www.sli.do) (Event #2370)
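
---

# Backup: bag of words in code

The "Unstructured text data" slide records, for each word, whether it appeared in a given tweet. Below is a minimal Python sketch of that idea, not the code used for the analysis in this talk. The tweets are made up for illustration, the vocabulary is copied from the column names of the table on that slide, and `tokenize` and `bag_of_words` are hypothetical helper names.

```python
import re

# Hypothetical example tweets; these are NOT the tweets analysed in the talk.
tweets = [
    "Taxes are too high",
    "Scheer is ahead on taxes",
    "Trudeau and Scheer: who is ahead?",
]

# Vocabulary copied from the column names of the table on the
# "Unstructured text data" slide.
vocabulary = ["people", "ahead", "scheer", "trudeau", "taxes"]


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(texts, vocab):
    """Return one 0/1 row per text: 1 if the word appears in it, else 0."""
    rows = []
    for text in texts:
        words = set(tokenize(text))
        rows.append([1 if word in words else 0 for word in vocab])
    return rows


matrix = bag_of_words(tweets, vocabulary)
for i, row in enumerate(matrix, start=1):
    print(f"Tweet{i}", row)
```

Each row is a numerical description of one tweet, so the whole collection becomes a table of numbers that dimension reduction methods can work with. This sketch only records presence or absence; counting how often each word occurs would be a different design choice.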