Week 04

Thursday announcements

Project brainstorm due today
Reminder that your project topic does need to be tied loosely to “data for social good.” Sports is not off limits, but should look different from a typical “sports analytics” project. Some ideas:
- Pay equity in sports
- Does participation in sports predict desirable social outcomes? E.g., higher likelihood of graduating, lower likelihood of committing crimes, better long-term health outcomes, etc.
- CTE among football (or other) athletes
- Societal effects of sports betting
- How does access to / participation in various sports differ across socio-economic lines?
- Etc etc.
Choose your partner by class time on Tuesday and email me your choice
Project proposal will be due Saturday Feb 21
For the next two weeks, the lecture recording on Perusall will include short readings from Communicating with Data: The Art of Writing for Data Science
- For next week, there is ~30 minutes of lecture + ~15 pages

Tidyverse style guidelines

Tidyverse style guidelines for line breaks, white space, etc.

Put a space before |> and a line break after it

# Good
iris |>
  group_by(Species) |>
  summarize_if(is.numeric, mean) |>
  ungroup() |>
  gather(measure, value, -Species) |>
  arrange(value)

# Bad
iris |> group_by(Species) |> summarize_if(is.numeric, mean) |>
ungroup() |> gather(measure, value, -Species) |>
arrange(value)
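
One reason the line break goes after |> rather than before it: R treats a line as finished as soon as it forms a complete expression, so a pipe at the start of the next line has nothing to attach to. A minimal sketch in base R, assuming R 4.1 or later for the native pipe (the variable names are illustrative only):

```r
# Pipe at the end of each line: R keeps reading, so this is one expression.
trailing <- c(1, 4, 9) |>
  sqrt() |>
  sum()

# Pipe at the start of a line: `c(1, 4, 9)` is already a complete expression,
# so the `|>` that follows cannot be parsed. Using parse() inside tryCatch()
# lets us check this without stopping the script.
leading_code <- "c(1, 4, 9)\n  |> sqrt()"
leading_fails <- inherits(
  tryCatch(parse(text = leading_code), error = function(e) e),
  "error"
)

trailing       # 6
leading_fails  # TRUE
```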

Each ggplot() layer goes on its own line, with a space before + and every layer after the first indented.

# Good
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  labs(
    x = "Sepal width, in cm",
    y = "Sepal length, in cm",
    title = "Sepal length vs. width of irises"
  ) 

# Bad
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species))+
  geom_point() + labs(x = "Sepal width, in cm", y = "Sepal length, in cm", title = "Sepal length vs. width of irises")
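
The + goes at the end of a line for a similar reason. In base R, a + at the start of a new line is read as a separate expression (unary plus), so it is never added to the line above. With ggplot2 layers, a leading + should produce an error rather than a silently missing layer. A small base-R sketch with made-up values:

```r
# Operator at the end of the line: the expression continues onto line two.
trailing_total <- 1 +
  2

# Operator at the start of the next line: `leading_total <- 1` is already
# complete, so `+ 2` is evaluated on its own and never reaches leading_total.
leading_total <- 1
+ 2

trailing_total  # 3
leading_total   # 1
```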

If a function’s arguments don’t fit on one line, put each argument on its own line, indented, with the closing parenthesis on its own line

# Good
iris |>
  group_by(Species) |>
  summarise(
    Sepal.Length = mean(Sepal.Length),
    Sepal.Width = mean(Sepal.Width),
    n_species = n_distinct(Species)
  )

# Bad
iris |>
  group_by(Species) |>
  summarise(Sepal.Length = mean(Sepal.Length),
            Sepal.Width = mean(Sepal.Width),
            n_species = n_distinct(Species))

Put spaces on each side of most infix operators (e.g., +, /, =, &, <-) and after commas

# Good
iris |>
  mutate(ratio = Sepal.Length / Sepal.Width) |>
  ggplot(aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point()

# Bad
iris|> 
  mutate(ratio=Sepal.Length/Sepal.Width) |> 
  ggplot(aes(x=Sepal.Width,y = Sepal.Length)) +
  geom_point()
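
Spacing is not only cosmetic. A classic case where it changes meaning is <-: without spaces, a comparison against a negative number is parsed as an assignment. A minimal base-R sketch (variable names are illustrative):

```r
x <- 5
y <- 5

# Intended as "is x less than -3?", but R reads `<-` as assignment,
# so this line overwrites x with 3.
x<-3

# With spaces, the comparison is unambiguous.
y_below <- y < -3

x        # 3
y_below  # FALSE
```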

Data Scientist of the Week

Mike Dairyko

Tuesday announcements

Due Thursday: Lab 03, Project Brainstorm post, Annotations

Once again, thanks for great engagement on Annotations

PS, I consider this one of the most important parts of the class!
I’m curating a lot of the questions you’re raising, and we’ll have time to discuss them more in depth in Week 8
From your peers:
- “Although a data scientists job may be mostly done behind a screen, it can be jarring to think about the true affect that their discoveries can have on the average human being. As intertwined as our cultural, political, and economic systems are, I believe that data science is equally as important to consider from a wider point of view.”
- “The smartest technical people aren’t necessarily the ones who are in charge of social change (they have different skillsets and jobs). The problem is we have people making models who do not have an obligation to know about the social impacts.”
Does Villanova have a data ethics course?

Questions?

Will all the quizzes be one week behind after missing the first quiz?

Let’s take a vote!

- Quizzes cover new material from lecture videos & Tuesday AE
- Quizzes cover material from the previous week, so you’ve had more time to practice on the lab

Can we work on the project independently?

Short answer, no.

Longer answer: there are a few reasons for this:

Data science is collaborative in most workplaces
Especially in light of many of the readings, it’s important to have multiple people speak into an analysis. We all have blind spots - collaboration & intentionally seeking feedback make us better!
Less important substantively, but helpful logistically: having fewer groups will allow us to NOT have to stay here until May 11 :)

Who are you rooting for in the Super Bowl?

Application Exercise

The remainder of class will be spent on AE-04.
You can access it from GitHub.
It is due at the end of class today.
To turn it in, you should upload your .html file to Blackboard.
