
# Boston College Climate Prediction Data Analysis Project 2

Answer the short-answer questions from the PDF in a separate Word document. Each question is marked with a label such as "Question 1.1 project2".
December 10, 2019
In [148]: # Initialize OK
          from client.api.notebook import Notebook
          ok = Notebook('project2.ok')
=====================================================================
Assignment: project2
OK, version v1.14.15
=====================================================================
1.1 Project 2: Climate Prediction
Due date: December 11, 5PM.
In [149]: # Don't change this cell; just run it.
          import numpy as np
          from datascience import *
          %matplotlib inline
          import matplotlib.pyplot as plots
          plots.style.use('fivethirtyeight')
          from client.api.notebook import Notebook
Run the cell below to make sure that you create a backup of your project on okpy.org.
In [150]: _ = ok.submit()
Saving notebook… Saved ‘project2.ipynb’.
Submit… 100% complete
Submission successful for user: chenchenfeng@ucsb.edu
URL: https://okpy.org/ucsb/int5/fa19/project2/submissions/nRWPmY
1.2 Predicting Temperatures
In this exercise, we will try to predict the weather in California using the prediction method discussed in section 8.1 of the textbook. Much of the code is provided for you; you will be asked to
understand and run the code and interpret the results.
The US National Oceanic and Atmospheric Administration (NOAA) operates thousands of climate observation stations (mostly in the US) that collect information about local climate. Among
other things, each station records the highest and lowest observed temperature each day. These
data, called “Quality Controlled Local Climatological Data,” are publicly available here and described here.
temperatures.csv contains an excerpt of that dataset. Each row represents a temperature
reading in Fahrenheit from one station on one day. (The temperature is actually the highest temperature observed at that station on that day.) All the readings are from 2015 and from California
stations.
Here is a scatter plot:
In [60]: temperatures.scatter("Date", "Temperature")
         _ = plots.xticks(np.arange(0, max(temperatures.column('Date')), 100), rotation=65)
Each entry in the column “Date” is a number in MMDD format, meaning that the last two
digits denote the day of the month, and the first 1 or 2 digits denote the month. Thus, December
31st (12/31) is represented as 1231 and January 31st (1/31) is 131.
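To make the MMDD encoding concrete, the short sketch below lists the encoded values on either side of a month boundary. The helper name `encode_mmdd` is hypothetical and is not part of the project code.

```python
# Hypothetical helper (not part of the project): encode a month and day as MMDD.
def encode_mmdd(month, day):
    return month * 100 + day

# The last three days of January followed by the first three days of February.
around_boundary = [encode_mmdd(1, d) for d in (29, 30, 31)] + \
                  [encode_mmdd(2, d) for d in (1, 2, 3)]
print(around_boundary)  # [129, 130, 131, 201, 202, 203]

# Consecutive calendar days are not consecutive numbers:
# no valid date maps to any value from 132 through 200.
gap = around_boundary[3] - around_boundary[2]
print(gap)  # 70
```

Keeping this jump in mind may help when reading the scatter plot of "Date" against "Temperature".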
Question 1.1 Why do the data form vertical bands with gaps?
We will convert each date to the number of days since the start of the year.
Question 1.2 Implement the get_day_in_month function, as described below. The result
should be an integer. Hint: Use the remainder operator.
In [10]: def get_month(date):
             """The month in the year for a given date.

             >>> get_month(315)
             3
             """
             return int(date / 100) # Divide by 100 and round down to the nearest integer

         def get_day_in_month(date):
             """The day in the month for a given date.

             >>> get_day_in_month(315)
             15
             """
             return int(date % 100)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests
---------------------------------------------------------------------
Test summary
Passed: 2
Failed: 0
[ooooooooook] 100.0% passed
Next, we’ll compute the day of the year for each temperature reading, which is the number of
days from January 1 until the date of the reading. Note that we are not dealing with a leap year (a
year with 366 days).
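As a plain-Python cross-check of this conversion, the sketch below turns an MMDD number directly into a day of the year. The name `day_of_year` is an assumption made for illustration; the project itself does the same job with tables.

```python
# Days in each month of 2015, which is not a leap year.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def day_of_year(date):
    """Convert an MMDD number (e.g. 811) to a day of the year (e.g. 223)."""
    month, day = divmod(date, 100)
    # Add up the lengths of all months that come before this one.
    return sum(DAYS_IN_MONTH[:month - 1]) + day

print(day_of_year(101))   # 1
print(day_of_year(811))   # 223
print(day_of_year(1231))  # 365
```

January 1 counts as day 1 here, so December 31 is day 365.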
In [103]: # You don't need to change this cell, but you are strongly encouraged
          # to read all of the code and understand it.
          days_in_month = make_array(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

          # A table with one row for each month. For each month, we have
          # the number of the month (e.g. 3 for March), the number of
          # days in that month in 2015 (e.g. 31 for March), and the
          # number of days in the year before the first day of that month
          # (e.g. 0 for January or 59 for March).
          days_into_year = Table().with_columns(
              "Month", np.arange(12)+1,
              "Days until start of month", np.cumsum(days_in_month) - days_in_month)

          # First, compute the month and day-of-month for each temperature.
          months = temperatures.apply(get_month, "Date")
          day_of_month = temperatures.apply(get_day_in_month, "Date")
          with_month_and_day = temperatures.with_columns(
              "Month", months,
              "Day of month", day_of_month
          )

          # Then, compute how many days have passed since
          # the start of the year to reach each date.
          t = with_month_and_day.join('Month', days_into_year)
          day_of_year = t.column('Days until start of month') + t.column('Day of month')
          with_dates_fixed = t.drop(0, 6, 7).with_column("Day of year", day_of_year)
          with_dates_fixed
Out[103]: Temperature | Date | Latitude | Longitude | Station name    | Day of year
          71          | 127  | 32.7     | -117.2    | San Diego       | 27
          61          | 102  | 34.1167  | -119.117  | Point Mugu      | 2
          56          | 126  | 40.9781  | -124.109  | Arcata/Eureka   | 26
          55          | 111  | 37.3591  | -121.924  | San Jose        | 11
          67          | 127  | 36.3189  | -119.629  | Hanford         | 27
          69          | 130  | 33.6267  | -116.159  | Palm Springs    | 30
          67          | 117  | 32.7     | -117.2    | San Diego       | 17
          79          | 124  | 33.8222  | -116.504  | Palm Springs    | 24
          73          | 116  | 35.2372  | -120.641  | San Luis Obispo | 16
          70          | 128  | 39.1019  | -121.568  | Marysville      | 28
          ... (990 rows omitted)
Do we have values for all days in a year? Let’s find out by checking if any days are missing.
Question 1.3 Set missing to an array of all the days of the year (integers from 1 through 365)
that do not have any temperature readings. Hint: One possible strategy (but not the only one) is to
start with a table of all days in the year, then use either the predicate are.not_contained_in (docs)
or the method exclude (docs) to eliminate all of the days of the year that do have a temperature reading.
In [132]: all_days_with_data = with_dates_fixed.group('Day of year').column('Day of year') # ge
          missing = Table().with_column('DAYS',np.arange(1,366,1)).where('DAYS',are.not_contain
          missing
Out[132]: array([ 14, 33, 35, 57, 60, 76, 80, 81, 85, 96, 102, 103, 130,
143, 178, 181, 186, 210, 215, 227, 247, 258, 264, 270, 272, 294,
319, 344, 354, 358])
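The same idea can be sketched without tables: collect the days that have at least one reading, then keep every day of the year that is not among them. The function name `missing_days` and the toy input are made up for illustration; this is not the cell's actual code.

```python
def missing_days(days_with_data, days_in_year=365):
    """Return the sorted days in 1..days_in_year that never appear in days_with_data."""
    seen = set(days_with_data)
    return [day for day in range(1, days_in_year + 1) if day not in seen]

# Toy input (made up): readings exist only on days 1, 2, 4 and 5 of a 6-day "year".
print(missing_days([1, 2, 2, 4, 5], days_in_year=6))  # [3, 6]
```

Using a set makes each membership check fast and ignores repeated readings on the same day.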
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests
---------------------------------------------------------------------
Test summary
Passed: 3
Failed: 0
[ooooooooook] 100.0% passed
Using with_dates_fixed, we can make a better scatter plot.
In [134]: with_dates_fixed.scatter("Day of year", "Temperature")
Let’s do some prediction. For any reading on any day, we will predict its value using all the
readings from the week before and after that day. A reasonable prediction is that the predicted
reading will be the average of all those readings. Let’s package our code in a function.
In [135]: def predict_temperature(day):
              """
              A prediction of the temperature (in Fahrenheit) on a given day at some station.
              Relies on a `with_dates_fixed` table that has the "Day of year" and "Temperature"
              """
              nearby_readings = with_dates_fixed.where("Day of year", are.between_or_equal_to(d
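The prediction rule described above (average all readings from the week before and after a day) can be sketched in plain Python. The name `predict_temperature_sketch`, the list-of-pairs input, and the 7-day window on each side are assumptions for illustration, since the cell above is cut off in this copy.

```python
def predict_temperature_sketch(readings, day, window=7):
    """Average the temperatures of all readings within `window` days of `day`.

    `readings` is a list of (day_of_year, temperature) pairs.
    """
    nearby = [temp for d, temp in readings if day - window <= d <= day + window]
    return sum(nearby) / len(nearby)

# Toy readings (made up): only the first three fall within 7 days of day 10.
toy = [(3, 60), (10, 70), (17, 80), (18, 100)]
print(predict_temperature_sketch(toy, 10))  # 70.0
```

Note that this sketch raises a ZeroDivisionError if no reading falls inside the window.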
Let’s see how well it works by looking at the values on a random day and seeing what is the
predicted temperature on that day. For example, day 223 is August 11 (8/11, hence the Date as
811) and the temperature varied a lot on that day, depending on the location.
In [136]: with_dates_fixed.where("Day of year", 223)
Out[136]: Temperature | Date | Latitude | Longitude | Station name | Day of year
          63          | 811  | 38.3208  | -123.075  | Bodega       | 223
          75          | 811  | 34.2008  | -119.207  | Oxnard       | 223
          92          | 811  | 40.5175  | -122.299  | Redding      | 223
          91          | 811  | 38.5552  | -121.418  | Sacramento   | 223
In [137]: predict_temperature(223)
Out[137]: 88.7872340425532
Question 1.4 Suppose you’re planning a trip to Yosemite for the day after your INT 5 final, Dec
11, and you’d like to predict the temperature on Dec 11. Use predict_temperature to compute a
prediction for a temperature reading on that day.
Hint: You can figure out what day of the year Dec 11 is by searching with_dates_fixed for a
similar date or by using the fact that Dec 31st is day 365.
In [41]: # convert the date for Dec 11 into a “day of year” value
dec_11_day = 365-20
dec_11_prediction = predict_temperature(dec_11_day)
dec_11_prediction
Out[41]: 63.54545454545455
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests
---------------------------------------------------------------------
Test summary
Passed: 2
Failed: 0
[ooooooooook] 100.0% passed
Below we have computed a predicted temperature for each reading in the table and plotted
both. (It may take a minute or two to run the cell.)
In [43]: with_predictions = with_dates_fixed.with_column(
             "Predicted temperature",
             with_dates_fixed.apply(predict_temperature, "Day of year"))
         with_predictions.select("Day of year", "Temperature", "Predicted temperature").scatter("Day of year")
Question 1.5 The scatter plot is called a graph of averages. In the example in the textbook, the
graph of averages roughly followed a straight line. Is that true for this one? Using your knowledge
about the weather, explain why or why not.
Question 1.6 According to the Wikipedia article on California’s climate, “[t]he climate of California varies widely, from hot desert to subarctic.” Suppose we limited our data to weather stations in a smaller area whose climate varied less from place to place (for example, the state of
Vermont, or the San Francisco Bay Area). If we made the same graph for that dataset, in what
ways would you expect it to look different? In what ways would you expect it to look the same?
1.3 Rainfall and Temperatures
In this exercise we will now look at data from the United States on Rainfall and Temperature. The
goal is to find out if there is a correlation between temperature change and rainfall.
Take a look at how temperatures are predicted and visualized as a spiral in this article:
https://www.climatecentral.org/news/temperature-spiral-update-20399. As you can see, investigating the effect of temperature change is important!
Merging Datasets The data for rainfall and temperature are in two different datasets, so we will
need to merge them before we do any prediction. Below we begin by reading in the data tables.
The dataset has extra columns that we don’t need, and also column labels that have a misleading
space, so we select only the data we need and relabel the columns.
.relabel(" Year", "Year")
.relabel(" Statistics", "Statistics")
.select("Rainfall - (MM)", "Year", "Statistics")
usa_rainfall
Out[13]: Rainfall - (MM) | Year | Statistics
         39.1881         | 1901 | Jan Average
         40.6421         | 1901 | Feb Average
         46.521          | 1901 | Mar Average
         50.2258         | 1901 | Apr Average
         53.4599         | 1901 | May Average
         59.6511         | 1901 | Jun Average
         56.1925         | 1901 | Jul Average
         66.0832         | 1901 | Aug Average
         58.5535         | 1901 | Sep Average
         37.158          | 1901 | Oct Average
         ... (1382 rows omitted)
.relabel(" Year", "Year")
.relabel(" Statistics", "Statistics")
.select("Temperature - (Celsius)", "Year", "Statistics")
usa_temperatures
Out[14]: Temperature - (Celsius) | Year | Statistics
         -5.7112                 | 1901 | Jan Average
         -6.5577                 | 1901 | Feb Average
         -0.0045                 | 1901 | Mar Average
         4.78677                 | 1901 | Apr Average
         12.084                  | 1901 | May Average
         16.9349                 | 1901 | Jun Average
         20.8416                 | 1901 | Jul Average
         19.0919                 | 1901 | Aug Average
         13.9167                 | 1901 | Sep Average
         8.83625                 | 1901 | Oct Average
         ... (1382 rows omitted)
In [15]: usa_rain_and_temp = usa_rainfall.with_column("Temperature - (Celsius)",
                                                      usa_temperatures.column("Temperature - (C
         usa_rain_and_temp
Out[15]: Rainfall - (MM) | Year | Statistics  | Temperature - (Celsius)
         39.1881         | 1901 | Jan Average | -5.7112
         40.6421         | 1901 | Feb Average | -6.5577
         46.521          | 1901 | Mar Average | -0.0045
         50.2258         | 1901 | Apr Average | 4.78677
         53.4599         | 1901 | May Average | 12.084
         59.6511         | 1901 | Jun Average | 16.9349
         56.1925         | 1901 | Jul Average | 20.8416
         66.0832         | 1901 | Aug Average | 19.0919
         58.5535         | 1901 | Sep Average | 13.9167
         37.158          | 1901 | Oct Average | 8.83625
         ... (1382 rows omitted)
Since we know the data is arranged the exact same way between both tables, we can just join
by copying over the Temperature column. It is much easier that way.
In [16]: usa_rain_and_temp = usa_rainfall.with_column("Temperature - (Celsius)",
                                                      usa_temperatures.column("Temperature - (C
         usa_rain_and_temp
Out[16]: Rainfall - (MM) | Year | Statistics  | Temperature - (Celsius)
         39.1881         | 1901 | Jan Average | -5.7112
         40.6421         | 1901 | Feb Average | -6.5577
         46.521          | 1901 | Mar Average | -0.0045
         50.2258         | 1901 | Apr Average | 4.78677
         53.4599         | 1901 | May Average | 12.084
         59.6511         | 1901 | Jun Average | 16.9349
         56.1925         | 1901 | Jul Average | 20.8416
         66.0832         | 1901 | Aug Average | 19.0919
         58.5535         | 1901 | Sep Average | 13.9167
         37.158          | 1901 | Oct Average | 8.83625
         ... (1382 rows omitted)
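Copying a column across is only safe when both tables really do list the same rows in the same order. One way to guard against a silent mismatch is to compare the key columns first. The sketch below uses plain lists of dictionaries rather than the project's tables, and the helper name `copy_column_if_aligned` is hypothetical.

```python
def copy_column_if_aligned(left, right, keys, column):
    """Copy `column` from `right` into `left`, row by row, after checking the key columns match."""
    if len(left) != len(right):
        raise ValueError("tables have different numbers of rows")
    merged = []
    for i, (l_row, r_row) in enumerate(zip(left, right)):
        if any(l_row[k] != r_row[k] for k in keys):
            raise ValueError(f"row {i} differs on key columns {keys}")
        merged.append({**l_row, column: r_row[column]})
    return merged

# First two rows of each dataset, using the values shown in the outputs above.
rain = [{"Year": 1901, "Statistics": "Jan Average", "Rainfall - (MM)": 39.1881},
        {"Year": 1901, "Statistics": "Feb Average", "Rainfall - (MM)": 40.6421}]
temp = [{"Year": 1901, "Statistics": "Jan Average", "Temperature - (Celsius)": -5.7112},
        {"Year": 1901, "Statistics": "Feb Average", "Temperature - (Celsius)": -6.5577}]

merged = copy_column_if_aligned(rain, temp, ["Year", "Statistics"], "Temperature - (Celsius)")
print(merged[0]["Temperature - (Celsius)"])  # -5.7112
```

If the rows were in a different order, the check would raise an error instead of pairing January rainfall with February temperature.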
Question 2.1 Create a scatterplot with the Temperature on the x-axis and Rainfall on the y-axis.
In [21]: # Enter code here to visualize Temperature vs. Rainfall
         usa_rain_and_temp.scatter('Temperature - (Celsius)', 'Rainfall - (MM)')
Comment on your observations about the plot. Do you think there is a relationship between
Temperature and Rainfall? If so, what kind of relationship do you see?
Question 2.2 Using the standard_units function from Section 15.1 of the textbook, compute the
rainfall and temperature in standard units.
In [45]: def standard_units(any_numbers):
             "Convert any array of numbers to standard units."
             return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)
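A small worked example may help show what standard units do: after conversion, the values have mean 0 and SD 1. The sketch below is a plain-Python analogue (the name `standard_units_list` is an assumption). It uses the population SD, which is also what `np.std` computes by default.

```python
from statistics import mean, pstdev

def standard_units_list(values):
    """Plain-Python analogue of standard_units: subtract the mean, divide by the population SD."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

# Toy data (made up) with mean 5 and SD 2.
su = standard_units_list([2, 4, 4, 4, 5, 5, 7, 9])
print(su)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Each result says how many SDs a value lies above or below the mean, whatever the original units were.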
In [145]: usa_rt_su = Table().with_columns('rainfall',standard_units(usa_rain_and_temp.column('
          usa_rt_su
Out[145]: rainfall   | temperature
          -1.24231   | -1.3775
          -1.11881   | -1.46924
          -0.619447  | -0.758983
          -0.304758  | -0.239688
          -0.0300496 | 0.551213
          0.495837   | 1.07697
          0.20206    | 1.50039
          1.04219    | 1.31075
          0.402606   | 0.749848
          -1.41475   | 0.19921
          ... (1382 rows omitted)
Question 2.3 Plot the scatterplot in standard units below. Your plot should look the same
except with different units.
In [139]: # Enter code here to visualize Temperature vs. Rainfall in standard units
          usa_rt_su.scatter("temperature", "rainfall")
Why does it look the same? What is the significance of converting to standard units? What
value do you think r will take?
Question 2.4 Calculate the correlation coefficient, r.
Hint: Use the usa_rt_su table. Section 15.2 explains how to do this.
In [140]: usa_r = np.mean(usa_rt_su.column('rainfall')*usa_rt_su.column('temperature'))
          usa_r
Out[140]: 0.6045966246360028
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests
---------------------------------------------------------------------
Test summary
Passed: 1
Failed: 0
[ooooooooook] 100.0% passed
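The calculation of r as the average of the products of the two variables in standard units can be checked on toy data where the answer is known in advance. The sketch below is plain Python with a hypothetical function name `correlation`: points on a rising line should give r close to 1, and points on a falling line r close to -1.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r = mean of the products of x and y, each measured in standard units."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
print(round(correlation(xs, [2 * x + 1 for x in xs]), 6))  # 1.0
print(round(correlation(xs, [-3 * x for x in xs]), 6))     # -1.0
```

Because both variables are converted to standard units first, r does not change if either variable is measured in different units.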
Question 2.5 Find the regression line and create a function that uses the regression line to give
a prediction on a given temperature in Celsius. In this case the regression line should take the
form of:
rainfall (MM) = slope × temperature (Celsius) + intercept.
Hint: Look at a previous lab, or section 15.2 if you need a reminder on finding the regression
line.
In [142]: def predict_rainfall(pred_temp):
              temperature_sd = np.std(usa_rain_and_temp.column('Temperature - (Celsius)'))
              rainfall_sd = np.std(usa_rain_and_temp.column('Rainfall - (MM)'))
              slope = usa_r*rainfall_sd/temperature_sd
              temperature_mean = np.mean(usa_rain_and_temp.column('Temperature - (Celsius)'))
              rainfall_mean = np.mean(usa_rain_and_temp.column('Rainfall - (MM)'))
              intercept = rainfall_mean-slope*temperature_mean
              prediction = slope*pred_temp+intercept
              return prediction
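The slope and intercept formulas used in this function (slope = r × SD of y / SD of x, intercept = mean of y − slope × mean of x) can be sanity-checked on data that lies exactly on a known line. The sketch below is plain Python, and `fit_line` is a hypothetical name rather than part of the project.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Return (slope, intercept) of the regression line, using slope = r * SD(y) / SD(x)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    r = mean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    slope = r * sy / sx
    intercept = my - slope * mx  # the line passes through the point of averages
    return slope, intercept

# Toy data (made up) that lies exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(round(slope, 6), round(intercept, 6))  # 2.0 1.0
```

On real data the points scatter around the line, so the fitted line gives an average prediction, not an exact value.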
Below we plot the regression line using the prediction function we wrote!
In [143]: def plot_data_and_line(dataset, x, y, point_0, point_1):
              """Makes a scatter plot of the dataset, along with a line passing through two poi
              dataset.scatter(x, y, label="data")
              xs, ys = zip(point_0, point_1)
              plots.plot(xs, ys, label="regression line")
              plots.legend(bbox_to_anchor=(1.5,.8))
In [144]: temp_min = np.min(usa_rain_and_temp.column("Temperature - (Celsius)"))
          temp_max = np.max(usa_rain_and_temp.column("Temperature - (Celsius)"))
          p_0 = [temp_min, predict_rainfall(temp_min)]
          p_1 = [temp_max, predict_rainfall(temp_max)]
          plot_data_and_line(usa_rain_and_temp, "Temperature - (Celsius)", "Rainfall - (MM)",p_
Question 2.6 Using your prediction function, what would you predict the rainfall to be at 30° Celsius?
In [146]: pred_30 = predict_rainfall(30)
pred_30
Out[146]: 71.55854601215052
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests
---------------------------------------------------------------------
Test summary
Passed: 1
Failed: 0
[ooooooooook] 100.0% passed
1.3.1 Bonus: Do your own Analysis!
Keep in mind that this prediction has been made using data from the United States! Try out your
own regression analysis on another country in a similar fashion. More data can be downloaded
on the climate knowledge portal! Based on your results, how do you feel that the rainfall and
temperature are related for that country? How do you think that relationship holds when looking
at the world as a whole?
To submit:
1. Select Run All from the Cell menu to ensure that you have executed all cells, including the
test cells.
2. Save and Checkpoint from the File menu.
3. Read through the notebook to make sure everything is fine.
4. Submit using the cell below.
2 Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that
all images/graphs appear in the output. Please save before submitting!
In [ ]: # Save your notebook first, then run this cell to submit.
        import jassign.to_pdf
        jassign.to_pdf.generate_pdf('project2.ipynb', 'project2.pdf')
        ok.submit()