
COM4509-6509_Assignment_1_UCard_XXXXXXXXX(1)

# Assignment 1 Brief

Deadline: Tuesday, October 29, 2019 at 14:00 hrs
Number of marks available: 20
Scope: Sessions 1 to 5

## How and what to submit

A. Submit a Jupyter Notebook named COM4509-6509_Assignment_1_UCard_XXXXXXXXX.ipynb, where XXXXXXXXX refers to your UCard number.
B. Upload the notebook file to MOLE before the deadline above.
C. NO DATA UPLOAD: Please do not upload the data files used. We have a copy already.

## Assessment Criteria

- Being able to express an objective function and its gradients in matrix form.
- Being able to use numpy and pandas to preprocess a dataset.
- Being able to use numpy to build a machine learning pipeline for supervised learning.

## Late submissions

We follow the Department's guidelines about late submissions, i.e., a deduction of 5% of the mark for each working day the work is late after the deadline. NO late submission will be marked one week after the deadline because we will release a solution by then. Please read this link (https://sites.google.com/sheffield.ac.uk/compgtstudenthandbook/menu/assessment/late-submission?pli=1&authuser=1).

## Use of unfair means

"Any form of unfair means is treated as a serious academic offence and action may be taken under the Discipline Regulations." (from the MSc Handbook). If you are not sure what constitutes unfair means, please carefully read this link (https://sites.google.com/sheffield.ac.uk/compgtstudenthandbook/menu/referencing-unfair-means?pli=1&authuser=1).

## Regularisation for Linear Regression

Regularisation is a technique commonly used in machine learning to prevent overfitting. It consists of adding terms to the objective function such that the optimisation procedure avoids solutions that just learn the training data.
Popular techniques for regularisation in supervised learning include Lasso Regression, Ridge Regression and the Elastic Net. In this Assignment, you will be looking at Ridge Regression and devising equations to optimise its objective function using two methods: a closed-form derivation and the update rules for stochastic gradient descent. You will then use those update rules for making predictions on an Air Quality dataset.

2019/10/27 COM4509-6509_Assignment_1_UCard_XXXXXXXXX(1) localhost:8888/notebooks/COM4509-6509_Assignment_1_UCard_XXXXXXXXX(1).ipynb 2/7

## Ridge Regression

Let us start with a data set for training, $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where each vector $\mathbf{x}_i \in \mathbb{R}^D$ and $\mathbf{X}$ is the design matrix from Lab 3, that is,

$$\mathbf{X} = \begin{bmatrix}\mathbf{x}_1^{\top}\\ \vdots \\ \mathbf{x}_N^{\top}\end{bmatrix}.$$

Our predictive model is going to be a linear model

$$f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x},$$

where $\mathbf{w} \in \mathbb{R}^D$.

The objective function we are going to use has the following form,

$$E(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^{\top}\mathbf{x}_i\right)^2 + \alpha\,\mathbf{w}^{\top}\mathbf{w},$$

where $\alpha$ is known as the regularisation parameter.

The first term on the right-hand side (rhs) of the expression for $E(\mathbf{w})$ is very similar to the least-squares objective function we have seen before, for example in Lab 3. The only difference is the $\frac{1}{N}$ term that we use to normalise the objective with respect to the number of observations in the dataset.

The first term on the rhs is what we call the fitting term, whereas the second term in the expression is the regularisation term. Given $\alpha$, the two terms in the expression have different purposes. The first term is looking for a value of $\mathbf{w}$ that leads the squared errors to zero. While doing this, $\mathbf{w}$ can take any value and lead to a solution that is only good for the training data but perhaps not for the test data. The second term is regularising the behaviour of the first term by driving $\mathbf{w}$ towards zero. By doing this, it restricts the possible set of values that $\mathbf{w}$ might take according to the first term.
The value that we use for $\alpha$ allows a compromise between a value of $\mathbf{w}$ that exactly fits the data (first term) and a value of $\mathbf{w}$ that does not grow too much (second term).

This type of regularisation has different names: ridge regression, Tikhonov regularisation or $\ell_2$-norm regularisation.

### Question 1: $E(\mathbf{w})$ in matrix form (2 marks)

Write the expression for $E(\mathbf{w})$ in matrix form. Include ALL the steps necessary to reach the expression.

#### Question 1 Answer

Write your answer to the question in this box.

### Optimising the objective function with respect to $\mathbf{w}$

There are two ways we can optimise the objective function with respect to $\mathbf{w}$. The first one leads to a closed-form expression for $\mathbf{w}$; the second one uses an iterative optimisation procedure that updates the value of $\mathbf{w}$ at each iteration by using the gradient of the objective function with respect to $\mathbf{w}$,

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}\bigg|_{\mathbf{w}=\mathbf{w}^{(t)}},$$

where $\eta$ is the learning rate parameter and $\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}$ is the gradient of the objective function.

### Question 2: Derivative of $E(\mathbf{w})$ wrt $\mathbf{w}$ (2 marks)

Find the closed-form expression for $\mathbf{w}$ by taking the derivative of $E(\mathbf{w})$ with respect to $\mathbf{w}$, equating to zero and solving for $\mathbf{w}$. Write the expression in matrix form.

Also, write down the specific update rule for $\mathbf{w}$ by using the equation above.

#### Question 2 Answer

Write your answer to the question in this box.

## Using ridge regression to predict air quality

Our dataset comes from a popular machine learning repository that hosts open source datasets for educational and research purposes, the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). We are going to use ridge regression for predicting air quality. The description of the dataset can be found here (https://archive.ics.uci.edu/ml/datasets/Air+Quality).

```python
import pods
pods.util.download_url('https://archive.ics.uci.edu/ml/machine-learning-databases/00360/AirQualityUCI.zip')
```

```python
import zipfile
zip = zipfile.ZipFile('./AirQualityUCI.zip', 'r')
for name in zip.namelist():
    zip.extract(name, '.')
```

We can see some of the rows in the dataset

```python
# The .csv version of the file has some typing issues, so we use the excel version
import pandas as pd
air_quality = pd.read_excel('./AirQualityUCI.xlsx', usecols=range(2, 15))
air_quality.sample(5)
```

The target variable corresponds to the CO(GT) variable of the first column. The following columns correspond to the variables in the feature vectors, e.g., PT08.S1(CO) is $x_1$, up until AH, which is $x_D$. The original dataset also has a date and a time column that we are not going to use in this assignment.

Before designing our predictive model, we need to think about three stages: the preprocessing stage, the training stage and the validation stage. The three stages are interconnected, and it is important to remember that the testing data that we use for validation has to be set aside before preprocessing. Any preprocessing that you do has to be done only on the training data, and several key statistics need to be saved for the test stage. Separating the dataset into training and test before any preprocessing has happened helps us to recreate the real-world scenario where we will deploy our system and for which the data will come without any preprocessing.

We are going to use hold-out validation for testing our predictive model, so we need to separate the dataset into a training set and a test set.

### Question 3: Splitting the dataset (1 mark)

Split the dataset into a training set and a test set. The training set should have 70% of the total observations and the test set the remaining 30%. For making the random selection, make sure that you use a random seed that corresponds to the last five digits of your student UCard.
Make sure that you comment your code.

#### Question 3 Answer

```python
# Write your code here
```

## Preprocessing the data

The dataset has missing values tagged with a -200 value. Before doing any work with the training data, we want to make sure that we deal properly with the missing values.

### Question 4: Missing values (3 marks)

- Make some exploratory analysis on the number of missing values per column in the training data.
- Remove the rows for which the target feature has missing values. We are doing supervised learning, so we need all our data observations to have known target values.
- Remove features with more than 20% of missing values. For all the other features with missing values, use the mean value of the non-missing values for imputation.

#### Question 4 Answer

```python
# Write your code here
```

### Question 5: Normalising the training data (2 marks)

Now that you have removed the missing data, we need to normalise the input vectors.

- Explain in a sentence why you need to normalise the input features for this dataset.
- Normalise the training data by subtracting the mean value for each feature and dividing the result by the standard deviation of each feature. Keep the mean values and standard deviations; you will need them at test time.

#### Question 5 Answer

Write your explanation in this box.

```python
# Write your code here
```

## Training and validation stages

We have now curated our training data by removing data observations and features with a large amount of missing values. We have also normalised the feature vectors. We are now in a good position to work on developing the prediction model and validating it.
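As a rough illustration of the splitting and preprocessing steps asked for in Questions 3 to 5, here is a minimal sketch on a toy dataframe. The helper names (`split_train_test`, `preprocess_train`), the example seed and the toy data are assumptions for illustration only, not the required solution; in the assignment you would run the preprocessing on your training split, not on the full dataset.

```python
import numpy as np
import pandas as pd

def split_train_test(df, train_frac=0.7, seed=12345):
    # Shuffle row positions with a fixed seed (in the assignment, use the
    # last five digits of your UCard) and split 70/30.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    n_train = int(train_frac * len(df))
    return df.iloc[idx[:n_train]], df.iloc[idx[n_train:]]

def preprocess_train(df, target='CO(GT)', sentinel=-200, max_missing=0.2):
    # Tag the -200 sentinel as NaN, drop rows with a missing target,
    # drop features with too many missing values, then mean-impute and
    # z-score normalise. The returned means/stds are the statistics
    # you must save and reuse on the test data.
    df = df.replace(sentinel, np.nan)
    df = df.dropna(subset=[target])
    df = df.loc[:, df.isna().mean() <= max_missing].copy()
    feats = df.columns.drop(target)
    means, stds = df[feats].mean(), df[feats].std()
    df = df.fillna(means)
    df[feats] = (df[feats] - means) / stds
    return df, means, stds

toy = pd.DataFrame({'CO(GT)': [1.0, -200.0, 3.0, 5.0, 2.0, 4.0],
                    'f1':     [2.0, 1.0, 5.0, 6.0, 3.0, 4.0],
                    'f2':     [-200.0, -200.0, -200.0, -200.0, 1.0, 2.0]})
train, test = split_train_test(toy, seed=424)
# For determinism of the illustration we preprocess the whole toy frame here.
clean, means, stds = preprocess_train(toy)
```

After preprocessing, `f2` is gone (too many missing values), the row with a missing target is gone, and each remaining feature has zero mean and unit standard deviation.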
We will use both the closed-form expression for $\mathbf{w}$ and gradient descent for iterative optimisation.

We first organise the dataframe into the vector of targets $\mathbf{y}$ and the design matrix $\mathbf{X}$.

```python
# Write your code here to get y and X
y =
X =
```

### Question 6: training with the closed-form expression for $\mathbf{w}$ (3 marks)

To find the optimal value of $\mathbf{w}$ using the closed-form expression that you derived before, we need to know the value of the regularisation parameter $\alpha$ in advance. We will determine the value by using part of the training data for finding the parameters $\mathbf{w}$ and another part of the training data to choose the best $\alpha$ from a set of predefined values.

- Use np.logspace(start, stop, num) to create a set of values for $\alpha$ in log scale. Use the following parameters: start=-3, stop=2 and num=20.
- Randomly split the training data into what is properly called the training set and the validation set. As before, make sure that you use a random seed that corresponds to the last five digits of your student UCard. Use 70% of the data for the training set and 30% of the data for the validation set.
- For each value that you have for $\alpha$ from the previous step, use the training set to compute $\mathbf{w}$ and then measure the mean-squared error (MSE) over the validation data. After this, you will have num=20 MSE values. Choose the value of $\alpha$ that leads to the lowest MSE and save it. You will use it at the test stage.
- What was the best value of $\alpha$? Is there any explanation for that?

#### Question 6 Answer

```python
# Write your code here
```

Write your answer to the last question here.

### Question 7: validation with the closed-form expression for $\mathbf{w}$ (2 marks)

We are going to deal now with the test data to perform the validation of the model. Remember that the test data might also contain missing values in the target variable and in the input features.

- Remove the rows of the test data for which the labels have missing values.
- If you removed any feature at the training stage, you also need to remove the same features from the test stage.
- Replace the missing values in each feature variable with the mean value you computed on the training data.
- Normalise the test data using the means and standard deviations computed from the training data.
- Compute $\mathbf{w}$ again for the value of $\alpha$ that best performed on the validation set, using ALL the training data (not only the training set).
- Report the MSE on the preprocessed test data and a histogram of the absolute error.
- Does the regularisation have any effect on the model? Explain your answer.

#### Question 7 Answer

```python
# Write your code here
```

Write the explanation to your answer here.

### Question 8: training with gradient descent and validation (5 marks)

Use gradient descent to iteratively compute the value of $\mathbf{w}$. Instead of using all the training set to compute the gradient, use a subset of $K$ datapoints in the training set. This is sometimes called minibatch gradient descent, where $K$ is the size of the minibatch. When using gradient descent with minibatches, you need to find the best values for three parameters: $\eta$, the learning rate, $K$, the number of datapoints in the minibatch, and $\alpha$, the regularisation parameter.

- As you did in Question 6, create a grid of values for the parameters $\eta$ and $\alpha$ using np.logspace, and a grid of values for $K$ using np.linspace.
- Because you need to find three parameters, start with num=5 and see if you can increase it.
- Use the same training set and validation set that you used in Question 6.
- For each value that you have of $\eta$, $K$ and $\alpha$ from the previous step, use the training set to compute $\mathbf{w}$ using minibatch gradient descent and then measure the MSE over the validation data. For the minibatch gradient descent, choose to stop the iterative procedure after a fixed number of iterations.
- Choose the values of $\eta$, $K$ and $\alpha$ that lead to the lowest MSE and save them. You will use them at the test stage.

(3 marks out of the 5 marks)

- Use the test set from Question 7 and provide the MSE obtained by having used minibatch training with the best values for $\eta$, $K$ and $\alpha$ over the WHOLE training data (not only the training set).
- Compare the performance of the closed-form solution and the minibatch solution. Are the performances similar? Are the parameters $\mathbf{w}$ and $\alpha$ similar in both approaches? Please comment on both questions.

(2 marks out of the 5 marks)

#### Question 8 Answer

```python
# Write the code for your answer here
```

Write the answer to your last question here.
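To make the two optimisation routes concrete, here is a hedged sketch comparing the closed-form ridge solution with minibatch gradient descent on synthetic data. Everything here is an illustrative assumption: the function names, the synthetic data, and the form $(\mathbf{X}^{\top}\mathbf{X} + N\alpha\mathbf{I})\mathbf{w} = \mathbf{X}^{\top}\mathbf{y}$, which follows from the $\frac{1}{N}$-normalised objective above; your own expressions should come from your derivations in Questions 2, 6 and 8.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    # Minimiser of (1/N)||y - Xw||^2 + alpha * w'w: setting the gradient
    # to zero gives (X'X + N*alpha*I) w = X'y.
    N, D = X.shape
    return np.linalg.solve(X.T @ X + N * alpha * np.eye(D), X.T @ y)

def ridge_minibatch_gd(X, y, alpha, eta, K, num_iters, seed=0):
    # Same objective, optimised iteratively; each step estimates the
    # fitting term's gradient from a random minibatch of K points.
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        idx = rng.choice(N, size=K, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = -(2.0 / K) * Xb.T @ (yb - Xb @ w) + 2.0 * alpha * w
        w = w - eta * grad
    return w

# Synthetic data standing in for the preprocessed air-quality matrices.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=300)

# Selecting alpha on held-out data, as in Question 6.
alphas = np.logspace(-3, 2, num=20)
val_mse = [np.mean((y[200:] - X[200:] @ ridge_closed_form(X[:200], y[:200], a)) ** 2)
           for a in alphas]
best_alpha = alphas[int(np.argmin(val_mse))]

w_cf = ridge_closed_form(X, y, alpha=1e-3)
w_gd = ridge_minibatch_gd(X, y, alpha=1e-3, eta=0.02, K=20, num_iters=2000)
```

With a small $\alpha$, both routes should land close to the data-generating weights; on the real dataset you would instead tune $\alpha$ (and $\eta$, $K$) on the validation set exactly as the questions describe.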
