Thank you.
Thank you.
Everyone, good morning, and welcome to this talk.
Just before we get started, I've added a short link to the content here.
I'll show it again later when we actually need it, when we start working on the exercises. I see a few people with a laptop, so you might have to work with your neighbours and figure something out together. So, the talk today: maintainable coding in data science. We'll try to write data science code a bit differently than just putting everything in a notebook as a list of commands.
So I hope you all had some good coffee.
I appreciate that it's quite early in the morning.
So we'll start with a little bit of philosophy and try to define what we even mean by maintainable code. Does someone want to start with a definition? Nobody? It's early, so I'll give mine directly, and it's really simple.
The main thing we expect from maintainable code is that it will be easy to modify, because code has to be modified all the time.
So maintainable code is code you hope can be easily modified.
That can mean all sorts of things, and that's what we'll see a bit later.
One of the many principles that exist in the software development world to help define maintainable code is the single responsibility principle, which states that any function or class you write should really try to do one thing.
You shouldn't have a function or class that does all sorts of things at once.
Instead, it should focus on one thing and do it really well, and that allows you to do all sorts of nice things with it later on.
That's really what we'll try to keep in mind in this talk. But in data science it's a little bit more complicated: there are a lot of other constraints that appear because of the nature of what we're trying to achieve, and that's also what we'll discuss here. To do that, I'd like to first go through the main stages of a data science project. At some point you'll have to process some data and engineer new features from the existing data you've got available.
You will have to tune some hyperparameters of your model to try to find the best values.
You will have to train your model on data that you have already pre-processed, and afterwards you will have to generate predictions on new data.
So let's go through all those steps and see what we need to care about to make them maintainable. For the first one, processing some data, what we care about is how easy it is to modify those pre-processing steps.
So maybe you want to add some steps.
Maybe you want to add a scaling step to your pre-processing pipeline.
Maybe you want to remove some.
Maybe you want to modify some.
Maybe you'll have to change a little bit the way you get categorical features out of some complex data.
All of those things you should be able to do as easily as possible if your code is maintainable.
Then, tuning hyperparameters: every time you change your features or your pre-processing steps, your data will change and you will have to retune the whole thing.
So you'd hope that your way of tuning the hyperparameters and finding the best values for your algorithm is quite easy to run, and that you can rerun it every time anything changes to get the best parameters directly and then retrain your model. The same goes for training your model.
Every time you change something in the data, or even if you don't change anything yourself, the data itself will change over time.
So you will need a good process to retrain it constantly. And finally, the one I think is the most interesting: generating predictions on new data.
Here you will have to apply all the same steps that you've done in your pre-processing stage on the training data.
You'll have to apply the exact same transformations, but ideally without repeating any code, because if you start repeating code it becomes much less maintainable:
you have to modify things in two different places at the same time.
So how do you do that? That's something we'll see later, actually right now.
The first thing I'd like to talk about is how you make your feature engineering and data processing stage more maintainable.
So again, let's focus a little bit on what we need to do here for pre-processing.
You will first need to load some data.
You will need to split it into some training data and some test data.
You will probably need to drop some features; you don't need everything there.
You will need to clean some of them, so remove the null values.
Maybe you'll need to remove some outliers as well, all sorts of operations like this.
You need to engineer some features; that's probably where the creative part of the data scientist's job comes in, where you take your existing features and figure out how to convert this data into something more meaningful for your algorithm.
You will probably have to apply one-hot encoding on the categorical features, unless you use an algorithm that supports categorical features, which is usually one of the big problems. Depending on the algorithm you use, you also sometimes have to scale and transform the numerical features that you've got.
That's quite a few things you can do on your data, so let's take a closer look, maybe at number six.
How would you apply all those steps on both the training set and the test set? How do you make sure you don't duplicate code here? The first idea that comes to mind would be to create a reusable function, apply that function on the training set, and apply that function on the test set.
Yeah, that's good, but what if that transformation needs to remember a state? That's the case for one-hot encoding, for example, or for scaling, those kinds of operations that learn something from the training data. Just to make sure we're on the same page, I've got an example with one-hot encoding to make clear what I mean. Say you've got this training data with some countries for multiple people, and then you apply one-hot encoding on it.
That's what you get out of it: country UK for the first row, country Italy for the second, and then you have ones and zeros. Now, say you want to do the same thing on the test data, and sometimes you'll have just one single row that you want to apply the transformation on.
Well, if you've got this data in the test set and you just apply one-hot encoding again, that's the output you would get.
It would learn the new columns it needs to build again from scratch, and build them in the order they appear in the test data, so there is a big mismatch.
That's something you shouldn't do, and it's going to break your algorithm.
So that's a good example of a pre-processing step that needs to memorize some transformation to be applied later.
What we would have liked instead is the same columns as what was learnt on the training data, mapped onto the new data set.
And that's what we see here: for the first row, France, which wasn't in the training data, we've got a row of zeros, which is what we want. Great.
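As a rough sketch of that mismatch and of the fix, with made-up countries:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"country": ["UK", "Italy", "Germany"]})
test = pd.DataFrame({"country": ["France", "UK"]})

# Naive approach: encoding each set independently gives mismatched columns
print(pd.get_dummies(train["country"]).columns.tolist())  # ['Germany', 'Italy', 'UK']
print(pd.get_dummies(test["country"]).columns.tolist())   # ['France', 'UK']

# Stateful approach: learn the columns on the training data only, then reuse them
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["country"]])                        # memorizes the known categories
print(encoder.transform(test[["country"]]).toarray())  # unseen 'France' becomes all zeros
```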
So in that case, we will somehow need a class that is able to save that state, and that's what the scikit-learn transformers allow you to do; they facilitate all of that process.
That will really be our main focus for this tutorial:
looking at all those tools provided by scikit-learn that allow you to write code in a more modular way and overcome all those issues. So, some transformers that you have surely used in the past:
the LabelEncoder; the famous one, the OneHotEncoder, which we've talked about already; and I suppose everyone has used the StandardScaler as well to do a scaling transformation.
So all of those will have the same API, the same interface.
They will have a fit method that learns a transformation and saves it in their internal state.
For a StandardScaler, it learns the weights of the scaling it has to do, and then, whenever you call transform, it doesn't change anything in its internal state.
It just applies
the same transformation it has learned before on any new data.
So that's the interface.
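A minimal illustration of that fit/transform contract, with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = np.array([[1.0], [2.0], [3.0]])

scaler.fit(X_train)               # learns the mean and standard deviation from the training data
print(scaler.mean_, scaler.scale_)
print(scaler.transform([[4.0]]))  # applies that same scaling to new data, without re-learning
```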
It's the interface they will reuse over and over again. Some other examples of transformers: PCA is actually implemented as a transformer in scikit-learn. It has a .fit method that learns the linear combination going from your original features to the principal components, and then, when you call .transform, it doesn't learn anything again.
It just applies that transformation.
Another one, maybe less known, is the FeatureUnion, which is quite useful for applying several transformers to some numerical data and getting back a single new array with all the outputs.
For example, a FeatureUnion of PCA and PolynomialFeatures will look like this:
I've got a data frame with some values.
I apply a FeatureUnion of PCA and PolynomialFeatures, and I get a new data frame with a lot of new columns generated for me.
Some of them will be the output of the PCA; here I keep PCA with n_components equal to 2, so I keep the two principal components, and then PolynomialFeatures generates all the polynomial values for each of the columns in my previous data set. That's interesting, because I've used this transformer that takes a list of other transformers as input and is itself a transformer that can generate this new data.
So here we're starting to combine things together, and that's what we'll do more and more later in this tutorial. Now I'm able to generate the two first components of the PCA and the polynomial features all at once.
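A sketch of that kind of FeatureUnion; the column names and values are invented for the example:

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [0.5, 0.1, 0.9, 0.3],
})

union = FeatureUnion([
    ("pca", PCA(n_components=2)),            # keep the two principal components
    ("poly", PolynomialFeatures(degree=2)),  # all polynomial combinations of the columns
])

# The union is itself a transformer: fit learns both sub-transformations at once
transformed = union.fit_transform(X)
print(transformed.shape)  # 2 PCA columns + 10 polynomial columns, side by side
```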
Then you can also build your own transformer objects.
To do that, you will need to inherit from two different classes that are implemented for you in scikit-learn: TransformerMixin and BaseEstimator.
Those just contain some hidden methods that scikit-learn needs in order to work with your tools, and then all you have to do is implement the fit method and the transform method.
Then you can use this class everywhere you want as a transformer object, so you could combine it with the FeatureUnion like we've done here, or with all the other things we will see later.
If you need hyperparameters, and that will become really handy later, you can just add an __init__ method, pass it some parameters, and use those as hyperparameters later.
So here's an example: I've got this custom scaler which, as I said earlier, inherits from TransformerMixin and BaseEstimator.
It has a fit method that is called on some training data, where it computes the median and the interquartile range of the training data.
It memorizes them with the self keyword, and then, whenever I call transform, it doesn't need to learn anything anymore.
It can take any data with the same number of columns as my training data and will just apply that transformation, which is removing the median and dividing by the interquartile range.
So it's super easy to build your own transformations like this.
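A minimal sketch of a transformer along those lines; the attribute names are my own, not necessarily the ones from the talk's repository:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomScaler(BaseEstimator, TransformerMixin):
    """Scale numerical data by removing the median and dividing by the interquartile range."""

    def fit(self, X, y=None):
        # Memorize the state learned from the training data
        self.median_ = np.median(X, axis=0)
        self.iqr_ = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
        return self  # fit must always return self

    def transform(self, X):
        # Apply the transformation learned during fit to any new data
        return (X - self.median_) / self.iqr_
```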
They will memorize whatever state you need, and in the exercise coming up soon we'll do that together with some transformations that are really common to do on data.
Actually, the exercise is right now.
So if you go to that URL, it will bring you to a GitHub repository where you've got all of that, including an exercise.
Once you've cloned that repository, you can open the exercise notebook, which is where all the instructions are.
That's not where we're going to write the code; it's where the instructions are.
Okay, I'll give you a minute to do that.
So, there are multiple things to show here.
When you open the notebook, you'll have instructions.
The first thing we do here is set this notebook magic, which I don't know if everyone has used before, but it's a really useful trick if you want to develop code in an editor, somewhere else, and still load it in the notebook.
If I run those two commands, they set the autoreload mode, meaning that for whatever function I import from a file I'm writing, if I modify the file itself, the notebook automatically reloads the latest version.
So I don't need to restart the notebook every time I make a modification in my file; that allows me to live-update the file in my editor and test and prototype it in the notebook, so you get the best of both worlds.
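The two commands in question are the standard IPython autoreload magics, run in the first cell of the notebook:

```python
# Reload any edited module automatically before executing code
%load_ext autoreload
%autoreload 2
```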
So if I run that here, we provide a function that loads the data set... oops, that doesn't work.
You need to install joblib: pip install joblib. We'll talk about this library later.
Still not working... anyway, I'll just uncomment it, and in the meantime, while I'm solving that, I can at least explain what the exercise is.
What I'm failing to load here is the data set.
And here I'm just zooming in on the error I'm getting.
So what I'm trying to do is... well actually, I can just load it, yeah.
That's what should have been loaded with the command below, and if you have joblib and scikit-learn installed properly, that should work for all of you.
So that's the data I'm trying to load.
So that's the Kickstarter data set.
It's quite a common one in machine learning competitions and things like that, and it contains some information about Kickstarter campaigns.
A Kickstarter campaign can be launched on the website at a certain time with certain attributes, so here you've got information like the description and the name of the campaign they're trying to launch, and the goal: they're trying to raise money, so how much money are they trying to raise.
You've got the country they're trying to raise it in, the currency, and you also have the currency exchange rate,
so you can actually convert from one currency to another. You've also got information about the location, the category of the campaign, and all sorts of information like this, and what we're trying to predict is whether that campaign will be successful or not.
We've got that information in the other file, y_train; that's it.
So, does that command run for everyone else but me? Great.
So it's just me then. Perfect.
So that's the first thing I wanted to explain, and the second thing is that we're providing a first skeleton for our models.
What we've decided to do is divide our model into two main files. You'll have transformers.py, which is the place where you define all your transformer objects.
All the pre-processing steps that we want to do,
we save them in this transformers.py as classes with the .fit and .transform methods.
We will implement all of those together.
You will see the first one, the categories extractor, which is supposed to extract a category from the JSON, and this one as well... yeah, all of them, alright.
So this class here is the transformer that will allow you to extract some JSON contained in the category field.
We'll take a closer look later.
You've got this one, the goal adjuster: once we've implemented it, it will allow you to convert all the goals that are given in anything other than dollars into dollars, using the exchange rate.
You also have this other pre-processing transformer that we'll implement, which just loads some dates from the data and computes the difference between two dates, to see how many days there were between, say, the launch of the campaign and whatever deadline they're working towards.
And this one is a transformer that allows you to extract the country from the data set and map it onto a larger area,
so you can say, for example, that we'll group together all the countries in Europe.
Those are the transformers we will implement, and then there is a model.py, which is where you put all the pieces together. In there we've got that load_dataset function, the one that wasn't working for me and that I'll fix in a second; then you've got a build_model function that we'll explain a bit more later, and other functions that work as entry points to tune and train your model, and so on.
So that's the general architecture of our model. For now we'll focus on the transformers only.
If you take maybe ten minutes, that should be enough to go through part one of this notebook, which explains how to implement all those custom transformers I've talked about, starting with the categories extractor. As you can see, in the notebook we only have code to test your implementation; the implementation itself is done in the file.
So if you open the file transformers.py, edit it on one side, and then go to the notebook to test your implementation, it should all work perfectly fine. So take five to ten minutes to do that. Perfect, all right, has everyone got it working?
Remember as well the main structure to build your own transformer.
Something important here is that your fit method, even if you don't want to do anything in the fit stage, which is the case for the first few transformers we're building, needs to return self.
That's just what scikit-learn expects.
If you don't have it return self, you'll get some nasty bugs.
So let's try to do that.
The categories extractor.
As I explained earlier, this one is meant to read that field called category and extract certain categories out of it. So first, let me actually show that field a bit more.
If I do X_train.category, you can see what it looks like: it's a string in JSON format.
Here, let me just zoom in a bit so you can take a look at it.
It has this field called slug, which is the only one we care about here.
The first part of the slug field will be our generic category for the campaign, and the second one will be the more precise category.
So out of this specific column, we want to extract those two fields. Let's implement a transformer for that.
Here we're already providing two methods. The first one, the __init__, defines a hyperparameter called use_all.
That's just a trick I added that lets you decide whether you want to extract all the categories, or only the categories that you've hard-coded initially.
The idea is that there might be too many random categories in all those JSON fields.
So I've pre-selected the main ones that I was interested in, and then I've added a default for those that I'm not interested in.
If you look at the helper method that I've written here, it basically loads this string into a dictionary using JSON and gets the slug field.
Then it gets the two different values as a tuple, and here I'm just saying that if you don't set the use_all parameter, you filter:
if the first category is not in your list of categories you care about, you return a default,
and if the second category is not in your list of things you care about, you also return a default.
It's just a way to make sure that we don't end up with too many dummy features later on.
So now we can implement the two methods that we need. The fit method here:
nothing to do.
We don't need to learn any internal state;
we just extract the information from the JSON, so I can just return self directly.
For the transform method, I get the data frame and I need the category column, so I'll do category = X.category, to make it easier to work with, and then I need to return a new data frame out of that.
So I'll return pd.DataFrame, and this data frame needs a column called gen_cat for the generic category, which will be my category column with a lambda applied to it that takes x and returns self.get_slug(x).
It basically applies that helper method on each of the JSON strings and takes the first value.
So that should work.
If I run that, the first column works fine, so I can create precise_cat, which is the exact same thing but returning the second value instead, and if I run that now, I get my two columns.
So that's pretty powerful: I've got a nice class that is easy to maintain.
I can add more functionality there if I want to later.
I can change the way I compute those columns fairly easily by just modifying my transform method.
So it's quite easy to maintain, and as far as my top-level code is concerned, I'm just calling this extractor in the same way every time, because it always has the same interface.
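A sketch of roughly where that class ends up; the helper and column names follow what I've described, but treat the exact filtering behind the use_all whitelist as omitted here:

```python
import json
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CategoriesExtractor(BaseEstimator, TransformerMixin):
    """Extract a generic and a precise category from the JSON 'category' column."""

    def __init__(self, use_all=True):
        self.use_all = use_all  # hyperparameter: keep every category, or only a pre-selected list

    def get_slug(self, text):
        # The slug looks like "music/rock": the first part is the generic category,
        # the second the precise one
        slug = json.loads(text)["slug"]
        parts = slug.split("/")
        generic = parts[0]
        precise = parts[1] if len(parts) > 1 else "other"
        return generic, precise

    def fit(self, X, y=None):
        return self  # nothing to learn here

    def transform(self, X):
        category = X["category"]
        return pd.DataFrame({
            "gen_cat": category.apply(lambda x: self.get_slug(x)[0]),
            "precise_cat": category.apply(lambda x: self.get_slug(x)[1]),
        })
```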
Then the goal adjuster. This one should be fairly easy; it's the same thing:
I don't need to do anything in fit, so I can return self. For transform,
I just need to return a data frame that has an adjusted_goal column, which will be X.goal multiplied by X.static_usd_rate. That should do it; if I run that, I get what I was expecting.
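The resulting transformer is tiny; a sketch using the column names as described:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GoalAdjuster(BaseEstimator, TransformerMixin):
    """Convert every campaign goal into US dollars using the exchange rate column."""

    def fit(self, X, y=None):
        return self  # stateless: nothing learned from the training data

    def transform(self, X):
        return pd.DataFrame({"adjusted_goal": X["goal"] * X["static_usd_rate"]})
```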
So I think you get the point with those.
The time transformer actually does a bit more work, so I'm just going to copy-paste it from the solution directly. For the time transformer, we load all our dates into datetime objects.
Here you see that we're actually multiplying the timestamps by a constant value, because they were not given in the right format.
That gives you three timestamp objects: the deadline, the created date and the launch date, and then we can start creating some more creative features.
That's some nice feature engineering, where we say: what we really care about is how much time there was between when the campaign was launched and its deadline, which we can easily compute as a number of days, and the same thing for how much time was spent between when you created the page and the launch time of your campaign. So let's look at this one: it returns two nice columns with that information.
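A sketch of that transformer, assuming the timestamps are Unix seconds and the columns are named deadline, created_at and launched_at:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TimeTransformer(BaseEstimator, TransformerMixin):
    """Turn raw timestamp columns into campaign-duration features."""

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X):
        # Parse the Unix timestamps into proper datetime objects
        deadline = pd.to_datetime(X["deadline"], unit="s")
        created = pd.to_datetime(X["created_at"], unit="s")
        launched = pd.to_datetime(X["launched_at"], unit="s")
        return pd.DataFrame({
            "launch_to_deadline_days": (deadline - launched).dt.days,
            "create_to_launch_days": (launched - created).dt.days,
        })
```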
Then the last one: grouping some countries together, for example. Here I've created a map saying that all those countries should go into the following groups, and it's a fairly simple transformer where I'm just calling the pandas .map method to map all of them.
So the country transformer: if I do .sample on its output, it returns the groups, and here you've got all the different groups.
So that's it for custom transformers.
Next, we will look at something a bit more advanced. One really cool thing that was added recently in scikit-learn, in version 0.20, is the ColumnTransformer. How many of you have used it before? Okay. It's really useful: it allows you to map different transformers to the columns that you're interested in.
That lets you say: this specific transformation should be applied to, say, all my numerical columns,
and this specific transformation should be applied to all my categorical columns, and that allows you to start creating rules about which transformation should be applied to which part of your data frame.
Here's an example: we've defined some numerical columns, the age and salary, and some categorical columns, the country and the gender, for example, and we create a new object.
We say it's a ColumnTransformer.
It takes a list of tuples, and each tuple has three different elements.
The first one is the name you want to give it.
The second one is the actual transformation, and the third one is the list of columns to apply it to.
So here I'm saying: I want to apply PCA not on my whole data frame, but only on those numerical columns, and I want to apply one-hot encoding only on those categorical columns.
That's really powerful if you combine it with what we've seen before.
You can start building transformations and then decide to put them all together in one single object that maps them to different parts of your data frame.
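Here is a sketch of that example; age, salary, country and gender are just the column names from the slide:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

numerical = ["age", "salary"]
categorical = ["country", "gender"]

preprocessor = ColumnTransformer([
    # (name, transformer, columns it applies to)
    ("pca", PCA(n_components=2), numerical),
    ("ohe", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# preprocessor.fit(X_train) learns the state of every sub-transformer,
# preprocessor.transform(X_test) reuses it on new data
# (assuming X_train / X_test are DataFrames with those columns).
```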
Using all of that, you're able to build a master preprocessor object that does all the transformations at once and memorizes all the state it needs to.
Here's a more visual example of the same thing: we apply PCA on the numerical columns only, which gives us a new data frame, and we apply one-hot encoding on the categorical columns only, and then the ColumnTransformer just works as a normal transformer: once you've created that object, you can call .fit on some training data.
It will learn the internal state for all the transformations you've provided inside it, and then .transform will apply that to any new data.
Of course you can do that on your training data, but you can also do it on some test data that you want to process, or later on some new data that you need to generate predictions for on the fly.
So it's a very nice tool.
Another one, the Pipeline, which is older, allows you to do not things in parallel like the ColumnTransformer does, but sequential operations.
For example, you might want to scale some data first and then apply a PCA transformation on it.
So you just call the Pipeline object and, again, pass a list of tuples where you define what it is you want to do: here I create an object called scaler that scales my data, and then a PCA.
All those transformations will be done one by one on all the columns available in your data. So you fit on the data, then transform; that's what it will look like.
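A minimal sketch of that sequential pipeline:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = Pipeline([
    ("scaler", StandardScaler()),   # step 1: scale the data
    ("pca", PCA(n_components=2)),   # step 2: apply PCA on the scaled data
])

# pipe.fit(X) runs the steps in order on the training data;
# pipe.transform(new_X) replays the same learned transformations sequentially.
```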
If you combine those two, you can already anticipate that you can start building pretty complex pipelines, with the Pipeline doing sequential operations and the ColumnTransformer mapping them to specific columns only. Yes, so it can be used together with the ColumnTransformer, and here's an example of that. We're going in two steps: first we create a Pipeline object that we want to use only for our numerical features, and that pipeline scales the data and then applies a PCA transformation on it.
That's our first transformer, and then we build a top-level ColumnTransformer object that applies this specific pipeline we've created on the numerical features only, and for the categorical features it applies one-hot encoding. So you can start building more and more complex things.
The great thing here is really the second point: you're making everything really modular. Every sort of operation, every piece of functionality in your code, becomes a separate class, a specific transformer that does one specific thing really well; you can test it later, and then you're just building pipelines by combining them together.
That allows you to keep your code quite clean.
The only thing we're missing so far is the predictive model part of our code. Well, actually, that's not completely true, because a Pipeline can be used with transformers, but it can also be combined with a predictive model.
If you use as the last step of your pipeline not a transformer but an actual model that has a .predict method, the pipeline will automatically inherit that predict method and be able to generate predictions on your data directly.
That means you can do something like this:
you have your preprocessor that we've mentioned before, which here is a ColumnTransformer object that does something on the numerical columns and something on the categorical columns, and then we put that inside another pipeline that has a preprocessor stage.
That's your first level, and then the second level is whatever algorithm you want.
So here we've chosen a decision tree.
The great thing about that is that your model isn't just an algorithm anymore, and your hyperparameters don't live only on your model anymore.
It's not just about choosing the best parameters for your decision tree.
It's about choosing the best parameters for all your pre-processing steps together with your final algorithm, the decision tree here. Maybe changing the number of components that you keep in your PCA will affect the best parameters to choose for your decision tree, so it's natural to have all of those tuned together and trained together.
That lets you represent a model not just as an algorithm, but as a bunch of pre-processing steps plus an algorithm at the end. It's also really good for avoiding data leakage, because all the steps run on whatever batch of data you provide: if you're doing cross-validation, all the pre-processing is done only on the single batch you're providing, instead of potentially leaking information from other batches, as could happen if you did the pre-processing before the cross-validation.
And you can easily apply that pipeline to any new data: once you've managed to train it, it's just a matter of calling .predict, and it will go through all the pre-processing steps plus the predictive part. Finally, it's more modular, so it will be much easier to test and maintain. We've got an exercise for this as well, which should be fairly short.
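Before the exercise, here's a sketch of what such a combined object can look like; the column names are illustrative, not from the Kickstarter data:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

numeric_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "gender"]),
])

model = Pipeline([
    ("preprocessor", preprocessor),        # first level: all the pre-processing
    ("model", DecisionTreeClassifier()),   # second level: the predictive algorithm
])

# model.fit(X_train, y_train) fits every step; model.predict(X_new) replays
# the whole chain, so cross-validation only ever sees the batch it is given.
```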
So maybe let's take five minutes for you to have a look. This part is in the model.py file, where we've defined functions that put all our transformer objects together.
The first one, the most important one, is the function that builds the model.
Here are the two first steps that we're providing.
It creates a pipeline for the categories that we extract with that CategoriesExtractor object:
what it does is first call the transformer and then call the one-hot encoder to create dummies out of that. We also have another pipeline for the country, with the country transformer.
Same thing:
it's just creating dummies out of the categorical features. So those are the steps that are provided.
The exercise here is to create a ColumnTransformer object that has all the steps we want:
it will have the category processor, it will have the country processor, and it needs the time transformer and the goal adjuster as well.
So yeah, take a look for the next few minutes. All right.
So let's do that part together now.
The first step is to create your main ColumnTransformer object,
which will basically put everything together at this stage, so we'll just create a new ColumnTransformer. Have I even imported it?
Yep. So I'll do preprocessor = ColumnTransformer, and here I need a list of tuples. The first one will be for my categories,
so I'll just apply the cat processor, and I need the list of columns.
Here that's just the column category, because my categories are actually extracted from the category column alone.
So if I give this transformer object only the category column, that should be enough. Then I need the countries,
which will be my country processor, and there I just need the column country.
Then, what else? I've got the goal adjuster, which needs the goal column and the static_usd_rate column.
So I'll call that... how does the solution call it? Goal. So that one will be the goal adjuster, and it will need the goal and static_usd_rate columns.
And finally I want the time transformer as well, so I'll call that time:
the time transformer, and it will need those date columns, deadline, created_at and launched_at. Right... oops, still missing one thing. So that's the preprocessor object.
So now I've got my pre-processing stage.
Next I need to put it together with some predictive model, so I'll just create something called model,
which will be a Pipeline with two stages: preprocessor, which is the preprocessor
I defined just above, that's my first stage, and then my second stage will just be some algorithm.
Let's call it model, and I think I've already imported the decision tree somewhere, yeah.
So let's just use a DecisionTreeClassifier with the default parameters for now, because we're going to tune it anyway.
Did I do something wrong? Is it going to work? It doesn't work... oh yeah,
I need to pass... okay, that's my fault,
I need to pass everything as a list here.
Okay, right, and now it's working.
So that's pretty cool, because our whole model can now be built with this single function: we can return a new instance of our model, including all the steps it needs, with one single function.
So it's quite easy to change the pipelines if I want to: the second-to-last step creates a preprocessor with all the features that I need to build, and the very last one puts it together with some algorithm.
From a top-level point of view, I just need to instantiate a new model by calling that function.
I can fit it on some data, and then I call predict, and it works just like any algorithm, even though it has all the pre-processing steps included in it.
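So the build_model function we ended up with looks roughly like this; it's a sketch of what we wrote live, and the exact names in the repository may differ slightly:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# The custom transformers defined in transformers.py
from transformers import (CategoriesExtractor, CountryTransformer,
                          GoalAdjuster, TimeTransformer)


def build_model():
    cat_processor = Pipeline([
        ("extractor", CategoriesExtractor()),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ])
    country_processor = Pipeline([
        ("transformer", CountryTransformer()),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ])

    preprocessor = ColumnTransformer([
        ("categories", cat_processor, ["category"]),
        ("countries", country_processor, ["country"]),
        ("goal", GoalAdjuster(), ["goal", "static_usd_rate"]),
        ("time", TimeTransformer(), ["deadline", "created_at", "launched_at"]),
    ])

    return Pipeline([
        ("preprocessor", preprocessor),
        ("model", DecisionTreeClassifier()),
    ])
```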
So next, let's go back to the slides.
Now we've got about everything we need to build the model, and we're ready to start thinking about training it and tuning it and all of that.
The first thing we need to think about: training our model is something that can take a while, and when we're generating predictions we need somehow to be able to load a model that we've trained previously.
So we want to be able to save that trained model on disk and load it later, say if you want to put it in production just to generate predictions; you don't want to retrain it every time. That's what model persistence means. There are two main ways of doing it
with Python, serializing objects: you have pickle,
which is quite well known, and joblib,
which is a bit less known but pretty much equivalent in terms of performance. In their documentation,
joblib says it tends to be more efficient with larger arrays, so we might prefer it when working with pandas and numpy objects.
Something cool with joblib is that the interface is just a bit easier to use.
This is how you pickle your model using the pickle library: you have to open the file in binary mode, then dump the object, and then open the file again in binary mode to load it.
If you want to do that with joblib, it's just a bit easier, because it abstracts away opening the file in binary mode.
So you can just call joblib.dump: you've got your model, which is your single object, the pipeline that has been trained and has everything you need in it, then you specify your file, and it dumps the whole object to that file. Later on, if you want to load it, it's really easy: you just call joblib.load and pass the path you want to load from.
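In code, the joblib side is just this; the model below is a stand-in for the trained pipeline, and the file name is whatever you choose:

```python
import joblib
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier().fit([[0], [1]], [0, 1])  # stand-in for the trained pipeline

joblib.dump(model, "model.joblib")    # serialize the whole object to disk
loaded = joblib.load("model.joblib")  # later: load it back, with the same library versions installed
print(loaded.predict([[1]]))
```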
Just keep in mind that it has some limitations.
Pickle and joblib will only serialize the specific object that you're passing, which is actually why it's important to have one single object, but they do not save the dependencies.
You will still need the code the object relies on in order to run it:
that means all the libraries it's calling,
but also your custom classes; you will need those files to be able to run it.
Also, the data is not saved.
So if you need to retrain your model, you also need to make sure that you have snapshots of your data.
Another thing to keep in mind when you're using those libraries is that the versions are really important.
Since loading relies on whatever libraries you've got installed, if the model was saved with different versions, you can often run into problems.
So it's really important that you keep track of which versions you pickled your model with,
and make sure that whatever environment you load the pickle in has the exact same versions and dependencies installed.
The last comment here is that you can pickle any object in Python, so make sure you don't load a pickle that comes from somewhere you don't trust, because it could contain anything, including scripts that mess everything up. Keep that in mind. So we've got an exercise for this, where we're going to implement the last pieces.
We only need three things now to be able to run our model from the command line.
So we need a method to tune it.
We need a method to train it and we need a method to test it.
So we've got... yeah, we've got time.
Let's take another five minutes to look at that.
It should be fairly easy.
For tuning, for example,
you will need to call your function
that builds a new instance of the model, which gives you a fresh instance.
Then you can use grid search directly on it,
since grid search supports pipeline objects.
What we have defined in the config.py file are the grid parameters, so you can uncomment those or come up with your own parameters to tune.
For example, here we're saying that we want to tune the max depth of the model stage of the pipeline.
The max depth of the model stage is the max depth of our decision tree.
We want to tune the min_samples_split of that decision tree as well:
model, double underscore, then the name of the attribute on the decision tree. We also want to tune some of the parameters of the preprocessor object.
On our preprocessor object we can change use_all from false to true.
So those are a few examples.
Actually, I can give you a tip here.
If you do model.get_params(), it gives you all the parameters that are in your pipeline object, in your model.
It tells you all the steps you've got; let me zoom in a bit more.
It's quite a big one.
If you do .keys() on that, it tells you all the parameters that you can tune.
That's where you see
all the parameters that are defined on your model, like class_weight; all of those live on my decision tree, so I can tune them, but I can also tune everything that is defined on my preprocessor.
That's
the one I've chosen to tune here. So you can access all those parameters and provide them in the grid search.
Actually, it's maybe better if I just do it straight away, so let's do it together. The first step for tuning the model is to load the data:
X_train, y_train = load_dataset... no, okay, I'll just copy that part from up here. Right, so first I load the training data.
Then I'm instantiating a new model, so I call build_model, and then I can call grid search.
Have I imported grid search? Yes, I have it
there, so I can define a new grid search object:
gs = GridSearchCV.
I pass my model directly.
I pass the parameters that I've defined in config, so I'll first uncomment them
and import them. Okay, and here I'll set cv to, say, three, and n_jobs to minus one.
Then I do gs.fit on X_train, y_train, and I can just print the best score. Here you don't necessarily want to just print it;
maybe you want to actually get those parameters and automatically retrain your model with them, but to keep things simple
I'll just do it like this: best_score_. Let's see if that works.
So if I call that... oh, I need to load X_train and y_train as well... right, now it's tuning, and it's telling me the best score. I can also ask it
to tell me the best parameters.
It tells me the best parameters not just for my algorithm, but for my whole pipeline.
So it's telling me that I should actually use a max depth of 9 and those parameters for the decision tree, but it's also answering the question I had at the beginning of this tutorial: should I use all the categories I get from the category column, or only a subset of them? Well, it's telling me that I should use all of them, together with those parameters for the tree.
So that's quite nice; it allows you to add some extra parameters to tune in there.
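Putting that together, a sketch of the tune_model entry point as we just wrote it; the grid values are examples, and load_dataset and build_model are the helpers from model.py, whose exact signatures I'm assuming:

```python
from sklearn.model_selection import GridSearchCV

from model import build_model, load_dataset  # assumed helpers from the repo's model.py

# Example grid: the double underscore walks down into the pipeline steps,
# so the exact keys depend on the names you gave each step
GRID_PARAMS = {
    "model__max_depth": [3, 5, 9],
    "model__min_samples_split": [2, 10],
    "preprocessor__categories__extractor__use_all": [True, False],
}


def tune_model():
    X_train, y_train = load_dataset("train")  # hypothetical signature
    model = build_model()                     # fresh pipeline instance
    gs = GridSearchCV(model, GRID_PARAMS, cv=3, n_jobs=-1)
    gs.fit(X_train, y_train)
    print(gs.best_score_)
    print(gs.best_params_)
```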
We'll do the same for training. For training we need those same two first lines,
then we do model.fit(X_train, y_train).
That's quite simple, and then the last step is to save it to a file.
So we do joblib.dump; we have one single object to save, which is our model, and let's give the file a name: model.joblib.
I don't have a model.joblib file yet, so I can see if it worked.
If I run that... oops, joblib still isn't imported, I commented it out, creating traps for myself... and now it has trained, and you can see that it has saved the file here.
It's a binary file, so I can't open it, but it has been saved to disk, so I'm able to load it later, and that's actually the next exercise: loading it and generating predictions.
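And train_model is roughly this, with the same assumptions about the helpers:

```python
import joblib

from model import build_model, load_dataset  # assumed helpers from the repo's model.py


def train_model():
    X_train, y_train = load_dataset("train")  # hypothetical signature
    model = build_model()
    model.fit(X_train, y_train)         # fits every pre-processing step plus the tree
    joblib.dump(model, "model.joblib")  # persist the whole trained pipeline
```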
So for test_model, I'm just loading with joblib, and that will be my model.
Then I need to load some data as well, so I'll use the same thing here,
but it's not X_train, it's X_test and y_test, and I need to import that at the top as well.
Once I've got that, I can easily compute the accuracy score, the classification report,
anything I want.
I just do
y_pred = model.predict(X_test), and then I can just print
the accuracy, formatted
from y_test and y_pred.
Let's try that... that's the wrong one... here it is.
Right, so now I'm also able to easily test my model, and I can add... nope, okay, right, and now I've got the classification report as well.
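test_model then mirrors it, again as a sketch:

```python
import joblib
from sklearn.metrics import accuracy_score, classification_report

from model import load_dataset  # assumed helper from the repo's model.py


def test_model():
    model = joblib.load("model.joblib")    # the pipeline saved by train_model
    X_test, y_test = load_dataset("test")  # hypothetical signature
    y_pred = model.predict(X_test)         # runs all pre-processing, then predicts
    print("Accuracy: {}".format(accuracy_score(y_test, y_pred)))
    print(classification_report(y_test, y_pred))
```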
So that's quite nice. Now I've got pretty much all the code
I need, properly modular and separated, so I can easily make modifications. I can modify the existing transformers in this file by changing the way, say, I compute the category or anything else here;
I just need to focus on the transform method inside my class.
If I want to add an extra step, I just create a new transformer.
I can test it individually to make sure it works fine, and then in model.py
I just add it in build_model as an extra step: either I create a pipeline object for it, or I add it directly to my top-level ColumnTransformer.
If I want to change the algorithm I'm using, that's quite easy to do here as well; then I just modify the parameters I want to tune to match both my algorithm and my steps, and I can do all of that together here. We're also providing a command-line tool for all of this, because it's easier to run it directly from the command line
instead of having to start a shell and call the functions yourself.
So we've just defined run.py, which
basically imports our three main entry points, tune_model, train_model and test_model, and then just uses the argparse library to expose them as arguments on the command line.
So if I test that now... did you see the exercise? Yes... let me zoom in here: if I do python run.py tune, it will tune my model, as it says.
So you can re-tune it easily.
Every time you change the steps in your code,
you don't actually have to go through the notebook anymore.
You can just run it here and see how that improved your cross-validation score, and what the best parameters are that you should use when you train your final model. I can also run train if I want to train my model again and save it as a joblib-serialized file, and then, whenever I run test, I can see the final value of my score, the classification report, anything I'm interested in really.
We could implement another entry point for predict, which would load the serialized model and generate predictions on any new data that you provide.
That would be the command you would use if you put that model in production.
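A sketch of what such a run.py can look like; the entry point names are the ones we defined, and the argparse wiring is my own minimal version:

```python
import argparse

from model import test_model, train_model, tune_model  # the three entry points

COMMANDS = {"tune": tune_model, "train": train_model, "test": test_model}

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tune, train or test the Kickstarter model")
    parser.add_argument("command", choices=list(COMMANDS))
    args = parser.parse_args()
    COMMANDS[args.command]()  # e.g. `python run.py tune`
```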
A few comments on things that I've skipped a little bit. First, it's quite important:
we're going to be loading data that will change.
Sometimes you will have large training data sets.
Sometimes you'll have a single row, so it can be quite risky to just rely on pandas inferring
what types your columns should be.
So it's really important, whenever you want a maintainable version of your code, to fix the dtypes.
That's what we've done here: we've specified the dtypes for all the columns, and whenever we load the data with that function, which is why I've provided that function, we make sure that we always get the same dtypes. That way we're sure nothing will break because some data was inferred as a different type than it was before.
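A sketch of that pattern, assuming the data is read from CSV; the column names and dtypes here are illustrative, not the full list from the repository:

```python
import pandas as pd

# Pin the schema once so every load produces identical dtypes
DTYPES = {
    "goal": "float64",
    "static_usd_rate": "float64",
    "country": "object",
    "category": "object",
}


def load_dataframe(path):
    # Hypothetical loader: read_csv with explicit dtypes instead of letting pandas infer them
    return pd.read_csv(path, dtype=DTYPES)
```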
It's important to have a requirements.txt as well: since we're serializing our model and expecting it to keep working in the future, we want to be able to replicate exactly the same environment over and over again, and a requirements.txt with the actual versions of the libraries will help you do that.
If you're making incremental changes, make sure you're checking the cross-validation score from the tuning step, and not the score on the test set, because that would obviously be massive overfitting.
So do all your feature engineering
using the tuning cross-validation score. And the really cool thing is that this is quite modular, so you can easily write tests.
So you really should; it's actually
quite straightforward
to take all those transformers in isolation and build tests for them, and that's what the next section is about. With all of that, hopefully
updating your model should be less stressful.
Just to finish, we have a booth somewhere tomorrow.
So, five minutes for questions if you have any. Yep... oh yeah, you could use numpy.
So the question was: does it work only with pandas, or with numpy as well? All of the transformers from scikit-learn accept either a numpy array or a pandas DataFrame, but it turns out that in the transformers I've built I'm using some pandas methods,
so my own code will not work with numpy. Oh yeah,
and you need to specify the columns there.
Yes, so this one is not going to work with numpy, because it relies on you specifying which column it should use, but the rest should, yeah. Any more questions? Yeah.
So, with pickle or the other one, what was it called... yep. So you mentioned a couple of limitations:
is there any other way to store a scikit-learn model? The model, yeah. So you say you want to
train a model and then pass it to someone else.
Here, that's actually the recommended way: in the scikit-learn documentation they don't have a custom way of doing it,
they're using pickle and joblib, which are just generic Python tools. If you wanted to store just the data, then pandas has binary formats you could use, but for the model itself, that's the approach.
Thank you.
Everyone good morning welcome to this talk.
Just before we get started, I've added a short link to the content here.
I'Ll show you again later when, when we actually need it and when we're going to start working on the exercises, I see a few people with a laptop, so you might have to work with your neighbors I'll figure out something together so yeah the talk today, maintainable Coding in data sense, so we'll try to write data sense code, a bit differently than just just everything in a notebook and a list of a list of comments there.
So I hope you all had some good coffee.
I appreciate that it's quite early in the morning.
So we'll start with a little bit of philosophy and try to define well what do we even mean by maintainable code? Does someone want to start with the definition? Nobody so I'll give mine directly it's early, so my definition would be just just this really really simple by maintainable code.
The main thing that we that we expect is that it will be easy to modify what code has to be modified all the time.
So maintainable code is the code you you hope, can be easily modified.
That can mean all sort of things and that's why we all that's.
Why we'll see a bit later right, so one of their many many principles that exist in the software development world to to help with defining what what is maintainable code? One of them is the single responsibility principle that just states that any any function or class that you're writing should really try to try to do.
One thing shouldn't you: shouldn't: have a good function of good class that is doing all sort of multiple things and everything at once.
Instead, it should really it should really be focusing on one thing and do it really well and that allows to do all sort of all sort of nice things with it later on.
So that's really what we all try to keep in mind in this talk, but then in data science, it's a little bit more complicated like there are a lot of other constraints that appear there because of the nature of what we're trying to achieve, and that's also What what we'll discuss here so in order to do that, I'd like to first go through the domain domain stages of data science project, so you'll have at some point to process some data to engineer some new features from from existing existing data that you've got available.
You will have to tune some hyper parameters of your model to try to find the best in the best values.
You will have to train your model on some data that you have already pre processed and after you will have to generate predictions on our new data.
So, let's go through all those steps and see what what do we care about? What do we need to care about to make to make those maintainable so for the first one process, some data where we will care about is how easy it is to modify to modify those pre-processing steps.
So maybe you want to add some steps.
Maybe you want to to add scaling scaling step to your pre-processing pipeline.
Maybe you want to remove some.
Maybe you want to modify some.
Maybe you'll have to to change a little bit.
The way you get categorical features out of some complex, complex data.
So all of those you should you should be able to do it as easily as possible.
If your code is maintainable, then tune hyperparameters, so your data will change every time.
You will change your features and you will you will change your pre-processing steps.
You will have to retune the whole thing, so you would hope that your your way of tuning the hyper parameters and finding the finding the best parameters for your algorithm is quite easy to run, and you can rerun it every time.
Anything anything changes to get the best parameters directly and then retrain your model same thing for training your model.
Every time you will change, you will change something in the data or even if you don't change something yourself, the data itself will change over time.
So you will have to you will need a good process to retrain, retrain it constantly and finally, which I think is the most interesting one generating predictions on new data.
So here you will have to do all the same steps that you've done in your pre-processing stage.
On the training, data you'll have to apply the exact same transformations, but ideally without repeating any code, because if you start repeating code, then it becomes much less maintainable.
You have to modify things in two different places at the same time.
So how? How do you do that? That'S that's something we'll see we'll see later in industry, so actually right now, so the first.
The first thing I'd like to talk about is how how do you make your you feature: engineering data processing stage, more maintainable.
So again, let's, let's focus a little bit on on what we what we need to do here so for pre-processing.
You will need first to load some data.
You will need to split into some training data, some test data.
You will probably need to drop some features.
You don't need you don't need everything there.
You will need to clean some of them so remove the null values.
Maybe you'll need to to remove some outliers as well all sort of operations like this.
You need to engineer some features, so that's probably where all the creative part of the data scientist job comes in where you take your old features and you figure out well, how can I convert this? This data, to have something that is, is more meaningful for my algorithm.
You will probably have to apply one hot encoding on the catechol features unless you use an algorithm.
That is that support, categorical features and that's usually one of the big problems, and you also have to sometimes, depending on the algorithm you use scale and transform the numerical features that you've got.
That'S quite a few.
Quite a few things you can do.
You can do on your data, so, let's see, let's take a closer look, maybe at number six.
So how would you apply all those steps on both the training set and the test set? How do you make sure you don't duplicate code here? The first idea that would come to mind would be to just create a reusable function, apply that function on the training set, and apply that function on the test set. Yeah, that's good, but what if that transformation needs to remember a state, which is the case for one-hot encoding, for example, for scaling, and those kinds of operations that learn something from the training data? Just to make sure we're on the same page, I've got an example with one-hot encoding to make clear what I mean by that. Say you've got this training data with some countries for multiple people, and then you apply one-hot encoding on it. That's what you get out of it: a country_UK column, a country_Italy column, and then you have ones and zeros there. Now, if you want to do the same thing on the test data and you just apply one-hot encoding again — let's say we've got this data in the test set, and sometimes you'll have just one single row that you want to apply this transformation on — that's the output you would get. It would learn again from scratch the new columns it needs to build, and build them in the order they appear in the test data, so there is a big mismatch. That's something you shouldn't do, and it's going to break your algorithm. So that's a good example of when you need a pre-processing step that memorizes some transformation to be applied later. What we would have liked instead is the same columns as what has been learnt on the training data, mapped onto the new data set. Here we see that for the first row, France, which wasn't in the training data, we've got a row of zeros, which is what we want. Great.
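Roughly, what's being described looks like this minimal sketch, with made-up country data; fitting the encoder on the training set and only transforming the test set keeps the columns consistent.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"country": ["UK", "Italy", "UK"]})
test = pd.DataFrame({"country": ["France", "Italy"]})

# handle_unknown="ignore" gives a row of zeros for categories unseen in training
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)                         # learns the columns from the training data only
print(encoder.transform(test).toarray())   # same columns, same order; "France" becomes all zeros
```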
So in that case, we will somehow need a class that is able to save the state, and that's what the scikit-learn transformers allow you to do; they facilitate all of that process. That will really be our main focus for this tutorial: looking at all those tools provided by scikit-learn that let you write code in a more modular way and overcome all those issues. So, some transformers that you have surely used in the past: the LabelEncoder, the famous OneHotEncoder, which we've talked about already and which I suppose everyone has used, and the StandardScaler as well, to do a scaling transformation.
All of those have the same API, the same interface. They have a fit method that just learns a transformation and saves it in their internal state. So for a StandardScaler, it learns the weights of the scaling it has to do, and then whenever you call transform it doesn't change anything in its internal state; it just applies the same transformation it has learned before on any new data. That's the interface we will reuse over and over again. Some other examples of transformers: PCA is actually implemented as a transformer in scikit-learn. It has a .fit method that learns the linear combination to go from your original features to the principal components, and then when you call .transform it doesn't learn anything again; it just applies that transformation.
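As a quick illustration of that shared fit/transform interface, here is a small sketch on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 4)
X_new = np.random.rand(5, 4)

scaler = StandardScaler()
scaler.fit(X_train)                        # learns per-column statistics from the training data
X_new_scaled = scaler.transform(X_new)     # applies exactly those statistics to any new data

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)   # fit_transform = fit then transform in one call
X_new_pca = pca.transform(X_new)           # reuses the components learned on the training set
```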
Another one, maybe less known, is the FeatureUnion, which is quite useful for applying several transformers on some numerical data and getting back a new array with all the outputs. For example, a FeatureUnion of PCA and polynomial features will look like this: I've got a data frame with some values, I apply a FeatureUnion of PCA and PolynomialFeatures, and I get a new data frame with a lot of new columns generated for me. Some of them will be the output of the PCA — here I keep PCA with n_components equal to two, so I keep the two principal components — and then the PolynomialFeatures will just generate all the polynomial values for each of the columns in my previous data set. That's interesting because I've used this transformer that takes a list of other transformers as input and is itself a transformer that can generate this new data. So here we're starting to combine things together, and that's what we'll do more and more later in this tutorial: now I'm able to generate the two first components of PCA and the polynomial combinations of my features all at once.
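A minimal sketch of that combination, with a made-up two-column data frame, could look like this:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 9.0, 8.0, 7.0]})

union = FeatureUnion([
    ("pca", PCA(n_components=2)),           # two principal components
    ("poly", PolynomialFeatures(degree=2)),  # 1, a, b, a^2, a*b, b^2
])

# FeatureUnion is itself a transformer: fit learns the state of every sub-transformer,
# transform concatenates their outputs column-wise.
features = union.fit_transform(df)
print(features.shape)  # (4, 2 + 6)
```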
Then you can also build your own transformer objects. To do that, you will need to inherit from two classes that are implemented for you in scikit-learn: TransformerMixin and BaseEstimator. Those just contain some hidden methods that scikit-learn needs to work with your class, and then all you have to do is implement the fit method and the transform method. Then you can use this class everywhere you want as a transformer object, so you could combine it with the FeatureUnion like we've done here, or with all the other things we will see later. If you need hyperparameters, and that will become really handy later, you can just add an __init__ method, pass it some parameters, and use those as hyperparameters. So here's an example: I've got this custom scaler that, as I said earlier, inherits from TransformerMixin and BaseEstimator. It has a fit method that is called on some training data, where it computes the median and the interquartile range of my training data and memorizes them with the self keyword. Then, whenever I call transform, it doesn't need to learn anything anymore; it can take any data with the same number of columns as my training data and will just apply that transformation, which is subtracting the median and dividing by the interquartile range. So it's super easy to build your own transformations like this.
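A sketch of that custom median/IQR scaler, with guessed class and attribute names rather than the slide's exact code, might look like this:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # memorize the state learned from the training data
        self.median_ = np.median(X, axis=0)
        self.iqr_ = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
        return self  # fit must always return self

    def transform(self, X):
        # apply the memorized transformation to any new data with the same columns
        return (X - self.median_) / self.iqr_
```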
They will memorize whatever state you need, and in the exercise that we have coming soon we'll do that together with some transformations that are really common to do on data. Actually, let's do the exercises right now. If you go to that URL, it will bring you to the GitHub repository where you've got all of that, including an exercise. Once you've cloned that repository, you can open the exercise notebook, which is where all the instructions are. That's not where we're going to write the code; it's where the instructions are. Okay, I'll give you a minute to do that.
There are multiple things I need to show here. When you open the notebook, you'll have instructions. The first thing we do here is set this notebook magic, which I don't know if everyone has used before, but it's a really useful trick if you want to develop code in an editor, in another place, and still load it in the notebook. If I run those two commands, they set the autoreload mode: whatever function I import from a file that I'm writing, if I modify the file itself, the notebook automatically reloads the latest version that I've got. So I don't need to restart the notebook every time I make a modification in my file; that allows me to live-update the file in my editor and test and prototype it in the notebook, so you get the best of both worlds.
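The two commands being referred to are most likely IPython's autoreload magics, run in a notebook cell:

```python
%load_ext autoreload
%autoreload 2   # re-import every module automatically before executing each cell
```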
So if I run that — we provide a function that loads the data set — oops, that doesn't work. You need to install joblib: pip install joblib; we'll talk about this library later. Still not working... anyway, I'll just uncomment it. In the meantime, while I'm solving that, I can at least explain what the exercise is. What I'm trying to load is the data set, and here I'm just zooming in on the error I'm getting. Actually, I can just load it, yeah. So that's what should have loaded with the command below, and if you have joblib and scikit-learn installed properly, it should work for all of you. That's the data I'm trying to load: the Kickstarter data set. It's quite a common one in machine learning competitions and things like that, and it contains some information about Kickstarter campaigns.
A Kickstarter campaign can be launched on the website at a certain time with certain attributes, so here you've got information like the description, the name of the campaign they are trying to launch, the goal — they are trying to raise money, so how much money they are trying to raise — which country they're trying to raise it in, what currency, and you also have the currency exchange rate, so you can actually convert from one currency to another. You've also got information about the location, the category of the campaign, and all sorts of information like this, and what we're trying to predict is whether that campaign will be successful or not. We've got that information in the other file, y_train. So, does that command run for everyone else but me? Great, so it's just me then. Perfect.
So that's the first thing I wanted to explain, and the second thing is that we're providing a first skeleton for our model. What we've decided to do is divide our model into two main files. You'll have transformers.py, which is the place where you can define all your transformer objects: all the pre-processing steps that we want to do, we save in this transformers.py as classes with the .fit and .transform methods. We will implement all of those together. You will see the first one, which is the categories extractor, which is supposed to extract a category from the JSON, this one as well... all of them, alright. So this class here is the transformer that will allow you to extract some JSON contained in the category field; we'll take a closer look later. You've got this one, a goal adjuster: once we've implemented it, it will allow you to convert all the goals that are given in anything other than dollars into dollars, using the exchange rate. You also have this other transformer that we'll implement, which will just load some dates from the data and compute the difference between two dates, to see how many days there were between, say, the launch of the campaign and whatever deadline they're working towards. And this one is a transformer that allows you to extract the country from the data set and map it to a larger area, so you can say, for example, we'll group together all the countries in Europe. Those are the transformers we will implement, and there is a model.py, which is where you'll put all the things together. Here we've got that load-data-set function that wasn't working for me, which I'll fix in a second, then you've got a build_model function that we'll explain a bit more later, and other functions that work as entry points to tune and train your model, etc.
So that's for the general architecture of our model; now we'll focus on the transformers only. If you take maybe ten minutes, that should be enough to go through part one of this notebook, which explains how to implement all those custom transformers that I've talked about, the categories extractor first. As you can see, in the notebook we only have code to test your implementation; the implementation itself is done in the file. So open transformers.py, edit it on one side, and then go to the notebook to test your implementation; it should all work perfectly fine. Take five to ten minutes to do that... Perfect, all right, hopefully you have it working. So remember as well the main structure to build your own transformer. Something important here is that your fit method, even if you don't want to do anything in the fit stage, which is the case for the first few transformers we're building, needs to return self. That's just what scikit-learn expects; if you don't have it return self, you'll get some nasty bugs.
So let's try to do that: the categories extractor. As I explained earlier, this one is meant to read that field called category and extract a certain category out of it. First, let me actually show that field a bit more. If I do X_train.category, you can see what it looks like: it's a string in JSON format. Let me just grab one of them... great. So if you take a look at that, it has this field called slug, which is the only one we care about here. The first part of the slug field will be our generic category for the campaign, and the second one will be a more precise category. So out of this specific column, we want to extract those two fields; let's implement a transformer for that.
Here we are already providing two methods. The first one, the __init__, is actually defining a hyperparameter called use_all; it's just a trick I added that allows you to decide whether you want to extract all the categories, or only the categories that you've hard-coded initially. The idea is that there might be many random categories in all those JSON fields, so I've pre-selected the main ones that I was interested in and added a default for those that I'm not interested in. If you look at the helper method that I've written here, it basically loads this string as JSON into a dictionary, gets the slug field, and then gets the two different values as a tuple. Then I'm just saying that if you're not specifying the use_all parameter, you filter: if the first category is not in your list of categories you care about, you return a default, and if your second category is not in your list of things you care about, you return a default. It's just a way to make sure that we don't have too many dummy features later on.
So now we can implement the two methods that we need. The fit method here: nothing to do, we don't need to learn any internal state, we just extract information from the JSON, so I can just do return self directly. For the transform method, I get the data frame and I need the category column, so I'll do category = X['category'], which is easier to work with, and then I need to return a new data frame out of that. So I'll do return pd.DataFrame, and this data frame needs a column called gen_cat for the generic category; that will be my category column to which I apply a lambda that takes x and returns the first value of the slug, so it basically applies the helper method on each of the JSON strings and gets the first value. That should work. If I run that, the first column works fine, so I can create precise_cat, which is the exact same thing but returning the second value instead, and if I run that now I get my two columns. So that's pretty powerful: I've got a nice class that is easy to maintain, I can add more functionality there if I want later, and I can change the way I compute those columns fairly easily by just modifying my transform method. It's quite easy to maintain, and as far as my top-level code is concerned, I'm just calling this extractor always the same way, because it always has the same interface.
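Reconstructed from the description above, a sketch of that extractor could look like the following; the helper name, the slug parsing, and the pre-selected category list are guesses, not the original solution code.

```python
import json
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

DEFAULT = "other"
KNOWN_CATEGORIES = ["music", "film & video", "games"]  # hypothetical pre-selected list

class CategoriesExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, use_all=False):
        self.use_all = use_all  # hyperparameter: keep every category or only the known ones

    def _get_slug(self, raw):
        # slug looks like "generic/precise"
        gen, precise = json.loads(raw)["slug"].split("/")
        if not self.use_all:
            gen = gen if gen in KNOWN_CATEGORIES else DEFAULT
            precise = precise if precise in KNOWN_CATEGORIES else DEFAULT
        return gen, precise

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        category = X["category"]
        return pd.DataFrame({
            "gen_cat": category.apply(lambda x: self._get_slug(x)[0]),
            "precise_cat": category.apply(lambda x: self._get_slug(x)[1]),
        })
```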
Then the goal adjuster; that should be fairly easy, it's the same thing. I don't need to do anything in fit, so I can do return self. For transform, I just need to return a data frame that has an adjusted goal, and the adjusted goal will be... where is it... the goal, so that will be X.goal times X.static_usd_rate. Right, that should do; if I run that, again I get what I was expecting. So I think you get the point with those.
The time transformer actually does a bit more work, so I'm just going to copy-paste the others from the solution directly. For the time transformer, we load all our dates into datetime objects; you see that we're actually multiplying the timestamp by a constant value because they were not given in the right unit. That gives you three timestamp objects, the deadline, the created date and the launch date, and then we can start creating a bit more creative features. That's some nice feature engineering, where we say that what we really care about is how much time there was between when the campaign was launched and the deadline it had, which we can easily compute as a number of days, and the same thing for how much time was spent between when the page was created and the launch time of the campaign. So let's look at this one: it returns two nice columns with that information. And then the last one, if you wanted to group together some countries: here I've created a map where I'm saying all those countries should go into the following groups, and that's a fairly simple one where I'm just calling the pandas .map method to map all of those together. So if I call the country transformer and do .sample on its output, you've got all the country groups there.
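Sketches of those three transformers, following the descriptions above; the class names, the timestamp unit, and the country grouping are illustrative rather than the original solution code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GoalAdjuster(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # convert every goal into dollars using the exchange rate column
        return pd.DataFrame({"adjusted_goal": X["goal"] * X["static_usd_rate"]})

class TimeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # assume timestamps in seconds; unit="s" plays the role of the constant factor
        deadline = pd.to_datetime(X["deadline"], unit="s")
        created = pd.to_datetime(X["created_at"], unit="s")
        launched = pd.to_datetime(X["launched_at"], unit="s")
        return pd.DataFrame({
            "launch_to_deadline": (deadline - launched).dt.days,
            "created_to_launch": (launched - created).dt.days,
        })

COUNTRY_GROUPS = {"GB": "europe", "DE": "europe", "FR": "europe", "US": "north_america"}

class CountryTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame({"country_group": X["country"].map(COUNTRY_GROUPS).fillna("other")})
```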
So that's it for custom transformers. Next, we will look at a slightly more advanced thing. One really cool thing that was added recently, in scikit-learn 0.20, is the ColumnTransformer. How many of you have used it before already? Okay. It's really useful: it allows you to map different transformers to the columns that you're interested in. That lets you say that this specific transformation should be applied to all my numerical columns, and this specific transformation should be applied to all my categorical columns, so you can start creating rules about which transformation should be applied to which part of your data frame. Here's an example: we've defined some numerical columns, say the age and the salary, some categorical columns, the country and the gender, for example, and we create a new object, a ColumnTransformer. It takes a list of tuples, and each tuple has three elements: the first one is the name you want to give it, the second one is the actual transformer, and the third one is the list of columns to apply it to. Here I'm saying I want to apply PCA not to my whole data frame but only to those numerical columns, and I want to apply one-hot encoding only to those categorical columns. That's really powerful if you combine it with what we've seen before: you can build transformations and then decide to put them all together in one single object that maps them to different parts of your data frame.
Using all of that, you're able to build a master preprocessor object that does all the transformations at once and memorizes all the state that it needs. Here's a more visual example, doing the same thing as I explained: we apply PCA to the numerical columns only, which gives us a new data frame, and we apply one-hot encoding to the categorical columns only. The ColumnTransformer then just works like a normal transformer: once you've created that object, you can call .fit on some training data and it will learn the internal state for all the transformations that you've provided inside it; then .transform will apply that to any new data. Of course you can do it on your training data, but you can also do it on some test data that you want to process, or later on, on some new data that you need to generate predictions for on the fly. It's a really nice tool.
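A minimal sketch of that ColumnTransformer, reusing the illustrative column names from the slides:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

numerical = ["age", "salary"]
categorical = ["country", "gender"]

preprocessor = ColumnTransformer([
    # (name, transformer, columns to apply it to)
    ("pca", PCA(n_components=2), numerical),
    ("ohe", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Works like any other transformer: fit learns the state of every sub-transformer
# on the training data, transform applies it to new data.
# preprocessor.fit(X_train)
# X_new_processed = preprocessor.transform(X_new)
```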
Another one is the Pipeline, which is older, and which allows you to do sequential operations rather than things in parallel like the ColumnTransformer does. For example, you might want to scale some data first and then apply some PCA transformation to it. For a pipeline, you again call the Pipeline object and pass a list of tuples where you define what it is you want to do: here I want a step called scaler that scales my data, and then a PCA step. All those transformations will be done one after another on all the columns that you've got available in your data, so you fit on the data, then transform; that's how it looks. If you combine those two, you can already anticipate that you can start building pretty complex pipelines: the Pipeline does sequential operations, and the ColumnTransformer then allows you to map them to specific columns only, so they can be used together. Here's an example of that, in two steps. First we create a pipeline object that we want to use only for our numerical features; that pipeline scales the data and then applies some PCA transformation to it. That's our first transformer, and then we build a top-level ColumnTransformer object that applies this specific transformation we've created to the numerical features only, and for the categorical features it applies one-hot encoding. So you can start building more complex structures.
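A sketch of that nested structure, again with the illustrative column names from before:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_pipeline = Pipeline([
    ("scaler", StandardScaler()),   # step 1: scale
    ("pca", PCA(n_components=2)),   # step 2: PCA on the scaled data
])

preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "gender"]),
])
```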
The great thing here is really this second point: you're making everything really modular. Every sort of operation, every piece of functionality of your code, becomes a separate class, a specific transformer that does one specific thing really well, which you can test later, and then you're just building pipelines by combining them together. That allows you to keep your code quite clean. The only thing we're missing so far is the predictive model part of our code. Well, actually, that's not completely true, because the Pipeline can be used for transformers, but it can also be combined with a predictive model. If you use as the last step of your pipeline not a transformer but an actual model that has a .predict method, the pipeline will automatically inherit that predict method and be able to generate predictions on your data directly. That means you can do something like this: you'll have your preprocessor that we've mentioned before, which here is a ColumnTransformer object that does something to the numerical columns and something to the categorical columns, and we put that inside another pipeline that has a preprocessor stage — that's your first level — and then a second level, which is whatever algorithm you want. Here we've chosen a decision tree. The great thing about that is that your model isn't just an algorithm anymore, and your hyperparameters are not just on your model anymore. It's not just about choosing the best parameters for your decision tree; it's about choosing the best parameters for all your pre-processing steps together with your final algorithm, the decision tree here. Maybe changing the number of components that you keep in your PCA will affect the best parameters to choose for your decision tree, so it's natural to have all of those tuned together and trained together. This lets you represent a model not just as an algorithm, but as a bunch of pre-processing steps plus an algorithm at the end. That's also really good for avoiding data leakage, because all the steps run on whatever batch of data you provide: if you're doing cross-validation, all the pre-processing steps will be done only on the single batch that you're providing, instead of potentially leaking information from other batches, as would happen if you did the pre-processing before the cross-validation. And you can easily apply that pipeline to any new data once you've managed to train it; it's just a matter of calling .predict and it will go through all the pre-processing steps plus the predictive part. Finally, it's more modular, so it will be much easier to test and maintain. We've got an exercise for this as well, which should be fairly short.
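A sketch of a full model along those lines, reusing the preprocessor from the previous sketch: the whole pipeline is cross-validated as one unit, so the preprocessing is refit on every training fold and nothing leaks from the validation fold.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

model = Pipeline([
    ("preprocessor", preprocessor),        # the ColumnTransformer built above
    ("model", DecisionTreeClassifier()),
])

# Every fold refits the whole pipeline, preprocessing included:
# scores = cross_val_score(model, X_train, y_train, cv=3)

# After fitting, predictions on raw new data go through the same steps automatically:
# model.fit(X_train, y_train)
# y_pred = model.predict(X_new)
```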
So maybe let's take five minutes for you to take a look. It will be in the model.py file, where we've defined the functions that put all our transformer objects together. The first one, the most important one, is the function that builds the model. There are two first steps that we're providing. It creates a pipeline for the categories that we've extracted with that categories extractor object: it first calls the transformer and then calls the one-hot encoder to create dummies out of that. We also have another pipeline for the country, the country transformer; same thing, it's just creating dummies out of the categorical features. Those are the steps that are provided. Then the exercise is to create a ColumnTransformer object that has all the steps we want: it will have the category processor, it will have the country processor, and it needs to have the time transformer and the goal adjuster as well. So, take a look for the next few minutes... All right.
So let's do that part together now. The first step is to create your main ColumnTransformer object, which will basically put everything together at this stage, so we'll just create a new ColumnTransformer... have I even imported it? Yep. So I'll do preprocessor = ColumnTransformer, and here I need a list of tuples. The first one will be for my categories, so I will just apply the categories processor, and I've got a list of columns here; this one is just the category column, because my categories are extracted from only the category column, so if I give this transformer object only the category column, that should be enough. Then I will need the countries: that will be my country processor, and I just need the country column there. What else? I've got the goal adjuster, which needs the goal column and the static_usd_rate column. So I'll call that... how does the solution call it? Goal. That will be the goal adjuster, and it will need the goal and static_usd_rate columns. And finally, I want the time transformer as well, so I'll call that one time, the time transformer, and it will need the date columns: deadline, created_at and launched_at. Right... oops, still missing one thing. So that's for the preprocessor object; now I've got my pre-processing stage, and I need to put it together with some predictive model. I'll just call it model: that will be a Pipeline where I basically have two stages. Preprocessor will be my preprocessor that I've defined just above, that's my first stage, and then my second stage will just be some algorithm. Let's call it model, and I think I've already imported the decision tree somewhere, yeah. So let's just do a DecisionTreeClassifier with default parameters for now, because we're going to tune it anyway. Did I do something wrong? Is it going to work? It doesn't work... oh yeah, I need to pass everything as a list here. Okay, right, and now it's working. That's pretty cool, because our whole model can now be built with this single function: we can return a new instance of our model, including all the steps it needs, with one single function. It's quite easy to change the pipelines if I want to: the last step creates a preprocessor with all the features that I need to build, and the very last one puts it together with some algorithm. From a top-level point of view, I just need to instantiate a new model by calling that function, I can fit it on some data, and then I call predict, and it works just like any algorithm, even though it has all the pre-processing steps included in it.
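Putting the pieces together, a build_model function along the lines just assembled could look like this; it reuses the hypothetical transformer classes sketched earlier, and the step names and column lists follow the talk but may not match the original solution exactly.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

def build_model():
    cat_processor = Pipeline([
        ("extract", CategoriesExtractor()),              # JSON -> gen_cat / precise_cat
        ("ohe", OneHotEncoder(handle_unknown="ignore")),  # then dummies
    ])
    country_processor = Pipeline([
        ("extract", CountryTransformer()),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer([
        ("categories", cat_processor, ["category"]),
        ("countries", country_processor, ["country"]),
        ("goal", GoalAdjuster(), ["goal", "static_usd_rate"]),
        ("time", TimeTransformer(), ["deadline", "created_at", "launched_at"]),
    ])
    return Pipeline([
        ("preprocessor", preprocessor),
        ("model", DecisionTreeClassifier()),
    ])
```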
Next, let's go back to the slides. Now we've got about everything we need to build the model, and we're ready to start thinking about training it, tuning it, and all of that. The first thing we need to think about is that training our model is something that can take a while, and when we're generating predictions, we need somehow to be able to load a model that we've trained previously. So we will want to be able to save that trained model to disk and load it later to use it; say you want to put it in production just to generate predictions, you don't want to retrain it every time. That's what model persistence means. There are two main ways of doing it in Python for serializing objects: you have pickle, which is quite well known, and joblib, which is a bit less known. They are pretty much equivalent in terms of performance; in their documentation, joblib say they tend to be more efficient with larger arrays, so we might prefer that for working with pandas and NumPy objects. Something cool with joblib is that the interface is just a bit easier to use.
This is how you pickle your model using the pickle library: you'd have to open the file in binary mode, then dump it, and then open the file again in binary mode and load it. If you want to do that with joblib, it's just a bit easier, because it abstracts away opening the file in binary mode, so you can just do joblib.dump. You've got your model, so that's your single trained pipeline object, which you've guaranteed has everything you need in it; you specify your file, and it will just dump the whole object to that file. Then later on, if you want to load it, it's really easy: you just call .load and pass the path that you want to load from.
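A quick side-by-side of the two approaches, assuming `model` is the trained pipeline and with an illustrative file name:

```python
import pickle
import joblib

# pickle: open the file in binary mode yourself
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# joblib: the file handling is abstracted away
joblib.dump(model, "model.joblib")
model = joblib.load("model.joblib")
```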
Just keep in mind that it has some limitations. Pickle and joblib will only serialize the specific object that you're specifying, which is actually why it's important to have one single object, but they do not save the dependencies. You will still need the code it relies on to run, meaning all the libraries it calls but also your custom classes; you will need those files to be able to run it. Also, the data is not saved, so if you need to retrain your model, you also need to make sure that you have snapshots of your data. Another thing to keep in mind when you're using these libraries is that versions are really important. Since loading relies on whatever libraries you've got installed, if the model has been saved with different versions you can often have problems. So it's really important that you keep track of what versions you pickled your model with, and make sure that whatever environment you load the pickle in has the exact same versions and dependencies installed. The last comment here is that you can pickle any object in Python, so make sure you don't load a pickle that has been provided by some place you don't trust, because it could contain anything, including a script that messes everything up; keep that in mind. So we've got an exercise for that, where we are going to implement the last pieces.
We only need three things now to be able to run our model on the command line: we need a method to tune it, a method to train it, and a method to test it. Let's see, yeah, we've got time, so let's take another five minutes to look at that. It should be fairly easy. For tuning, for example, you will need to call the function that builds a new instance of the model, which gives you a new instance; then you can use grid search directly on it, since GridSearchCV supports pipeline objects. What we have is your grid parameters defined in a config.py file, so you can uncomment those or just come up with your own parameters that you want to tune. Here, for example, we're saying that we want to change the max depth of the model stage of the pipeline, so the max depth of our decision tree; we want to tune the min_samples_split of that decision tree as well, so it's model, double underscore, then the name of the attribute in the decision tree. And then we also want to tune some of the parameters of the preprocessor object; on our preprocessor object we can change use_all from false to true. So those are a few examples.
Actually, I can just give you a tip: if you do model.get_params(), that gives you all the parameters that are in your pipeline object, in your model. That tells you, let me zoom in a bit more, all the steps that you've got there; it's quite a big one. If you call .keys() on that, it tells you all the parameters that you can tune. That's where you see, where is it, all the parameters that are on your model, like class_weight; all of those are defined on my decision tree, so I can tune all of those, but I can also tune everything that is defined on my preprocessor, and that's the one I've chosen to tune here. So you can access all those parameters and provide them to the grid search.
Actually, it's maybe better if I just do it straight away, so let's do it together. The first step for tuning the model is to load the data, so X_train, y_train = load_dataset... no, it's okay, I'll just copy that part from here: X_train, y_train. Okay, so first I load the training data. Then I'm instantiating a new model, so I'll do build_model, and then I can call the grid search. Have I imported the grid search? Yes, I have it there, so I can define a new grid search object here: gs = GridSearchCV, I pass my model directly, and I pass the parameters that I've defined in config, so I'll first uncomment them and import those grid params. Okay, here I'll do cv equal, say, three, and n_jobs equal minus one. Then I do gs.fit on X_train, y_train, and I can just print the best score. You don't necessarily want to just print it; maybe you want to actually get those parameters and automatically retrain your model with them, but to keep things simple I just do it like this: best_score_. Let's see if that works. If I call that... oh, I need X_train and y_train as well... right, it's now tuning, and here it's telling me the best score. I can also add something to tell me the best parameters that I've got, not in my algorithm only but in my whole pipeline. It's telling me that I should actually use a max depth of 9 and those parameters for the decision tree, but it's also answering the question I had at the beginning of this tutorial, which was: should I use all the categories that I get from the category column, or should I use only a subset of them? Well, it is telling me that I should use them all, with those parameters for the tree. So that's quite nice; it allows you to add some extra parameters to tune in there.
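A sketch of that tune_model entry point; load_dataset and build_model are the skeleton's assumed helpers, the grid values are illustrative, and the nested parameter names follow the step names used in the earlier build_model sketch.

```python
from sklearn.model_selection import GridSearchCV

GRID_PARAMS = {
    # "<step name>__<sub step>__<parameter>" reaches into the pipeline
    "model__max_depth": [3, 5, 9],
    "model__min_samples_split": [2, 10],
    "preprocessor__categories__extract__use_all": [False, True],
}

def tune_model():
    X_train, y_train = load_dataset("train")   # hypothetical loader from the skeleton
    model = build_model()
    gs = GridSearchCV(model, GRID_PARAMS, cv=3, n_jobs=-1)
    gs.fit(X_train, y_train)
    print("best score:", gs.best_score_)
    print("best params:", gs.best_params_)
```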
We'll do the same for training. For training we'll need those first two lines, then we'll do model.fit(X_train, y_train), so that's quite simple, and then the last step is to save it to a file. We'll do joblib.dump; we have just one single object to save, which is our model, and let's give it a name, model.joblib. I do not have a model.joblib file yet, so I can check whether it worked. If I run that... oops, still did not import joblib, no good, I commented it out, creating traps for myself... and it's trained, and you can see that it has saved it here. It's a binary file, so I can't open it, but it has been saved to disk, so I'm able to load it later, and actually that's the next exercise: loading that and generating predictions.
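The train_model entry point described here could be sketched like this, under the same assumed helpers and an illustrative file name:

```python
import joblib

def train_model():
    X_train, y_train = load_dataset("train")   # hypothetical loader from the skeleton
    model = build_model()
    model.fit(X_train, y_train)
    joblib.dump(model, "model.joblib")          # one single object: the whole fitted pipeline
```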
So here, test_model: I'm just doing joblib.load, and that will be my model. Then I need to load some data as well, so I'll use that here, but it's not X_train, it's X_test, y_test, and I need to import that as well at the top. Once I've got that, I can easily compute an accuracy score, a classification report, anything I want. I just do y_pred = model.predict(X_test), and then I can print the accuracy, formatted from y_test and y_pred. You can try that... here, that's the wrong one... here it is, right. So now I'm also able to easily test my model, and I can add... nope, okay, right, and now I've got the classification report as well. That's quite nice.
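Put together, a test_model entry point along these lines could look like the following, again with the assumed loader and file name:

```python
import joblib
from sklearn.metrics import accuracy_score, classification_report

def test_model():
    model = joblib.load("model.joblib")
    X_test, y_test = load_dataset("test")      # hypothetical loader from the skeleton
    y_pred = model.predict(X_test)
    print("Accuracy: {}".format(accuracy_score(y_test, y_pred)))
    print(classification_report(y_test, y_pred))
```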
Now I've got pretty much all the code I need, properly modular and separated, so I can easily make modifications. I can modify the existing transformers that I've got in this file by changing the way I compute the category, or anything else here; I just need to focus on the transform method inside my class. If I want to add an extra step, I just create a new transformer, test it individually to make sure it works fine, and then in model.py I just need to add it in build_model as an extra step, so either I create a pipeline object or I add it directly to my top-level ColumnTransformer. If I want to change the algorithm I'm using, that's quite easy to do here as well, and then I'm just modifying the parameters that I want to tune so they correspond to both my algorithm and my steps, and I can do all of that together here. We're also providing a command line tool on top of that, because it's easier to run it directly from the command line instead of having to start a shell and call the functions yourself.
So here we've just defined a run.py that basically imports our three main entry points, so tune_model, train_model and test_model, and then just uses the argparse library in order to have those as arguments on the command line.
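A sketch of what such a run.py might look like; the sub-command names match the talk, while the argparse wiring itself is an assumption.

```python
import argparse

from model import tune_model, train_model, test_model  # hypothetical module from the skeleton

def main():
    parser = argparse.ArgumentParser(description="Entry points for the Kickstarter model")
    parser.add_argument("command", choices=["tune", "train", "test"])
    args = parser.parse_args()
    {"tune": tune_model, "train": train_model, "test": test_model}[args.command]()

if __name__ == "__main__":
    main()
```

With that wiring, `python run.py tune` runs the tuning step, and the same goes for train and test.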
So if I test that now — did you see the exercise? Yes, I'll zoom in here — if I do python run.py tune, it will tune my model, as it says. So you can retune it easily: every time you change the steps in your code, you actually don't have to go through the notebook anymore; you can just run it here and see how that improved your cross-validation score and what the best parameters are that you should use when you train your final model. I can also do train if I want to train my model again and save it as a joblib-serialized file, and then whenever I run the test, I can see the final value of my score, the classification report, anything I'm interested in, really. We could implement another entry point for predict, which would load the serialized model and generate predictions on any new data that you provide; that would be the command you would use if you put that model in production.
A few comments on things that I've skipped a little bit. First, it's quite important: we're going to be loading data that will change. Sometimes you will have large training data sets, sometimes you'll have a single row, so it can be quite risky to just rely on pandas inferring what types your columns should be. Whenever you want to make a maintainable version of your code, it's really important to fix the dtypes. That's what we've done here: we've just specified the dtypes for all the columns, and whenever we're actually loading the data with that function — that's why I've provided that function — we make sure that every time we load the data it has the same dtypes. That way we're sure that nothing will break because some data has been inferred as a different type than expected.
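A minimal sketch of pinning the column dtypes on load; the column names and dtypes here are illustrative, not the talk's exact dictionary.

```python
import pandas as pd

DTYPES = {
    "goal": "float64",
    "static_usd_rate": "float64",
    "country": "object",
    "category": "object",
}

def load_raw_data(path):
    # every load gets the same explicit dtypes instead of relying on inference
    return pd.read_csv(path, dtype=DTYPES)
```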
It's also important to have a requirements.txt. Since we are serializing our model here and expecting it to work in the future, we want to make sure we can replicate exactly the same environment over and over again, so a requirements.txt with the actual pinned versions of the libraries will help you do that.
If you're making incremental changes, make sure you're checking the cross-validation score from the tuning step, and not the score on the test set, because that would obviously be massive overfitting. So just do all your feature engineering using the tuning cross-validation score. And yeah, the really cool thing is that this is quite modular, so you can easily write tests, which means you really should: it's quite straightforward to take all those transformers in isolation and build tests for them, and actually that's what the next section is about. With all of that, hopefully updating your model will be a less stressful story.
Just to finish, we have a booth somewhere tomorrow. So, five minutes for questions if you have any. Yep... oh yeah, you could use NumPy, so it relies on... So the question was: does it work only with pandas or with NumPy as well? All of the transformers from scikit-learn accept either a NumPy array or a pandas DataFrame, but it turns out that in the actual transformers I've built I'm using some pandas methods, so my own code will not work without pandas. Oh yeah, and you need to specify the columns there. Yes, so this one is not going to work with NumPy, because it relies on you specifying which columns it should apply to, but the rest should work. Any more questions? Yeah.
So, with pickle or the other one, what was it called... yep, you mentioned a couple of limitations; is there any other way to store a scikit-learn model? The model, yeah, so you're saying: I want to train a model and then pass it to someone else. That's actually the recommended way with scikit-learn; in their documentation they don't have a custom way of doing it, they're using pickle and joblib, which are just generic Python ones. If you wanted to store just the data, then pandas has binary formats you could use, so you could do that, but for the model itself