{"nbformat":4,"nbformat_minor":0,"metadata":{"interpreter":{"hash":"3fc12855e5119aa7119eb8b28b2c79e4453dd0444ad04c81a8c18197ce5b843e"},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.9.7"},"orig_nbformat":4,"colab":{"provenance":[]}},"cells":[{"cell_type":"markdown","metadata":{"id":"0fQm9v7oCnl8"},"source":["# 1-Day Hackathon Tutorial\n","This NOTEBOOK will provide an introduction to the process of creating forecast results and the basic methodology.\n","\n","First, let's review the task we will be performing (see README.ipynb for details).\n","\n","**Objective**: To predict the probability of default based on customer data.\n","\n","**Evaluation metric**: ROC-AUC (Area Under the Receiver Operating Characteristic Curve)"]},{"cell_type":"markdown","metadata":{"id":"xcuOu_ttCnmA"},"source":["## Contents\n","- [1.Setup](#scrollTo=a5KcDAE8CnmB)\n","- [2.Loading the Data](#scrollTo=tp0cW0CD11Hi&line=1&uniqifier=1)\n","- [3.Visualizing and Understanding the Data](#scrollTo=3NuP1zcmCnmF&line=1&uniqifier=1)\n","- [4.Preprocessing and Feature Creation](#scrollTo=rsPYkguwCnmO&line=1&uniqifier=1)\n","- [5.Building the Machine Learning Model](#scrollTo=FoKdK60PCnmP&line=1&uniqifier=1)\n","- [6.Creating prediction results](#scrollTo=tn_kdvWYCnmQ&line=2&uniqifier=1)"]},{"cell_type":"markdown","metadata":{"id":"a5KcDAE8CnmB"},"source":["## 1.Setup"]},{"cell_type":"markdown","metadata":{"id":"617twnHyCnmB"},"source":["### 1.1 Import Libraries\n","Let's load basic libraries.\n","Other required libraries will be loaded when we explain them.\n","- numpy: Library for efficient numerical computation\n","- pandas: Library useful for data analysis\n","- matplotlib: Graph drawing library\n","- seaborn: Graph drawing library as 
well"]},{"cell_type":"code","metadata":{"id":"QShif6ZLCnmC"},"source":["# Importing libraries\n","import numpy as np\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import seaborn as sns\n","\n","import warnings\n","warnings.filterwarnings('ignore')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ZsnL75uiCnmD"},"source":["### 1.2 Connect with Google Drive\n","To load the data, we first need to connect this Colab notebook with Google Drive."]},{"cell_type":"code","metadata":{"id":"7S86UJm3PcOe"},"source":["# If you work with Google Colaboratory, please run this as well.\n","from google.colab import drive\n","drive.mount('/content/drive')"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Next, we need to navigate to where the `GCI World_202604_hackathon1` folder is located\n","\n","**IMPORTANT**:<br>\n","Change the path in the `%cd` command below to match the folder where this notebook is saved on Google Drive by **replacing \"WhereThisNotebookIsLocated\" with your actual folder path**.\n","\n","Examples:\n","- You uploaded `GCI World_202604_hackathon1` folder directly under MyDrive (the default folder when you open Google Drive)\n"," - Change to \"/<wbr>content/drive/MyDrive/GCI World_202604_hackathon1\"\n","- You uploaded `GCI World_202604_hackathon1` folder inside a folder named `00_GCIGlobal` under MyDrive\n"," - Change to \"/<wbr>content/drive/MyDrive/00_GCIGlobal/GCI World_202604_hackathon1\"\n","\n","You can easily locate your notebook's directory by:\n","1. Open the Files panel on the left side of Colab\n","2. Navigate through the \"drive\" and \"MyDrive\" folders until you find your notebook's folders\n","3. Click the more actions icon (three vertical dots, $\\vdots$) next to the folder name\n","4. 
Select the option \"Copy path\""],"metadata":{"id":"jImmmc4kQaKd"}},{"cell_type":"code","source":["# Specify the directory where this notebook is located after %cd.\n","%cd \"/content/drive/MyDrive/2026spring_compe/\""],"metadata":{"id":"1u_jx2O8mFAE"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":[""],"metadata":{"id":"S1QAwr57VFPg"}},{"cell_type":"markdown","source":["Run the cell below to check if the path is correctly set."],"metadata":{"id":"oOWzRgsMztnd"}},{"cell_type":"code","source":["import os\n","from pathlib import Path\n","\n","# Automatically get the current working directory\n","current_dir = Path(os.getcwd())\n","\n","# Define file paths using pathlib\n","train_file = current_dir / \"input\" / \"train.csv\"\n","test_file = current_dir / \"input\" / \"test.csv\"\n","sample_sub_file = current_dir / \"input\" / \"sample_submission.csv\"\n","\n","# Check if path exists\n","if train_file.exists() and test_file.exists() and sample_sub_file.exists():\n"," print(\"All files exist and path is correctly set.\")\n","else:\n"," print(\"Some files are missing or path is not correctly set.\")"],"metadata":{"id":"kNikBXwXl3fx"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 2.Loading the Data"],"metadata":{"id":"tp0cW0CD11Hi"}},{"cell_type":"markdown","source":["### 2.1 Data Overview\n","Run the cell to load the dataset as `pd.DataFrame`.\n","\n","**IMPORTANT:**<br>\n","**When you make modifications to preprocessing or model training, always make sure to run all cells from this cell.**"],"metadata":{"id":"2TTHzi1c3a5E"}},{"cell_type":"code","source":["train = pd.read_csv(train_file)\n","test = pd.read_csv(test_file)\n","sample_sub = pd.read_csv(sample_sub_file)"],"metadata":{"id":"phmNjzQa12-E"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"jSh7uV_4CnmF"},"source":["Before conducting a full-scale analysis, we will first review a brief overview of the 
data."]},{"cell_type":"code","metadata":{"id":"tmJzxS2nCnmG"},"source":["# Check train data\n","print(f\"train shape: {train.shape}\")\n","train.head(3)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"0j6hSZwjCnmG"},"source":["# Check test data\n","print(f\"test shape: {test.shape}\")\n","test.head(3)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Excluding the `TARGET` column in the train data and `SK_ID_CURR`, which represents the ID number, you can see that there are 32 types of features."],"metadata":{"id":"Zd5aPgQc2Voe"}},{"cell_type":"markdown","source":["### 2.2 Selecting Features\n","\n","It is often difficult to perform data analysis and preprocessing on all features from the beginning. Instead, an easier way to get started is to start with a small number of features and then add features one by one.\n","\n","This notebook will focus on 3 features. For the remaining features, please refer to the lecture materials, the methods introduced in this notebook, etc., and perform the analysis on your own.\n","\n","Feel free to also ask questions in lectures, office hours, or in the Slack community!"],"metadata":{"id":"4NLboDmT3Wsg"}},{"cell_type":"code","metadata":{"id":"zCF0BZ2lCnmH"},"source":["# Focus on 3 features\n","use_features = [\n"," \"AMT_INCOME_TOTAL\",\n"," \"EXT_SOURCE_2\",\n"," \"OWN_CAR_AGE\",\n"," # add new features to use here\n","]\n","target = train[\"TARGET\"].values\n","\n","# Take explicit copies so that later column assignments modify these frames directly\n","train = train[use_features].copy()\n","train[\"TARGET\"] = target\n","test = test[use_features].copy()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"5zCvRUOCCnmH"},"source":["Let's check the data once again."]},{"cell_type":"code","metadata":{"id":"X4IL84J-CnmI"},"source":["# Check train data\n","print(f\"train shape: {train.shape}\")\n","train.head(3)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"ns__iaZ3CnmI"},"source":["# Check test data\n","print(f\"test 
shape: {test.shape}\")\n","test.head(3)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### **[Next Steps]**\n","> + Try adding more features, starting with the ones that you think are more relevant for predicting default probability. Check `HomeCredit_columns_description.xlsx` to understand what each column represents.\n","> + When you add new features, always restart from [Section 2.1](#scrollTo=2TTHzi1c3a5E&line=5&uniqifier=1) by reloading the dataset."],"metadata":{"id":"cGpOtMKo4gEW"}},{"cell_type":"markdown","metadata":{"id":"3NuP1zcmCnmF"},"source":["## 3.Visualizing and Understanding the Data"]},{"cell_type":"markdown","metadata":{"id":"Ooo0LA-2CnmI"},"source":["The first thing we need to do before building the machine learning model is to **understand the data**. We do this by visualizing and analyzing the data to deepen our understanding of its distribution, missing values, outliers, correlations, and so on. The results of the analysis obtained at this stage will be useful for preprocessing, feature creation, and selection of machine learning models, which are all important to building models with better prediction ability."]},{"cell_type":"markdown","metadata":{"id":"9tnoSfzvCnmJ"},"source":["### 3.1 Checking missing values\n","In this section, we check for missing values.\n","This is important as **most machine learning models cannot be trained on data with missing values**. If there are missing values, they need to be filled in with some value or otherwise handled before training."]},{"cell_type":"code","metadata":{"id":"10WA5hRBCnmJ"},"source":["# Check missing values of train data\n","train.isnull().sum()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"3Y7FVExVCnmJ"},"source":["# Check missing values of test data\n","test.isnull().sum()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Erj6eK80CnmJ"},"source":["We found that there are missing values in `EXT_SOURCE_2` and `OWN_CAR_AGE`. 
We will deal with these missing values later. Of course, other features that we are not covering here may also have missing values, so please check them yourself."]},{"cell_type":"markdown","source":["**Findings**:<br>\n","* Need to deal with missing values in `EXT_SOURCE_2` and `OWN_CAR_AGE`"],"metadata":{"id":"YrCvOnzO97pC"}},{"cell_type":"markdown","metadata":{"id":"PHtdHORMCnmJ"},"source":["### 3.2 Visualization and analysis of each feature\n","In this section, we visualize and analyze each feature to see what kind of characteristics it has."]},{"cell_type":"markdown","source":["#### TARGET column"],"metadata":{"id":"5uYeP32t7ZV5"}},{"cell_type":"code","metadata":{"id":"EYRYwJ_LCnmK"},"source":["# The distribution of the target (default or not)\n","sns.countplot(data=train, x=\"TARGET\")\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"cw4sq8rSCnmK"},"source":["We can see that the **distribution of the target variable is highly skewed**. Data in which the distribution of the target variable is highly skewed in this way is called **imbalanced data**.\n","\n","When dealing with imbalanced data, we need to be particularly careful in selecting evaluation metrics. For example, if you choose accuracy, simply predicting all zeros will result in a high accuracy. **Choosing such an inappropriate metric can hide the fact that the machine learning model fails to predict well on new data**."]},{"cell_type":"markdown","source":["#### EXT_SOURCE_2 column"],"metadata":{"id":"Z2Or9DAj8YLa"}},{"cell_type":"code","metadata":{"id":"ZVlLRi4TCnmL"},"source":["# The distribution of EXT_SOURCE_2\n","sns.displot(data=train, x=\"EXT_SOURCE_2\")\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"RaEXl1zICnmM"},"source":["We can see that `EXT_SOURCE_2` is normalized between 0 and 1. 
It seems we can handle this feature as it is."]},{"cell_type":"markdown","source":["**Findings**:<br>\n","* No additional preprocessing is needed beyond filling the missing values"],"metadata":{"id":"kR91l5IQ-0xX"}},{"cell_type":"markdown","source":["#### AMT_INCOME_TOTAL column"],"metadata":{"id":"ZHQYht0C8aYB"}},{"cell_type":"code","metadata":{"id":"2WwsHs0JCnmM"},"source":["# The distribution of AMT_INCOME_TOTAL\n","sns.displot(data=train, x=\"AMT_INCOME_TOTAL\")\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"JQbzv0D8CnmM"},"source":["The visualization of `AMT_INCOME_TOTAL` is hard to interpret. This may be caused by the presence of a small number of outliers that take large values. To visualize data like this, a logarithmic transformation can be effective."]},{"cell_type":"code","metadata":{"id":"kgeKevDMCnmM"},"source":["# The distribution of AMT_INCOME_TOTAL (logarithmic scale)\n","sns.displot(data=train, x=\"AMT_INCOME_TOTAL\", log_scale=10)\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"j4r3xGRMCnmM"},"source":["With a logarithmic scale, the graph becomes readable.\n","Income is supposed to be a continuous value, but here it looks discrete. Let's count the number of distinct `AMT_INCOME_TOTAL` values."]},{"cell_type":"code","metadata":{"id":"hWDiKd98CnmM"},"source":["# Count the distinct values of AMT_INCOME_TOTAL\n","len(train[\"AMT_INCOME_TOTAL\"].unique())"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"3VcQ-MEBCnmN"},"source":["There are 171202 rows in train, but `AMT_INCOME_TOTAL` consists of only 1641 different values. 
Let's look at the 10 most frequent values."]},{"cell_type":"code","metadata":{"id":"s8mETFkCCnmN"},"source":["# 10 most frequent values of AMT_INCOME_TOTAL\n","train[\"AMT_INCOME_TOTAL\"].value_counts().head(10)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"82L-z5gQCnmN"},"source":["It appears that `AMT_INCOME_TOTAL` is not an exact annual income, but rather a rounded figure."]},{"cell_type":"markdown","source":["**Findings**:<br>\n","* Should the outliers in the data be addressed?"],"metadata":{"id":"RmcAylwL_Jc_"}},{"cell_type":"markdown","source":["#### OWN_CAR_AGE column"],"metadata":{"id":"U_IUFERV8wDW"}},{"cell_type":"code","metadata":{"id":"bXBlabmLCnmN"},"source":["# The distribution of OWN_CAR_AGE\n","sns.displot(data=train, x=\"OWN_CAR_AGE\")\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"LheIcW5LCnmN"},"source":["`OWN_CAR_AGE` can be inferred to be in years from the scale of values. In addition, the distribution is natural from 0 to 40, but there is an unnatural cluster around 60 to 70. It is hard to imagine that the number of cars of a given age would suddenly increase like this, so these values are considered to be outliers."]},{"cell_type":"markdown","source":["**Findings**:<br>\n","* Treat values of 60 or more as outliers"],"metadata":{"id":"T6aOlsbS_QeD"}},{"cell_type":"markdown","metadata":{"id":"kl23q2zgCnmN"},"source":["Up to this point, we have visualized and analyzed each feature. You have probably noticed that visualization requires some ingenuity and that it can deepen your understanding of the data. Visualizing and analyzing the features not covered here is likely to lead to improved prediction accuracy."]},{"cell_type":"markdown","source":["### **[Next Steps]**\n","> + Check for missing values for the features you have added in Section 2.2.\n","> + Visualize the features you have added. 
Is the feature categorical or continuous? What type of graph is most effective to understand it?\n","> + What do you notice about the features? What kind of preprocessing is needed?"],"metadata":{"id":"FzaMVo9x8_cS"}},{"cell_type":"markdown","metadata":{"id":"rsPYkguwCnmO"},"source":["## 4.Preprocessing and Feature Creation\n","Here, we will conduct the preprocessing and create new features based on what we have learned in the preceding visualization and analysis."]},{"cell_type":"markdown","metadata":{"id":"z0cPsOyMCnmO"},"source":["### EXT_SOURCE_2 column\n","Fill missing values in `EXT_SOURCE_2`. There are various methods for filling in missing values, but in this case, since the number of missing values is small, we simply use the average value.\n","\n","**IMPORTANT**:\n","When you fill the missing values in the test data, you need to **fill with the average of the train data**."]},{"cell_type":"code","metadata":{"id":"yO-iJnwLCnmO"},"source":["# Fill missing values of EXT_SOURCE_2 with the average\n","# Assign the result back instead of using inplace=True on a selected column, which may not update the DataFrame\n","train[\"EXT_SOURCE_2\"] = train[\"EXT_SOURCE_2\"].fillna(train[\"EXT_SOURCE_2\"].mean())\n","test[\"EXT_SOURCE_2\"] = test[\"EXT_SOURCE_2\"].fillna(train[\"EXT_SOURCE_2\"].mean()) # Use average of train data to fill test data\n","\n","train.isnull().sum()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### **[Next Steps]**\n","> + Apply preprocessing to the features you added. 
Is it correctly preprocessed?\n","> + Explore other preprocessing methods to apply to the features.\n","> + If you have errors, try reloading the dataset by going back to [Section 2.1](#scrollTo=2TTHzi1c3a5E&line=5&uniqifier=1)."],"metadata":{"id":"lxZerNEvBleb"}},{"cell_type":"markdown","metadata":{"id":"OtdYByRsCnmP"},"source":["### OWN_CAR_AGE column\n","First, we will replace the unnatural outliers of 60 or more with `np.nan` (missing values)."]},{"cell_type":"code","metadata":{"id":"7-ojzBAtCnmP"},"source":["# Treat values of 60 or more (outliers) in OWN_CAR_AGE as missing values\n","train.loc[train[\"OWN_CAR_AGE\"] >= 60, \"OWN_CAR_AGE\"] = np.nan\n","test.loc[test[\"OWN_CAR_AGE\"] >= 60, \"OWN_CAR_AGE\"] = np.nan"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"aUbZ-qsiCnmP"},"source":["Next, we consider the handling of missing values. The original `OWN_CAR_AGE` had 112992 missing values out of 171202 rows. With such a large number of missing values, it is difficult and impractical to fill them in reliably. Therefore, we will group `OWN_CAR_AGE` by decade (e.g. 
Group 1: 0-9 years, Group 2: 10-19 years, etc.), then apply **One Hot Encoding**."]},{"cell_type":"code","metadata":{"id":"8bHTyLv_CnmP"},"source":["# Divide OWN_CAR_AGE into groups\n","train[\"OWN_CAR_AGE\"] = train[\"OWN_CAR_AGE\"] // 10\n","test[\"OWN_CAR_AGE\"] = test[\"OWN_CAR_AGE\"] // 10\n","\n","train[\"OWN_CAR_AGE\"].unique()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"GIcCvBHvCnmP"},"source":["# Apply One Hot Encoding to OWN_CAR_AGE\n","train_car_age_ohe = pd.get_dummies(train[\"OWN_CAR_AGE\"]).add_prefix(\"OWN_CAR_AGE_\")\n","test_car_age_ohe = pd.get_dummies(test[\"OWN_CAR_AGE\"]).add_prefix(\"OWN_CAR_AGE_\")\n","\n","# Align columns so train and test have the same dummy columns\n","test_car_age_ohe = test_car_age_ohe.reindex(columns=train_car_age_ohe.columns, fill_value=0)\n","\n","# Add the one hot encoded columns to train/test\n","train = pd.concat([train, train_car_age_ohe], axis=1)\n","test = pd.concat([test, test_car_age_ohe], axis=1)\n","\n","# Remove original OWN_CAR_AGE\n","train.drop('OWN_CAR_AGE', axis=1, inplace=True)\n","test.drop('OWN_CAR_AGE', axis=1, inplace=True)\n","\n","train.head(5)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### **[Next Steps]**\n","> + Apply preprocessing to the features you added. 
Is it correctly preprocessed?\n","> + Explore other preprocessing methods to apply to the features.\n","> + If you have errors, try reloading the dataset by going back to [Section 2.1](#scrollTo=2TTHzi1c3a5E&line=5&uniqifier=1)."],"metadata":{"id":"eqNUNF2-6X_o"}},{"cell_type":"markdown","metadata":{"id":"FoKdK60PCnmP"},"source":["## 5.Building the Machine Learning Model\n","Now, we are ready to start building the machine learning model."]},{"cell_type":"markdown","metadata":{"id":"2LLoN-aQIsrk"},"source":["### 5.1 Import Additional Libraries\n","First, we import the necessary libraries for training and evaluation.\n","\n","- `train_test_split`: Split data into training and evaluation data.\n","- `StandardScaler`: Standardize the data.\n","- `roc_auc_score`: Calculate ROC-AUC, the evaluation metric for this competition."]},{"cell_type":"code","metadata":{"id":"RVwkQ5b8CnmP"},"source":["# Importing libraries\n","from sklearn.model_selection import train_test_split\n","from sklearn.preprocessing import StandardScaler\n","from sklearn.metrics import roc_auc_score"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"bcgHtFrRJpev"},"source":["### 5.2 Preparing the Data\n","Split the data into explanatory and target variables. The target variable for this dataset is `TARGET` column and the rest are explanatory variables."]},{"cell_type":"code","metadata":{"id":"joie-z89KBdg"},"source":["# Split the data into explanatory and target variables\n","X = train.drop(\"TARGET\", axis=1).values\n","y = train[\"TARGET\"].values\n","X_test = test.values\n"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"M85-Y6M2KQBX"},"source":["Standardize the data. Standardization is the operation of transforming the values so that the mean is 0 and the variance is 1. 
Some models, such as logistic regression and neural networks, do not learn well without scaling the values in this way."]},{"cell_type":"code","metadata":{"id":"x8lmkBJbCnmQ"},"source":["# Standardization\n","sc = StandardScaler()\n","sc.fit(X)\n","X_std = sc.transform(X)\n","X_test_std = sc.transform(X_test)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"3faa7zQnKmSS"},"source":["### 5.3 Training the Model\n","We first split the training data into training data and validation data. This method of keeping a portion of the training data for evaluation and not using it for training is called the **holdout method**. This is one method to approximate the model's predictive ability on unknown data (**generalization** performance).\n","\n","Here, we will use 70% of the data as training data and 30% as validation data"]},{"cell_type":"code","metadata":{"id":"_PipkeXdKlvK"},"source":["# Split the original data into the training data and the validation data\n","X_train, X_valid, y_train, y_valid = train_test_split(X_std, y, test_size=0.3, stratify=y, random_state=0)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"I62WMgQiMU8J"},"source":["Now, let's create models with logistic regression and random forest."]},{"cell_type":"code","metadata":{"id":"bpzK_zyWCnmQ"},"source":["# Logistic Regression\n","from sklearn.linear_model import LogisticRegression\n","\n","lr = LogisticRegression(random_state=0)\n","lr.fit(X_train, y_train)\n","\n","lr_train_pred = lr.predict_proba(X_train)[:, 1]\n","lr_valid_pred = lr.predict_proba(X_valid)[:, 1]\n","print(f\"Train Score: {roc_auc_score(y_train, lr_train_pred)}\")\n","print(f\"Valid Score: {roc_auc_score(y_valid, lr_valid_pred)}\")"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"qvRvrmC2CnmQ"},"source":["# Random Forest\n","from sklearn.ensemble import RandomForestClassifier\n","\n","rf = RandomForestClassifier(random_state=0, 
max_depth=10)\n","rf.fit(X_train, y_train)\n","\n","rf_train_pred = rf.predict_proba(X_train)[:, 1]\n","rf_valid_pred = rf.predict_proba(X_valid)[:, 1]\n","print(f\"Train Score: {roc_auc_score(y_train, rf_train_pred)}\")\n","print(f\"Valid Score: {roc_auc_score(y_valid, rf_valid_pred)}\")"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Compare the validation ROC-AUC of the two models. Note that the model with the higher validation score may differ depending on which features you added and how you preprocessed them. Use the validation score, not the training score, as the basis for comparison."],"metadata":{"id":"cg_HBfWDFrdM"}},{"cell_type":"markdown","metadata":{"id":"CSthGRA-ZXaY"},"source":["### 5.4 (Optional) Ensemble Learning\n","Now that we have created two models, we can try combining these two models for better predictive ability (**ensemble learning**). There are various methods for ensemble learning, but here we will simply average the predicted probabilities of the two models."]},{"cell_type":"code","metadata":{"id":"QwEBg6HbGPGn"},"source":["train_pred = (lr_train_pred + rf_train_pred) / 2\n","valid_pred = (lr_valid_pred + rf_valid_pred) / 2\n","\n","print(f\"Train Score: {roc_auc_score(y_train, train_pred)}\")\n","print(f\"Valid Score: {roc_auc_score(y_valid, valid_pred)}\")"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"RFXE351Dah5p"},"source":["Whether the simple average ensemble improves over the individual models depends on your features and preprocessing. As a rule of thumb, choose whichever of logistic regression, random forest, or the ensemble achieves the highest validation score, and use it for the final prediction in the next section."]},{"cell_type":"markdown","source":["### **[Next Steps]**\n","> + Is the holdout method the best method to evaluate your model?\n","> + Are the model's hyperparameters optimized? 
Which hyperparameters need tuning?\n","> + Explore other models to use to make predictions.\n","> + Explore other ensembling methods to further improve the model's performance.\n","> + If you have errors, try reloading the dataset by going back to [Section 2.1](#scrollTo=2TTHzi1c3a5E&line=5&uniqifier=1)."],"metadata":{"id":"R33--AYQGV--"}},{"cell_type":"markdown","metadata":{"id":"tn_kdvWYCnmQ"},"source":["## 6.Creating Prediction Results\n","Finally, let's make a prediction for the test data, and prepare a CSV file to submit."]},{"cell_type":"markdown","source":["### 6.1 Predicting on the test data\n","Make the final prediction with the model that achieved the highest validation ROC-AUC in Sections 5.3 and 5.4. The cell below uses logistic regression by default; switch to random forest or the ensemble if your validation results justify it.\n","\n","```python\n","# If random forest model was better\n","pred = rf.predict_proba(X_test_std)[:, 1]\n","\n","# If the simple average ensemble was better\n","pred = (lr.predict_proba(X_test_std)[:, 1] + rf.predict_proba(X_test_std)[:, 1]) / 2\n","```"],"metadata":{"id":"FVfIg3QgP_8d"}},{"cell_type":"code","metadata":{"id":"-IPNn-_ZCnmQ"},"source":["# Make predictions for the test data\n","# Change model name if needed\n","pred = lr.predict_proba(X_test_std)[:, 1]"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["### 6.2 Saving the prediction as CSV file [DO NOT CHANGE]\n","**WARNING**: DO **NOT** CHANGE THE CODE BELOW!!!"],"metadata":{"id":"THpn4GVgQD5d"}},{"cell_type":"code","metadata":{"id":"c13Ycte5W047"},"source":["# Put the prediction into the format of submission\n","sample_sub['TARGET'] = pred\n","sample_sub"],"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# Create the \"output\" directory if it doesn't exist\n","output_dir = current_dir / \"output\"\n","os.makedirs(output_dir, exist_ok=True)\n","\n","# Specify the new output file path\n","output_file = output_dir / \"submission.csv\"\n","\n","# Save the CSV file to the \"output\" directory\n","sample_sub.to_csv(output_file, 
index=False)"],"metadata":{"id":"i2PJ33yqnWei"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"LEIQ6tJFYSs4"},"source":["That's all for the tutorial on the Home Credit Default Risk competition! Submit your CSV file to Omnicampus to see the result.\n","\n","Only 3 of the available features are covered in this notebook, so there is a lot of room for improvement. Check out **[Next Steps]** in each section to see what you can do to improve your score."]}]}