Documentation changes
himapatel1 authored and shivdeep-singh-ibm committed May 9, 2024
1 parent f777cb3 commit 270f3e5
Showing 1 changed file with 39 additions and 44 deletions.
83 changes: 39 additions & 44 deletions examples/demo_with_launcher.ipynb
@@ -5,33 +5,44 @@
"id": "841e533d-ebb3-406d-9da7-b19e2c5f5866",
"metadata": {},
"source": [
"# Data processing Pipeline <a class=\"anchor\" id=\"top\"></a>"
"<div style=\"background-color: #04D7FD; padding: 20px; text-align: left;\">\n",
" <h1 style=\"color: #000000; font-size: 36px; margin: 0;\">Demo: Data Prep Kit</h1>\n",
" \n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"id": "053ecf08-5f62-4b99-9347-8a0955843d21",
"metadata": {},
"source": [
"Pipeline in this context is a name for a data processing pipeline which transforms data using a series of components called transforms(annotators/filters)\n",
"\n",
"\n",
"\n",
"The list of the components used in this pipeline are:\n",
"## Overview\n",
"Welcome to the demo notebook! Inside, you will find an end-to-end sample data pipeline designed for processing code datasets, beginning with GitHub repositories (.zip files) and culminating in processed data. This notebook provides the following transforms for processing the data. \n",
"\n",
"- [Ingest2parquet](#item1)\n",
"- [Exact Dedup](#item2)\n",
"- [Doc_ID generation](#item3)\n",
"- [Fuzzy Dedup](#item4)\n",
"- [Programming Language Annotation](#item5)\n",
"- [Code quality annotation](#item6)\n",
"- [Programming Language Select](#item5)\n",
"- [Code quality](#item6)\n",
"- [Filtering](#item7)\n",
"- [Tokenization](#item8)\n",
"\n",
"The notebook executes the above transforms one after another to give a sense of\n",
"how data processing can be done using data processing library.\n",
"### Getting started\n",
"\n",
"If you want to try this pipeline on your data, you need to download your github repositories, as .zip files. Please refer to steps below for the same. One can also try it on sample data by downloading a few repos of interest.\n",
"\n",
"Here's how to download a GitHub repository in ZIP format:\n",
"\n",
"1. Go to the desired repository on GitHub.\n",
"2. Click the \"Code\" button near the top right corner of the repository.\n",
"3. Click the \"Download ZIP\" button.\n",
"\n",
"It is possible to change input/output paths and supply your own data and execute the pipeline. Each transform takes in a set of parameters which can also be tweaked as per requirement."
"This will download a ZIP archive of the entire repository to your computer.\n",
"\n",
"Follow these steps and download some repositories from github into a folder. Now your data is ready.\n",
"\n",
"The folder containing this data would serve as the input to the pipeline. Assign the path of this data folder to the variable `zip_input_folder` in the below cell. \n"
]
},
{
@@ -44,7 +55,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "66178913-42b8-426b-a2e9-9587268fd05b",
"metadata": {},
"outputs": [],
@@ -57,33 +68,6 @@
"from data_processing.utils import ParamsUtils"
]
},
{
"cell_type": "markdown",
"id": "887f30ec-e2e2-42b3-93f6-8b83dc7a4011",
"metadata": {},
"source": [
"## Sample Data For the pipeline\n",
"\n",
"This sample pipeline can be used to process data needed to train/finetune models for code. Hence the data we need before starting the \n",
"pipeline can be code repositories from say github/gitlab/bitbucket etc.\n",
"\n",
"### Collecting Data\n",
"\n",
"Easiest way to get data is to download repositories from github in zip format. \n",
"\n",
"Here's how to download a GitHub repository in ZIP format:\n",
"\n",
"1. Go to the desired repository on GitHub.\n",
"2. Click the \"Code\" button near the top right corner of the repository.\n",
"3. Click the \"Download ZIP\" button.\n",
"\n",
"This will download a ZIP archive of the entire repository to your computer.\n",
"\n",
"Follow these steps and download some repositories from github into a folder. Now your data is ready.\n",
"\n",
"The folder containing this data would serve as the input to the pipeline. Assign the path of this data folder to the variable `zip_input_folder`. It would be used\n"
]
},
{
"cell_type": "markdown",
"id": "72510ae6-48b0-4b88-9e13-a623281c3a63",
@@ -137,8 +121,7 @@
"Raw code data files which are in zip format are converted to parquet files, where each row of the parquet file corresponds to a separate code file. Apart from the contents of the code file, every row also contains a unique document id, file URL, name of the repository, source of the data, date of acquisition and license of the repository. For every code file, a language field is also added, which is detected using the filename\n",
"extensions.\n",
"\n",
"It is advised to prepare a dataset in the folder `test-data/input`. The dataset should contain zip\n",
"files of github repos. One way to make this dataset is to download github repos in zip format.\n"
"\n"
]
},
{
@@ -151,10 +134,22 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"id": "482605b2-d814-456d-9195-49a2ec454ef0",
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "NameError",
"evalue": "name 'zip_input_folder' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[3], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# For this stage input folder contains the zip files, each zip file contains a github repo.\u001b[39;00m\n\u001b[0;32m----> 3\u001b[0m input_folder \u001b[38;5;241m=\u001b[39m \u001b[43mzip_input_folder\u001b[49m\n\u001b[1;32m 4\u001b[0m output_folder \u001b[38;5;241m=\u001b[39m parquet_data_output\n",
"\u001b[0;31mNameError\u001b[0m: name 'zip_input_folder' is not defined"
]
}
],
"source": [
"# For this stage input folder contains the zip files, each zip file contains a github repo.\n",
"\n",
@@ -797,7 +792,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
"version": "3.11.8"
}
},
"nbformat": 4,
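
The output recorded for cell `482605b2-d814-456d-9195-49a2ec454ef0` is a `NameError` on `zip_input_folder`, which suggests the cell was run before that variable was assigned. The following is a minimal sketch, not part of this commit, of a check that could run before the Ingest2parquet stage. The helper name `list_repo_zips` is hypothetical, and the path `test-data/input` is taken from the text this commit removes, so it is only a placeholder for wherever the downloaded ZIP files actually live.

```python
import zipfile
from pathlib import Path


def list_repo_zips(folder):
    """Return the sorted paths of valid .zip archives found directly in `folder`."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"input folder does not exist: {root}")
    # Keep only files that end in .zip and that zipfile recognises as archives.
    zips = sorted(p for p in root.glob("*.zip") if zipfile.is_zipfile(p))
    if not zips:
        raise ValueError(f"no valid .zip archives found in {root}")
    return zips


# Placeholder path (assumption): point this at the folder of downloaded repositories.
zip_input_folder = "test-data/input"

# Usage, once the folder exists and holds at least one archive:
#     repo_zips = list_repo_zips(zip_input_folder)
```

Assigning `zip_input_folder` in an early cell and validating it this way would surface a missing or empty folder before the pipeline starts, rather than as a `NameError` partway through the notebook.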