Familiarity with Pipenv. See here for my post on Pipenv.
Familiarity with JupyterLab. See here for my post on JupyterLab.
Getting started
Let's create the generating-fake-csv-data-with-python directory and install Pillow.
# Make the `generating-fake-csv-data-with-python` directory
$ mkdir generating-fake-csv-data-with-python
$ cd generating-fake-csv-data-with-python
# Create a folder to place your icons
$ mkdir docs
# Init the virtual environment
$ pipenv --three
$ pipenv install faker
$ pipenv install --dev jupyterlab
At this stage, we have the packages that we
Now we can start up the notebook server.
# Startup the notebook server
$ pipenv run jupyter-lab
# ... Server is now running on http://localhost:8888/lab
The server will now be up and running.
Creating the notebook
Once on http://localhost:8888/lab, select to create a new Python 3 notebook from the launcher.
Ensure that this notebook is saved in generating-fake-csv-data-with-python/docs/generating-fake-data.ipynb.
We will create four cells to handle four parts of this mini project:
Importing Faker and generating data.
Importing the CSV module and exporting the data to a CSV file.
Before generating our data, we need to look at what we are trying to emulate.
Emulating The Netflix Original Movies IMDB Scores Dataset
Looking at the preview for our dataset, we can see that it contains the following columns and example rows:
Title
Genre
Premiere
Runtime
IMDB Score
Language
Enter the Anime
Documentary
August 5, 2019
58
2.5
English/Japanese
Dark Forces
Thriller
August 21, 2020
81
2.6
Spanish
We only have two rows for example, but from here we can make a few assumptions about how we want to emulate it.
In our langauges, we will stick to a single language (unlike the example English/Japanese).
IMDB scores are between 1 and 5. We won't be too harsh on any movies and go from 0.
Runtimes should emulate a real movie - we can set it to be between 50 and 150 minutes.
Genres may be something we need to write our own Faker provider for.
We are going to be okay with non-sense data, so we can just use a string generator for the names.
With this said, let's look at how we can fake this.
Emulating a value for each column
We will create seven cells - one to import Faker and one for each column.
For the first cell, we will import Faker.
from faker import Faker
fake = Faker()
Secondard, we will fake a movie name with words:
def capitalize(str):
return str.capitalize()
words = fake.words()
capitalized_words = list(map(capitalize, words))
movie_name = ' '.join(capitalized_words)
print(movie_name) # Serve Fear Consider
Third, we will generate a date this decate and use the same format as the example:
from datetime import datetime
date = datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")
print(date) # April 30, 2020
Fourth, we will create our own fake data geneartor for the genre:
# creating a provider for genre
from faker.providers import BaseProvider
import random
# create new provider class
class GenereProvider(BaseProvider):
def movie_genre(self):
return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])
# then add new provider to faker instance
fake.add_provider(GenereProvider)
# now you can use:
movie_genre = fake.movie_genre()
print(movie_genre) # Horror
Fifth, we will do the same for a language:
# creating a provider for genre
from faker.providers import BaseProvider
import random
# create new provider class
class LanguageProvider(BaseProvider):
def language(self):
return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])
# then add new provider to faker instance
fake.add_provider(LanguageProvider)
# now you can use:
language = fake.language()
print(language) # Spanish
Sixth we need to generate a runtime:
# Getting random movie length
movie_len = random.randrange(50, 150)
print(movie_len) # 143
Lastly, we need a rating with one decimal point between 1.0 and 5.0: