Creating Dummy Data using Python Faker Package

It is critical to test and evaluate software and hardware with dummy data before working with actual data. Running the code through various scenarios and test cases helps detect possible bugs. Fake data can be easily generated with the Python library Faker. In this tutorial, we will learn how to generate dummy data using the Faker library.

Methods and types for generating dummy data
Other dummy data creation methods
A few more words regarding dummy data
Conclusion
References

Pre-requisites

To follow along, you must have:

A good understanding of the Python programming language.
Python and a Python IDE installed. To install Python, visit this documentation.

To begin with, let's install the Python library, Faker, as shown:

pip install faker

Methods and types for generating dummy data

Create and initialize faker generators

You can create and initialize a Faker generator by calling Faker(). The generator exposes methods for each kind of data you want to produce.

from faker import Faker
ourFake = Faker()

Create random text

We use the text() method to create a paragraph of random text:

ourFake.text()

Output:

Business happy black arrives end. Election wear list. Would lay though.\nCentury collection everybody key fight. Goal nation woman assume both.

To generate addresses, we use the address() method, and for names we use the name() method:

ourFake.address()

Output:

3722 Garza Port\nSmithshire, UT 28618

ourFake.name()

Output:

Dana Williams

Create dummy data using `seed()`

You may wish to reproduce the same set of dummy data across runs. Seeding the generator makes this possible.

The seed() method initializes the state of the underlying random number generator. With the same seed, the generator produces the same sequence of values every time, whether the code is executed on the same system or not, provided the same version of Faker is used. The number you pass to seed() serves as the seed.

Seeding only fixes the random state. Faker still generates the values, so you don't have to write the dummy data by hand.

You can read more about seeding here.

The following code creates dummy data using the seed() method:

Faker.seed(111)
print(ourFake.text())

Output:

Management huge pay college cover instead. Consumer leg start research her.
Sound finish set draw notice imagine that. Blue between least democratic down week wait. Reduce inside me.

Unique data generation

You can use the generator's unique property to ensure that no value is returned twice.

# generates 10 unique names
names = [ourFake.unique.name() for b in range(10)]

Command-line usage

The faker package can also be invoked from the command line, which lets you generate dummy data by typing a command directly in the command prompt.

Here is an example in the command prompt:

faker address

Output:

173 Castro Ferry\nSouth Alexandriafort, WI 38412

Locales

The Faker generator can produce localized fake data if a locale is passed to it as an argument.

Localized dummy data follows the conventions of that locale, such as its language and its name and address formats.

The following are some names generated with the locale set to en-US:

ourFake = Faker('en-US')
for b in range(10):
    print(ourFake.name())

Output:

Daniel Davidson
Kristin Stewart
Derrick Tran
Matthew Mccarty
Kevin Davis
Kim Watkins
Ashley Humphrey
Corey Webb
Melissa Barrera
Juan Greene

You can view the full list of locales here.

Currency

The Faker generator can produce fake currency data using the currency() method, which returns a tuple containing the currency code and name.

Here is one such example:

ourFake.currency()

Output:

('CUC', 'Cuban convertible peso')

You can also use the generator's methods to produce dummy data about cryptocurrencies:

ourFake.cryptocurrency()

Output:

('POT', 'PotCoin')

Create a dummy dataset

Now, let's create a dummy dataset that could be used for machine learning.

Let's generate records for 20 people, containing their job, company name, residence, and so on.

We'll build the dataset with profile(), a method from one of Faker's standard providers, and store it in a pandas DataFrame.

import pandas as pan
ourProfile = [ourFake.profile() for i in range(20)]
ourDataFrame = pan.DataFrame(ourProfile)
print(ourDataFrame)

Profile data

Let's understand more about providers.

Providers are classes that group related data-generation methods. A Faker generator exposes the methods of all its registered providers, so you call them directly on the generator, as in ourFake.name().

Each provider is responsible for one kind of data and builds its values from the generator's underlying random source.

Providers supply many useful methods, such as name() and address(). Standard providers, like internet and person, ship with Faker, while others are community-created, like music.

Other dummy data creation methods

They are as follows:

Using NumPy's `random` module

Pseudorandom numbers can be generated with functions from NumPy's random module, such as rand(), randint(), and many more.

import numpy as num
myArray = num.random.rand(5)
print("Array : \n", myArray)

Output:

Array : 
 [0.02471149 0.41561035 0.76783821 0.89628689 0.8540258 ]
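
As with Faker, NumPy's output can be made repeatable by seeding. The following sketch uses default_rng(), NumPy's newer generator interface; the seed value and the variable names are arbitrary choices of our own.

```python
import numpy as np

# A seeded generator: the same seed gives the same numbers.
rng = np.random.default_rng(seed=42)

# Five floats in the half-open interval [0, 1).
floats = rng.random(5)

# Five integers from 1 to 6 (the upper bound is exclusive), like dice rolls.
rolls = rng.integers(low=1, high=7, size=5)

# A second generator with the same seed reproduces the first array.
repeat = np.random.default_rng(seed=42).random(5)

print(floats)
print(rolls)
print(np.array_equal(floats, repeat))  # True
```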

Fauxfactory

FauxFactory is a random data generator that makes automated testing easier. When building tests for your application, you may need to supply the parts under test with random, non-specific data.

It is a convenient option whenever you want to test your code quickly.

You can learn more about Fauxfactory here.

A few more words regarding dummy data

The dummy variable trap is a situation in which dummy (indicator) variables are so highly interconnected that the value of one can be predicted from the others.

It arises when the attributes are highly correlated (multicollinear), and it can be avoided by dropping one of the correlated dummy variables.

Multicollinearity occurs when the correlations between two or more independent variables in a regression model are very high.
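
The trap is easiest to see with one-hot encoding. In the sketch below, the color column is a made-up example of our own. Encoding all three categories gives columns that always sum to 1, so each one is fully determined by the other two. Passing drop_first=True to pandas' get_dummies() drops one column and removes that redundancy.

```python
import pandas as pd

# A hypothetical categorical attribute with three categories.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One dummy column per category: every row sums to exactly 1.
full = pd.get_dummies(colors)
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]

# Dropping the first category leaves two independent columns.
reduced = pd.get_dummies(colors, drop_first=True)
print(list(reduced.columns))  # ['green', 'red']
```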

Conclusion

We generated various types of dummy data using Faker, a Python library. In this tutorial, we learned how to create fictitious data like names, addresses, and currency data.

While exploring locales, we saw how to create data specific to a particular region.

We also learned how dummy datasets can be generated for training machine learning models.

Applying these techniques when testing your application can save you a lot of time and effort.

Happy coding!