ETL Python Tutorial.py, or Streamlining Data Connections with a Simple Python ETL Framework


ETL Python – Instructions for Use and Understanding

Extract-Transform-Load (ETL) is a data integration process that extracts data from various sources, transforms it into a unified format, and loads it into a target database or data warehouse. ETL is a critical process in data warehousing and business intelligence applications. Python, a high-level programming language, has emerged as a popular choice for ETL operations due to its simplicity, flexibility, and wide range of libraries and frameworks.

One of the advantages of using Python for ETL is its ability to handle diverse data types, such as structured, semi-structured, and unstructured data. Python’s built-in data manipulation capabilities, combined with its extensive ecosystem of third-party libraries, make it a powerful tool for transforming data into a format that can be easily analyzed and visualized.

In addition, Python's support for parallel processing and distributed computing allows it to scale effectively for large data sets. Whether you're working with data stored in a local database or a cloud-based analytics warehouse, Python provides a wide range of tools and techniques for optimizing performance and improving efficiency.

Overall, Python's versatility and extensibility make it an excellent choice for ETL tasks, whether you're working on a small-scale project or a large-scale enterprise system. By leveraging Python's strengths, you can streamline your data management processes, improve data quality, and gain deeper insights into your organization's operations.

Python is an elegant, versatile language with an ecosystem of powerful modules and code libraries. Writing ETL in Python starts with knowledge of the relevant frameworks and libraries, such as workflow management utilities, libraries for accessing and extracting data, and fully featured ETL toolkits.

How to Use Python for ETL

ETL using Python can take many forms based on technical requirements and business objectives, and the right approach depends on the compatibility of existing tools and how much developers need to build from scratch. Python's strengths lie in working with indexed data structures and dictionaries, which are crucial in ETL operations.

Python is versatile enough that users can code almost any ETL process with native data structures. For example, filtering NaN values out of a list is easy with help from the built-in math module:

import math
data = [1.0, 3.0, 6.5, float('NaN'), 40.0, float('NaN')]
filtered = []
for value in data:
    if not math.isnan(value):
        filtered.append(value)

Users can also take advantage of list comprehensions for the same purpose:

filtered = [value for value in data if not math.isnan(value)]

Coding the entire ETL process in Python from scratch isn't particularly efficient, so most real-world Python ETL work ends up being a mix of pure Python code and externally defined functions or objects, such as those from the libraries discussed below. For instance, if data is a pandas DataFrame, users can drop every row containing nulls in a single call:

filtered = data.dropna()

Python software development kits (SDKs), application programming interfaces (APIs), and other utilities are available for many platforms, some of which may be useful when coding ETL. For example, the Anaconda platform is a Python distribution of modules and libraries relevant to working with data. It includes its own package manager and cloud hosting for sharing code notebooks and Python environments.

Much of the advice relevant to general coding in Python also applies to programming for ETL. For example, the code should be "Pythonic," meaning programmers should follow language-specific guidelines that keep scripts concise and legible and that represent the programmer's intentions. Documentation is also important, as is good package management and watching out for dependencies.

What is an ETL File?

An ETL file is not a specific file format or type. Instead, it refers to the set of files and processes used in an ETL workflow. An ETL workflow might involve extracting data from various sources, such as databases, spreadsheets, or flat files; transforming the data with scripts or programs to clean, normalize, or aggregate it; and then loading the transformed data into a target database or data warehouse.

Libraries

Beyond programming languages for manually building ETL processes, a wide set of platforms and tools can now perform ETL for enterprises. There are benefits to using existing ETL tools over trying to build a data pipeline in Python from scratch. ETL tools can compartmentalize and simplify data pipelines, leading to cost and resource savings, increased employee efficiency, and more performant data ingestion.

ETL tools include connectors for many popular data sources and destinations and can ingest data quickly. Organizations can add or change source or target systems without waiting for programmers to work on the pipeline first. ETL tools keep pace with SaaS platforms’ updates to their APIs as well, allowing data ingestion to continue uninterrupted.

Although manual coding provides the highest level of control and customization, outsourcing ETL design, implementation, and management to expert third parties rarely represents a sacrifice in features or functionality. ETL has been a critical part of IT infrastructure for years, so ETL service providers now cover most use cases and technical requirements.

Extract Data

Python provides several libraries and frameworks that simplify defining an ETL process. These libraries and frameworks enable users to extract data from various sources, transform it according to their requirements, and load it into a target database or data warehouse. Some of the popular libraries used for ETL programming in Python are Pandas, NumPy, and Scikit-learn.

Do One's Best with Pandas:

Pandas is a popular Python library for data manipulation and analysis. It provides data structures for efficiently storing and manipulating large datasets, along with functions for data cleaning, aggregation, and filtering. Pandas can read data from various sources such as CSV, Excel, and SQL databases, and it can also write data back out to files and databases. Pandas is widely used in ETL work because it provides a simple and efficient way to perform data transformations: its DataFrame and Series structures make it easy to filter and manipulate data, and its functions for joining and merging data are essential in most ETL methodologies.

Python ETL code example:

import pandas as pd
import sqlite3

# Extract data from CSV file
sales_data = pd.read_csv('sales_data.csv')

# Transform data
sales_data['date'] = pd.to_datetime(sales_data['date']) # convert date column to datetime
sales_data['revenue'] = sales_data['units_sold'] * sales_data['price'] # calculate revenue column

# Load data into SQLite database
conn = sqlite3.connect('sales.db')
sales_data.to_sql('sales', conn, if_exists='replace', index=False)

# Close database connection
conn.close()

NumPy and Its Example:

NumPy is a fundamental library for scientific computing in Python. It provides functions for working with arrays and matrices, as well as mathematical functions for data manipulation and analysis. NumPy is widely used in ETL work for numerical computation and data manipulation.

NumPy's array data structure is highly efficient for handling large datasets. NumPy also provides functions for reshaping and transposing arrays, which are essential in ETL operations, and linear algebra routines for matrix manipulation, which are useful in data transformation.

 

ETL script in Python example:

import numpy as np

# Extract data from text file
data = np.genfromtxt('sensor_data.txt', delimiter=',')

# Transform data
data = np.delete(data, [0,1], axis=1) # remove first two columns
data = np.nan_to_num(data) # replace NaN values with 0

# Load data into NumPy array
array = np.array(data)

# Print the array shape and the first few rows
print('Array shape:', array.shape)
print('First 5 rows:', array[:5])

ETL Scikit-learn:

Scikit-learn is a popular machine-learning library in Python. It provides functions for data preprocessing, feature extraction, and data modeling. Scikit-learn is widely used in ETL operations for data cleaning, normalization, and transformation.

Scikit-learn provides functions for handling missing data, scaling features, and encoding categorical variables. It also provides functions for dimensionality reduction and feature selection, which are useful in data transformation, and its machine learning algorithms can be used for data modeling in ETL operations.

 

ETL Python example:

# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset into a Pandas dataframe
df = pd.read_csv('data.csv')

# Separate the features from the target variable
X = df.drop('target', axis=1)
y = df['target']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline to scale the data and perform principal component analysis (PCA)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))
])

# Transform the training set using the pipeline
X_train_transformed = pipeline.fit_transform(X_train)

# Create a logistic regression classifier
clf = LogisticRegression()

# Train the classifier on the transformed training set
clf.fit(X_train_transformed, y_train)

# Transform the testing set using the pipeline
X_test_transformed = pipeline.transform(X_test)

# Make predictions on the transformed testing set
y_pred = clf.predict(X_test_transformed)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy of the model
print('Accuracy:', accuracy)

PySpark:

PySpark is a Python API for Apache Spark, a distributed computing framework for big data processing. PySpark provides functions for data manipulation and transformation on large datasets. PySpark’s DataFrame API provides a high-level interface for data transformation.

PySpark provides functions for handling missing data, filtering, and aggregating data. PySpark also provides functions for joining and merging data, which is essential in data transformation. PySpark’s MLlib library also provides functions for data transformation for machine learning applications.

 

Create ETL with PySpark example:

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Create a Spark session
spark = SparkSession.builder.appName('ETL Project Examples').getOrCreate()

# Load the dataset into a Spark dataframe
df = spark.read.csv('data.csv', header=True, inferSchema=True)

# Split the dataset into training and testing sets
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Assemble the feature columns into a vector, scale them, perform principal
# component analysis (PCA), and fit a logistic regression classifier
feature_cols = [c for c in df.columns if c != 'target']
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol='features'),
    StandardScaler(inputCol='features', outputCol='scaledFeatures'),
    PCA(k=2, inputCol='scaledFeatures', outputCol='pcaFeatures'),
    LogisticRegression(featuresCol='pcaFeatures', labelCol='target')
])

# Fit the full pipeline to the training set
model = pipeline.fit(train)

# Make predictions on the testing set
predictions = model.transform(test)

# Evaluate the performance of the model on the testing set
evaluator = MulticlassClassificationEvaluator(labelCol='target', predictionCol='prediction', metricName='accuracy')
accuracy = evaluator.evaluate(predictions)

# Print the accuracy of the model
print('Accuracy:', accuracy)

Dask:

Dask is a distributed computing framework for parallel computing in Python. Dask provides functions for data manipulation and transformation on large datasets. Dask’s DataFrame API provides a high-level interface for data transformation.

Dask provides functions for handling missing data, filtering, and aggregating data. Dask also provides functions for joining and merging data, which is essential in data transformation. Dask’s machine learning library, Dask-ML, also provides functions for data transformation for machine learning applications.
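
As a rough illustration, here is a minimal Dask sketch; it assumes the dask[dataframe] extra is installed, and the file pattern and column names are hypothetical:

import dask.dataframe as dd

# Lazily read a set of CSV files as one partitioned DataFrame
df = dd.read_csv('events-*.csv')

# Drop rows with missing values and aggregate revenue per customer
clean = df.dropna()
revenue = clean.groupby('customer_id')['amount'].sum()

# Nothing has executed yet; compute() triggers the parallel run
print(revenue.compute().head())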

Connect Data

In addition to providing libraries for data manipulation and analysis, Python also provides libraries for connecting to various data sources for data extraction. Here are some of the popular data sources that Python libraries can connect to:

Databases:

Python provides libraries such as SQLAlchemy and PyMySQL for connecting to popular databases such as MySQL, PostgreSQL, and Oracle. These libraries enable the execution of SQL queries from Python and the extraction of data from databases.

SQLAlchemy provides a unified API for connecting to various databases and executing SQL queries, and it also includes an Object-Relational Mapping (ORM) framework for working with databases. PyMySQL, on the other hand, is a lightweight library for connecting to MySQL databases and executing SQL queries.
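
For illustration, a minimal extraction sketch with SQLAlchemy might look like the following; the connection URL and the orders table are assumptions, not part of any particular dataset:

import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical PostgreSQL connection string
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/shop')

# Run a raw SQL query through the engine
with engine.connect() as conn:
    rows = conn.execute(text('SELECT id, total FROM orders WHERE total > 100')).fetchall()

# Or pull a query result straight into a pandas DataFrame
orders = pd.read_sql('SELECT id, total FROM orders', engine)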

APIs:

Python provides libraries for connecting to various APIs such as Twitter, Facebook, and Google. These libraries enable the extraction of data from APIs and the integration of data into ETL workflows.

Some of the popular libraries for connecting to APIs in Python are Requests, Tweepy, and Google API Client. The Requests library provides a simple and efficient way to make HTTP requests to APIs. Tweepy is a library for extracting data from Twitter, while the Google API Client library provides a unified API for connecting to various Google APIs such as Google Sheets and Google Analytics.
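
For example, a minimal Requests-based extraction might look like the sketch below; the endpoint, token, and parameters are placeholders:

import requests

response = requests.get(
    'https://api.example.com/v1/users',   # hypothetical REST endpoint
    headers={'Authorization': 'Bearer <token>'},
    params={'page': 1},
    timeout=10,
)
response.raise_for_status()   # fail fast on HTTP errors
users = response.json()       # parsed JSON payload, ready for transformation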

Web Scraping:

Python provides libraries such as BeautifulSoup and Scrapy for web scraping. Web scraping involves extracting data from websites by parsing HTML and XML documents.

The BeautifulSoup library provides functions for parsing HTML and XML documents and extracting data from them. Scrapy, on the other hand, is a framework for web scraping that provides features such as URL management, spidering, and data extraction.
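
A minimal scraping sketch with Requests and Beautiful Soup is shown below; the URL and CSS classes are hypothetical:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com/products', timeout=10).text
soup = BeautifulSoup(html, 'html.parser')

# Extract product names and prices from the parsed HTML
products = [
    {
        'name': item.select_one('.product-name').get_text(strip=True),
        'price': item.select_one('.product-price').get_text(strip=True),
    }
    for item in soup.select('.product')
]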

Load Data:

The libraries above cover most of the extraction and transformation work in Python ETL operations. Python also provides several libraries and frameworks for loading transformed data into a target database or data warehouse for further analysis. Here are some of the popular libraries used for loading data in Python:

SQLAlchemy:

SQLAlchemy is a powerful library for working with databases in Python. Its ORM framework provides a high-level interface for working with databases, and the library provides functions for creating database connections, executing SQL queries, and loading data into databases.

Transformed data held in a pandas DataFrame can be loaded into a database through an SQLAlchemy engine: pandas' to_sql() function writes the DataFrame out to a table, using the engine as its connection.
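
A minimal loading sketch along these lines, with a hypothetical connection URL and table name:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/warehouse')

transformed = pd.DataFrame({'customer_id': [1, 2], 'revenue': [120.0, 75.5]})

# Append the rows to the target table, creating it if it does not exist
transformed.to_sql('customer_revenue', engine, if_exists='append', index=False)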

Psycopg2:

Psycopg2 is a library for connecting to PostgreSQL databases in Python. Psycopg2 provides functions for creating database connections, executing SQL queries, and loading data into databases.

Psycopg2’s copy_from() function allows users to load large datasets into a PostgreSQL database efficiently. The copy_from() function reads data from a file and writes it directly to the database without the need for intermediary data structures.
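
For illustration, a bulk load with copy_from() might look like the following sketch; the connection details, file, and table are hypothetical, and the file's columns are assumed to match the table:

import psycopg2

conn = psycopg2.connect(dbname='warehouse', user='etl', password='secret', host='localhost')

with conn, conn.cursor() as cur, open('sales_clean.csv') as f:
    # Stream the file directly into the 'sales' table
    cur.copy_from(f, 'sales', sep=',', columns=('date', 'units_sold', 'price'))

conn.close()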

PyMySQL:

PyMySQL is a lightweight library for connecting to MySQL databases in Python. PyMySQL provides functions for creating database connections, executing SQL queries, and loading data into databases.

PyMySQL's executemany() function allows users to load large datasets into a MySQL database efficiently. It takes a SQL query with placeholders and a list of tuples, and executes the query once for each tuple in the list.
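
A minimal executemany() sketch, with hypothetical connection details and table:

import pymysql

conn = pymysql.connect(host='localhost', user='etl', password='secret', database='warehouse')

rows = [('2023-01-01', 10, 9.99), ('2023-01-02', 4, 19.99)]

with conn.cursor() as cur:
    # The query runs once per tuple in the list
    cur.executemany(
        'INSERT INTO sales (date, units_sold, price) VALUES (%s, %s, %s)',
        rows,
    )
conn.commit()
conn.close()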

SQLAlchemy ORM:

SQLAlchemy ORM is a high-level interface for working with databases in Python. SQLAlchemy ORM provides functions for creating database connections, executing SQL queries, and loading data into databases.

SQLAlchemy ORM’s bulk_insert_mappings() function allows users to load large datasets into a database efficiently. The bulk_insert_mappings() function takes a list of dictionaries, where each dictionary represents a row of data to be inserted into the database.
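
A minimal sketch of bulk_insert_mappings() with a hypothetical model and a local SQLite database (SQLAlchemy 1.4+ style):

from sqlalchemy import create_engine, Column, Integer, Float
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Sale(Base):
    __tablename__ = 'sales'
    id = Column(Integer, primary_key=True)
    units_sold = Column(Integer)
    price = Column(Float)

engine = create_engine('sqlite:///sales.db')
Base.metadata.create_all(engine)

rows = [{'units_sold': 10, 'price': 9.99}, {'units_sold': 4, 'price': 19.99}]

with Session(engine) as session:
    # Each dictionary maps column names to values for one row
    session.bulk_insert_mappings(Sale, rows)
    session.commit()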

Efficient Libraries

Python offers various libraries and open-source ETL tools that can help perform ETL operations efficiently. Here are some of the popular Python libraries and tools for ETL operations:

Try pandas:

Pandas is a popular Python library for data manipulation and analysis. It offers data structures for efficiently handling large datasets and tools for data cleaning and transformation.

Go with Apache Airflow:

Apache Airflow is an open-source platform for orchestrating ETL workflows. It allows you to define ETL workflows as DAGs (Directed Acyclic Graphs) and provides tools for monitoring and managing workflows.

Code in petl:

Petl is a Python library for ETL operations. It provides a simple API for performing common ETL tasks such as filtering, transforming, and loading data.

Play with Code of Bonobo:

Bonobo is a lightweight ETL framework for Python. It provides a simple API for defining ETL pipelines and can handle various data sources and targets.

Workflow Management

Workflow management is the process of designing, modifying, and monitoring workflow applications, which perform business tasks in sequence automatically. In the context of ETL, workflow management organizes engineering and maintenance activities, and workflow applications can also automate ETL tasks themselves. Two of the most popular workflow management tools are Airflow and Luigi.

Airflow

Apache Airflow uses directed acyclic graphs (DAG) to describe relationships between tasks. In a DAG, individual tasks have both dependencies and dependents — they are directed — but following any sequence never results in looping back or revisiting a previous task — they are not cyclic.

Airflow provides a command-line interface (CLI) for sophisticated task graph operations and a graphical user interface (GUI) for monitoring and visualizing workflows.
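
As a rough sketch, an Airflow 2.x DAG for a simple ETL job might look like this; the three task callables are placeholders for real extract, transform, and load logic:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print('extract step')    # placeholder

def transform():
    print('transform step')  # placeholder

def load():
    print('load step')       # placeholder

with DAG(
    dag_id='simple_etl',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    # The dependencies form a DAG: extract -> transform -> load
    extract_task >> transform_task >> load_task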

Luigi

Luigi's original developer, Spotify, used it to automate or simplify internal tasks such as generating weekly and recommended playlists. It is now built to support a variety of workflows. Prospective Luigi users should keep in mind that it isn't intended to scale beyond tens of thousands of scheduled jobs.
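
A minimal Luigi task chain sketch, with placeholder file targets and logic:

import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget('raw.csv')

    def run(self):
        with self.output().open('w') as f:
            f.write('a,b\n1,2\n')   # placeholder for a real extraction

class Transform(luigi.Task):
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget('clean.csv')

    def run(self):
        with self.input().open() as src, self.output().open('w') as dst:
            dst.write(src.read().upper())   # placeholder transformation

if __name__ == '__main__':
    luigi.build([Transform()], local_scheduler=True)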

Moving and Processing Data

Beyond overall workflow management and scheduling, Python can access libraries that extract, process, and transport data, such as pandas, Beautiful Soup, and Odo.

Move with pandas

Pandas is an accessible, convenient, and high-performance data manipulation and analysis library. It's useful for data wrangling, as well as general data work that intersects with other processes, from manual prototyping and sharing a machine learning algorithm within a research group to setting up automatic scripts that process data for a real-time interactive dashboard. pandas is often used alongside mathematical, scientific, and statistical libraries such as NumPy, SciPy, and scikit-learn.

Beautiful Soup

On the data extraction front, Beautiful Soup is a popular web scraping and parsing utility. It provides tools for parsing hierarchical markup formats found on the web, such as HTML and XML pages. Programmers can use Beautiful Soup to grab structured information from the messiest of websites and online applications.

Odo

Odo is a lightweight utility with a single, eponymous function that automatically migrates data between formats. Programmers can call odo(source, target) on native Python data structures or external file and framework formats, and the data is immediately converted and ready for use by other ETL code.
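
A minimal sketch of odo's single-function interface, with placeholder file and database names:

import pandas as pd
from odo import odo

# Convert a CSV file into a pandas DataFrame...
df = odo('accounts.csv', pd.DataFrame)

# ...or migrate the same data into a SQLite table in one call
odo('accounts.csv', 'sqlite:///etl.db::accounts')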

Self-contained ETL Toolkits

Finally, a whole class of Python libraries are actually complete, fully-featured ETL frameworks, including Bonobo, petl, and pygrametl.

At First Try Framework Bonobo

Bonobo is a lightweight framework that uses native Python features like functions and iterators to perform ETL tasks. These are linked together in DAGs and can be executed in parallel. Bonobo is designed for writing simple, atomic, but diverse transformations that are easy to test and monitor.
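
A minimal Bonobo graph sketch; the extract stage yields hard-coded rows in place of a real source, and the load stage simply prints:

import bonobo

def extract():
    yield {'name': 'alice', 'amount': '10'}
    yield {'name': 'bob', 'amount': None}

def transform(row):
    # Drop rows with missing amounts and normalize the type
    if row['amount'] is not None:
        yield {**row, 'amount': float(row['amount'])}

def load(row):
    print(row)   # stand-in for writing to a real target

graph = bonobo.Graph(extract, transform, load)

if __name__ == '__main__':
    bonobo.run(graph)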


Package petl

petl is a general-purpose ETL package designed for ease of use and convenience. Though it’s quick to pick up and get working, this package is not designed for large or memory-intensive data sets and pipelines. It’s more appropriate as a portable ETL toolkit for small, simple projects, or for prototyping and testing.
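
A minimal petl sketch, assuming a hypothetical orders.csv input file with a quantity column:

import petl as etl

table = etl.fromcsv('orders.csv')

# Convert the quantity column to int and keep only non-zero orders
converted = etl.convert(table, 'quantity', int)
cleaned = etl.select(converted, lambda row: row['quantity'] > 0)

etl.tocsv(cleaned, 'orders_clean.csv')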

ETL pygrametl

pygrametl also provides ETL functionality in Python code that's easy to integrate into other Python applications. pygrametl runs on both CPython and Jython, allowing programmers to work with other tools and providing flexibility in ETL performance and throughput.

Practices

As with any data processing operation, there are best practices that should be followed when performing ETL in Python. Here are some best practices to consider:

Define Clear Data Requirements:

Before starting any ETL operation, it is essential to define clear data requirements. Data requirements should include data sources, data formats, and data quality standards. This information will guide the data transformation and loading process.

Plan for Scalability:

ETL operations can quickly become complex and time-consuming as the data volume increases. Therefore, it is essential to plan for scalability from the start. Consider using distributed computing frameworks such as Apache Spark or Dask to handle large datasets efficiently.

Use Version Control:

Version control is critical in ETL operations, especially when working with a team. Use a version control system such as Git to track changes and collaborate with other team members.

Perform Data Validation:

Data validation is essential to ensure that the transformed data meets the required data quality standards. Use tools such as data profiling or data auditing to validate the data.
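
As one possible approach, a few rule-based checks can be written directly with pandas before loading; the column names and rules below are assumptions about the dataset, not a general-purpose validator:

import pandas as pd

def validate(df):
    """Return a list of validation errors found in the DataFrame."""
    errors = []
    if df['revenue'].isna().any():
        errors.append('revenue contains missing values')
    if (df['revenue'] < 0).any():
        errors.append('revenue contains negative values')
    if df['order_id'].duplicated().any():
        errors.append('order_id is not unique')
    return errors

df = pd.DataFrame({'order_id': [1, 2, 2], 'revenue': [10.0, -5.0, 3.0]})
problems = validate(df)
if problems:
    raise ValueError('data validation failed: ' + '; '.join(problems))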

Use Error Handling:

ETL operations can be prone to errors. Therefore, it is essential to use error-handling techniques such as logging and error reporting to identify and handle errors efficiently.

Document the ETL Process:

Documenting the ETL process is essential for future maintenance and troubleshooting. Document the data sources, data transformations, and data loading processes.

Test the ETL Process:

Testing is essential to ensure that the ETL process is working as expected. Use unit tests to test individual components of the ETL process, and integration tests to test the entire process end-to-end.

Following these best practices when performing ETL in Python can help ensure that the data is transformed and loaded efficiently and accurately. The key is to define clear data requirements, plan for scalability, use version control, perform data validation, use error handling, document the ETL process, and test the ETL process.

Monitor ETL Performance:

ETL operations can take a significant amount of time, especially when dealing with large datasets. Therefore, it is essential to monitor ETL performance to identify any bottlenecks or performance issues. Use tools such as APM (Application Performance Management) to monitor ETL performance in real time.

Use the Appropriate Data Types:

Using the appropriate data types is essential for data quality and efficiency. Ensure that the data types used in the target database or data warehouse match the data types of the transformed data.

Implement Data Lineage:

Data lineage is essential for tracking the origin of data and its transformation process. Implementing data lineage can help ensure data quality and compliance with data governance policies.

Optimize Data Processing:

Optimizing data processing can help reduce ETL processing time and improve efficiency. Consider using techniques such as data partitioning, data compression, and data caching to optimize data processing.

Use Cloud-based ETL Services:

Cloud-based ETL services such as AWS Glue, Azure Data Factory, or Google Cloud Dataflow can provide a scalable and cost-effective solution for ETL operations. These services offer pre-built connectors to various data sources and targets and can handle large datasets efficiently.

Alternative Programming Languages for ETL

Although Python is a viable choice for coding ETL tasks, developers do use other programming languages for data ingestion and loading.

Java

Java is one of the most popular programming languages, especially for building client-server web applications. Java has influenced other programming languages — including Python — and spawned several spinoffs, such as Scala. Java forms the backbone of a slew of big data tools, such as Hadoop and Spark. The Java ecosystem also features a collection of libraries comparable to Python’s.

Ruby

Ruby is a scripting language like Python that allows developers to build ETL pipelines, but few ETL-specific Ruby frameworks exist to simplify the task. However, several libraries are currently undergoing development, including projects like Kiba, Nokogiri, and Square’s ETL package.

Go

Go, or Golang, is a C-like programming language that is increasingly used for data analysis and big data applications. Go features several machine learning libraries, support for Google's TensorFlow, some data pipeline libraries, like Apache Beam, and a couple of ETL toolkits, Crunch and Pachyderm.

Flaky Tests as Sticking Points in Software Development


The Impact of Flaky Tests on Software Quality and the Ways to Reduce It

Flaky tests are automated tests that do not consistently pass or fail, even though the code under test is stable and has not changed. These tests are unpredictable: they pass or fail seemingly at random, despite being run under identical conditions.

Flaky tests are problematic because they can produce false positives (a test passes when it should fail) and false negatives (a test fails when it should pass). As a result, developers can lose significant time and effort.

To reduce test flakiness, developers can implement various measures, such as isolating the test environment, adding retry mechanisms, increasing wait times, and analyzing logs and metrics to detect the underlying problems.

What Does the Flaky Test Mean?

A flaky test is a test that produces unreliable and contradictory results, sometimes failing and sometimes passing. In other words, it fails or passes unexpectedly even though the code under test has not changed.

Tests become flaky for various reasons, including environmental problems, timing issues, race conditions, or problems in the test implementation itself. For example, a test that depends on an external resource that is not always accessible may pass or fail inconsistently, depending on the resource's availability.

Examples of Flaky Tests

Here are some typical situations, each of which can serve as a flaky test example:

Tests that depend on the network: if an automated test depends on a network connection, for example a test that checks data retrieved from an external API, it becomes flaky when the network connection is unstable or too slow.

Time-dependent tests, which rely on timing parameters such as timeouts or waiting periods, may become flaky if the system under test is slightly delayed. For example, a test that checks a web page's response time sometimes passes and sometimes fails, depending on server or network load.

Concurrency-related tests, which run simultaneously, can interact with each other and cause flaky behavior. For example, a test that writes to the same database table as another test can sometimes fail, depending on the order in which the tests run.

Environment-dependent tests, which rely on the availability of certain resources, can become flaky when the environment changes. A good flaky test example here is a test that checks for the presence of a file in the file system, which can fail if the file is removed by another process.
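
To make the time-dependent case concrete, here is a minimal sketch of a flaky test and a more stable, polling-based version; fetch_status() is a hypothetical call to an external service and is stubbed out here:

import time

def fetch_status():
    return 'ready'   # placeholder for a real call to an external service

# Flaky version: a fixed sleep assumes the service always answers within 2 seconds
def test_status_flaky():
    time.sleep(2)
    assert fetch_status() == 'ready'

# More stable version: poll with an overall timeout instead of a fixed delay
def test_status_with_polling():
    deadline = time.monotonic() + 30
    while time.monotonic() < deadline:
        if fetch_status() == 'ready':
            return
        time.sleep(0.5)   # short pause between retries
    raise AssertionError('service did not become ready within 30 seconds')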

Main Causes of Flaky Tests

There are different reasons for test flakiness, including environment issues (such as network connectivity or server performance), synchronization problems (such as race conditions or timeouts), and problems with the code itself (such as concurrency issues or incorrectly handled exceptions).

Here are some common reasons for flaky tests:

  • Timing problems
  • Race conditions
  • Environment problems
  • Problems with test implementation
  • Dependencies on external resources
  • Incomplete test coverage

How to Detect Flaky Tests?

Detecting flaky tests can be difficult because flakiness takes many forms: timing problems, race conditions, and unreliable test data all look different.

While detecting flaky tests, some common obstacles arise: flakiness may not show up immediately, results may be distorted by false positives or false negatives, the environment may be unreliable, or the test design itself may be flawed.

Here are Some Ways to Detect Flaky Tests:

Analyze test results and identify tests whose outcomes do not agree with each other. This helps reveal patterns that point to instability.

Monitor test runs to find inconsistent tests by executing the same test several times and comparing the results.

Record the execution time of each test and compare it with that test's average execution time. If a test takes considerably longer than average, this can indicate that the test is flaky.

Use code analysis tools, such as SonarQube or Code Climate, to detect potentially low-quality tests. These tools can flag code smells, test coverage gaps, and other signs of flakiness.

Run tests in parallel, which can help expose unstable tests. If a test fails inconsistently, running it alongside other tests helps reveal the cause of the failure.

Track test dependencies, since tests that depend on external resources or services can be unstable. Check the availability and consistency of these resources to make potentially flaky tests easier to detect.

Review the test code for potential causes of flakiness, including thread synchronization, use of sleep statements, and race conditions.

In general, detecting flaky tests requires a combination of monitoring, analysis, and code review. Finding and removing flakiness early improves testing accuracy and reduces the amount of flaky testing over time.

How to Fight Flaky Tests?

Fighting test flakiness requires a mix of techniques and process changes, which can be combined in practice. Here are some strategies that can work:

Flaky test identification and prioritization. Identify the unstable tests and prioritize them based on their impact on software quality and development time. Prioritizing failing tests helps developers focus on the most critical issues first.

Correction of flaky tests. Once developers have found unstable tests, they need to fix them and their root causes. This may include refactoring test code, correcting race conditions, improving synchronization mechanisms, and reducing dependencies on external resources.

Test automation. Automation reduces the probability of low-quality tests by ensuring more consistent and reliable results, and it helps reveal problematic tests more quickly and precisely.

Running tests in parallel. Launching tests in parallel helps reveal flaky tests by running them in different environments or at different times, exposing the environment and timing problems that cause flakiness.

Test isolation. Isolating unstable tests from the rest of the suite reduces their impact and prevents flaky tests from causing other tests to fail.

Test result monitoring. Track test results to detect low-quality tests and measure their frequency and their impact on software quality. This reveals trends and patterns that can point to flakiness.

Improving test coverage, which reduces the probability of low-quality tests by exercising the software more thoroughly and helps reveal and fix problems before they turn into flaky tests.

 

Here is a summary of the flaky test causes, their consequences, and the recommended remedies:

Problems with time

Result: Tests can pass or fail inconsistently because of timing factors such as network delay, input/output latency, and waiting times.

Remedy: Use explicit waits and retry mechanisms, run tests in parallel, and isolate them in virtual environments that provide a consistent testing setup.

Race conditions

Result: Tests pass or fail inconsistently when the ordering of parallel threads and processes is unpredictable.

Remedy: Use synchronization mechanisms such as locks, semaphores, and barriers to manage access to shared resources.

Problems with environment

Result: Tests can pass or fail inconsistently because the testing environment differs from the production environment, for example due to different dependency versions or hardware configuration.

Remedy: Use mock objects and stubs to isolate tests from external dependencies, run tests in virtual environments or containers that provide a consistent setup, and track differences from the production environment that can cause flakiness.

Problems in test implementation

Result: Tests can pass or fail inconsistently because of problems in the test code itself, such as incorrect assertions or improperly cleaned-up test data that depends on execution order.

Remedy: Refactor tests to improve code quality, reliability, and accuracy; ensure proper cleanup of test data; and make tests independent of the order in which they run.

Dependence on external resources

Result: Inconsistent passes or failures can be caused by dependence on external resources such as databases, third-party services, or APIs.

Remedy: Use stubs, mocks, and test doubles to imitate external resources, reduce dependencies on them, or switch to alternatives such as an in-memory database.

Incomplete test coverage

Result: Inconsistent passes or failures can occur because the test suite does not cover all execution paths and edge cases.

Remedy: Improve test coverage so that all code paths and edge cases are exercised, and use mutation testing or similar techniques to detect gaps in coverage.

It is worth noting that none of these remedies works perfectly in every situation; the best approach depends on the specific test and the nature of the failure. In addition, prioritizing and fixing flaky tests is crucial, since they directly affect software quality and development time.

Conclusion

In general, flaky tests can seriously harm software quality, development time, and release cycles, so it is important to find and remove them as soon as possible. Fighting flaky tests requires a proactive approach: detecting, prioritizing, and solving problems quickly and effectively. By applying the remediation strategies above, developers can increase the reliability and efficiency of their testing efforts and create better software.

What is DevSecOps Pipeline and Why It’s Important?


The Importance of Integrating DevSecOps Pipeline into the DevOps Workflow

You might know that DevSecOps is all about security measures in the software development process, but how should it look in practice? How can you use it to create a secure CI/CD pipeline? What are the main DevSecOps phases? What is the definition of DevSecOps? And which tools should you use in a typical DevSecOps pipeline? In this article, we will answer these questions and discuss the secure pipeline in detail, including its benefits, components, and best practices for implementing it. We will also provide case studies of companies that have successfully implemented the DevSecOps pipeline.

Cyber Threat Scenario

Let's say you own a hypothetical software development firm. You're proud of your team of talented developers using cutting-edge technologies and generating innovative ideas. One day, however, disaster struck. The company fell victim to a cyber attack, and its systems were breached by a group of hackers. The hackers stole sensitive customer information, including personal and financial data, and brought the entire company to its knees.

In the aftermath of the attack, the company struggled to recover. Your reputation was tarnished, and your customers lost faith in your ability to protect their information. The company’s finances took a hit, and many employees were left without jobs.

This scenario is not that outlandish. Around 60% of small businesses go under within six months of being hacked. That's why it's crucial to learn this valuable lesson and start working on a comprehensive security plan to protect your systems, your data, and your customers.

Start educating your employees on the importance of security. Implement policies and procedures to ensure that all employees follow best practices when it comes to protecting sensitive data. Invest in cutting-edge security technologies, including firewalls, intrusion detection systems, and advanced encryption methods.

Over time, your efforts will pay off. When your systems are secure, your customers will be confident that their information is safe. Implementing security measures is crucial for any business that deals with sensitive information. Cyber attacks are a real threat, and they can have devastating consequences. But with the right approach and the best security measures in place, you can protect your data and your customers.

What is DevSecOps Pipeline?

If you’re unfamiliar with the concept, it may seem too complex and even scary at first. Cynics may even say it’s the perfect way to add even more complexity to your already complex tech processes. Who needs simplicity and efficiency when you can have a pipeline that’s so convoluted, it requires a whole new set of skills just to navigate it?

Just think about all the extra steps you get to take to make sure your code is secure. Plus, who doesn’t love waiting for those security scans to finish before moving on to the next step? It’s like a fun game of “will this pass or fail?” every time!

And let's not forget the joys of collaboration between developers, security experts, and operations professionals. Clear communication? Never heard of it. Instead, you have misunderstandings at every turn. It may look like a game of telephone, but with your codebase.

If you think the DevSecOps pipeline is the ultimate solution for anyone who loves to add extra layers of complexity and chaos to their tech processes, a never-ending game of whack-a-mole where the moles are your code vulnerabilities and they keep popping up no matter how many times you hit them, we suggest reading on. You may change your mind about the topic.

In reality, DevSecOps is the best way of integrating security measures into every step of the software development life cycle. The traditional approach to software development, which involved developing software in silos by different company departments or outsourcing contractors and then handing it off to the security team for testing, is no longer sufficient in today’s fast-paced, agile environment. Developers and operations teams need to work together to ensure that security is built into every step of the development process.

The Basics of DevSecOps Pipeline

DevSecOps pipeline is an approach to software development that integrates security into the DevOps workflow. It is based on the principle that security should be built into every phase of the development process, from planning and design to coding, testing, and deployment. Take a look at a typical DevSecOps pipeline diagram:

[Diagram: a typical DevSecOps pipeline]

Picture this: you’re the captain of a ship sailing the vast ocean of software development. You have a skilled crew of developers, operations personnel, and security experts all working together to ensure a smooth voyage. But what if we told you that you could make your journey even smoother with the power of DevSecOps?

With DevSecOps pipeline architecture, you can spot any lurking sea monsters early in the journey, when they’re still small and easy to handle. This means you can save time and money by avoiding any costly detours or battles with larger, more dangerous beasts later on.

Not only that, but with the magic of automation, your crew can focus on more important tasks, like charting your course and making sure your ship is running smoothly. This means you can cover more ground and reach your destination faster, all while keeping an eye out for any potential threats.

And let’s not forget about the benefits of better collaboration between your team members. With DevSecOps, everyone is working together towards a common goal, making it easier to communicate and share ideas. It’s like having a well-oiled machine, where everyone knows their role and works seamlessly together.

But perhaps the most important benefit of all is that security becomes a shared responsibility across the entire crew. No longer is it just the responsibility of a few security experts – everyone on board is responsible for ensuring the safety and security of your journey.

So what are you waiting for? Set sail with DevSecOps and discover the true potential of your software development journey.

Components of DevSecOps Pipeline

DevSecOps pipeline is made up of several components, each of which plays an important role in ensuring that security is built into the development process.

Source Code Management (SCM)

Source code management is one of the most crucial components of a DevSecOps pipeline. It involves the use of a version control system to manage changes to the code base. This allows developers to collaborate on code, track changes, and roll back to previous versions if necessary. The most common SCM tools are:

  • Git
  • SVN
  • Mercurial

Continuous Integration (CI)

Continuous integration is a process in which developers integrate code changes into a central repository on a regular basis. The code is then automatically built and tested, and any issues are identified and resolved immediately before they cause more serious problems. This process ensures that the code is always in a working state and that any issues are identified and addressed early in the development process. Popular CI tools include:

  • Jenkins
  • CircleCI
  • Travis CI

Continuous Deployment (CD)

Continuous Deployment, or CD for short, allows you to effortlessly deploy your code changes to production. The process automates the way code moves through build, testing, and deployment, with minimal to no human intervention.

With CD, you can say goodbye to the days of manual deployments and the risk of human error that comes with them. Instead, you can rest assured knowing that the process is fully automated and any issues are identified and addressed before the software is released to production.

It’s like having a personal assistant who takes care of all the mundane tasks so you can focus on the more important things. CD tools work tirelessly in the background, ensuring that your code is always in a releasable state and that deployments are consistent and repeatable.

CD is not just a tool, it’s a mindset. It’s about embracing a culture of continuous improvement and constant feedback. With CD, you can deliver software faster, with higher quality, and with less risk. It’s a game-changer that can transform the way you develop and deploy software.

Some popular CD tools include:

  • Octopus Deploy
  • Argo CD
  • Harness

Security Testing

Just like a secret undercover operation, security testing is the stealthy and strategic process of identifying any hidden vulnerabilities that could pose a threat to the software’s security. It’s a tireless part of the DevSecOps pipeline constantly scanning the code for any potential breaches.

Security testing can be conducted using different techniques, from the classic method of manual testing to the modern approach of automated testing. Just as a master thief would carefully analyze every aspect of a building’s security system, static code analysis, dynamic application security testing (DAST), and software composition analysis (SCA) are the tools that security testers use to assess the software’s defenses.

These techniques help testers identify potential security flaws such as cross-site scripting, SQL injection, and buffer overflows. Like a skilled detective, security testing allows developers to anticipate and thwart any malicious intent before it becomes a serious threat.

In the end, the software emerges as a fortified fortress, ready to stand up to any attacks. Security testing relies on precision to ensure that the software is well-equipped to handle any security issues that may arise.

For this purpose, you can use such tools as:

  • SonarQube
  • Veracode
  • Checkmarx

Infrastructure as Code (IaC)

Infrastructure as Code is the practice of managing infrastructure using code, rather than manual processes. This involves creating scripts or configuration files that define the desired state of the infrastructure, which can be version-controlled and automated. Popular IaC tools include:

  • Terraform
  • Ansible
  • Chef

Containerization

Containerization involves packaging an application and its dependencies into a lightweight, portable container. Containers can be easily deployed across different environments, making it easier to scale applications and maintain consistency. Docker is the most widely-used containerization tool.

Monitoring and Logging

Monitoring and logging are essential for detecting and diagnosing issues in production environments. To monitor and analyze application and system metrics, logs, and alerts, you can use such tools as:

  • Prometheus
  • Grafana
  • ELK Stack

Security in Design

Security in design is the practice of designing software with security in mind from the beginning. This involves identifying potential security risks and designing the software to mitigate those risks. For example, suppose the software requires users to enter sensitive information, such as credit card or social security numbers. In that case, the software should be designed to encrypt that information to protect it from hackers.

Implementing DevSecOps Pipeline

Challenges

If you want to boost your organization’s efficiency, security, and collaboration, then you need to get on board with DevSecOps pipelines. These practices can help you deliver lightning-fast software, rock-solid security, and seamless teamwork between development and operations teams. But let’s not kid ourselves, there will be some challenges to overcome along the way. Don’t worry though, with a little bit of grit and determination, you can conquer any obstacle that comes your way.

1. Cultural Shift

You need to be prepared for a major cultural shift. The truth is, DevSecOps demands a continuous and collaborative approach to software development and deployment, which can be a major challenge for organizations that are used to working in silos. But this is one of those challenges that can be and should be overcome. To make DevSecOps work, you need to break down those barriers and create a culture of collaboration and shared responsibility. When everyone is on the same page and working towards the same goal, amazing things can happen. So don’t let a little cultural shift hold you back from realizing the full potential of DevSecOps.

2. Tooling Integration

Another challenge of implementing DevSecOps is integrating the various tools and technologies required to support the software development pipelines. This can include integrating source code management, continuous integration/continuous delivery (CI/CD) tools, security testing tools, and infrastructure-as-code (IaC) tools. Ensuring that these tools are properly integrated and working together can be complex and time-consuming.

3. Security Skills Shortage

DevSecOps requires a strong focus on security, which can be challenging for organizations that lack security expertise. This can lead to a skills shortage, with a lack of qualified security professionals available to oversee the security aspects of the pipeline. To address this challenge, organizations may need to invest in training and education for their existing staff, or consider partnering with external security experts.

4. Compliance and Regulation

Many industries are subject to strict regulatory and compliance requirements, which can make implementing DevSecOps more challenging. Compliance requirements may include ensuring data privacy, maintaining audit trails, and demonstrating compliance with industry standards. Organizations need to ensure that their DevSecOps pipeline meets these requirements, which can add additional complexity and cost.

5. Legacy Systems and Applications

Finally, legacy systems and applications can pose a challenge to implementing DevSecOps. Legacy systems may be difficult to integrate with modern tools and technologies, or may not be designed to support a continuous delivery approach. This can make it challenging to fully automate the pipeline and achieve the desired benefits of DevSecOps.

Organizations need to address these challenges in order to successfully implement DevSecOps, including fostering a collaborative culture, integrating tools and technologies, addressing security skills shortages, complying with regulations, and managing legacy systems and applications.

How to Improve Collaboration in Your Team?

If you want everyone in your organization to cooperate better, just force them to work on group projects, even if they hate each other’s guts. It doesn’t matter if the project is completely irrelevant to their job description or if they have no interest in it whatsoever. Just make them do it. That’ll bring them closer together.

And don’t forget the team-building exercises. Because nothing screams “collaboration” like playing trust games with your coworkers. Blindfolded, you must trust that your colleague will catch you before you hit the ground. And if they don’t? Well, it’s all good fun, right?

Also, you may consider the open office concept. Why have walls and doors when you can have a shared workspace where everyone can see and hear each other? And those who want to escape the constant distractions and interruptions can just wear headphones and listen to soothing music.

If all else fails, just force your employees to socialize outside of work. Schedule mandatory after-work drinks and make sure everyone attends. Because if they don’t want to be friends with their coworkers, they’re obviously not team players.

In conclusion, if you want to create a collaborative culture in your organization, just force it upon your employees. They’ll thank you for it…eventually.

Best practices to follow to ensure success

There are some DevSecOps steps that organizations can take to ensure success.

1. Automate Everything

In all phases of the DevSecOps pipeline, automation is an absolute game-changer. By automating key processes like development, testing, and deployment, organizations can slash the time and money it takes to develop software, while also ensuring that security is baked into every single step of the process. It's a total win-win situation that you don't want to miss out on. So if you're ready to streamline your development process and level up your security game, automation is the way to go.

2. Create a Culture of Collaboration

DevSecOps pipeline requires collaboration between development, operations, and security teams. To create a culture of collaboration, organizations should:

  • Foster open communication between teams
  • Encourage cross-functional teams
  • Provide training and resources to help teams understand each other’s roles and responsibilities
  • Reward teams for working together to achieve common goals

3. Implement Security Testing Early and Often

To ensure that security is built into every step of the development process, organizations should implement security testing early and often. This includes using DevSecOps pipeline tools such as static code analysis, dynamic application security testing (DAST), and software composition analysis (SCA) to identify and mitigate security vulnerabilities.

4. Use Secure Coding Practices

Secure coding practices are essential for building secure software. Developers should be trained in secure coding practices and should follow coding standards such as OWASP Top 10 and CWE/SANS Top 25.

Case Studies

Many well-established companies have successfully implemented the DevSecOps CI/CD pipeline in their operations. Here are some of the most prominent DevSecOps pipeline examples:

Netflix

Netflix is a streaming service that uses DevSecOps pipeline to ensure that its software is secure and reliable. The company has a team of security experts who work closely with developers and operations teams to identify and mitigate security vulnerabilities. Netflix uses tools such as static code analysis, DAST, and SCA to automate security testing and ensure that security is built into every step of the development process.

Capital One

Capital One is a financial services company that has implemented DevSecOps pipeline to ensure the security of its software. The company uses automation tools to speed up the development process and ensure that security is a priority at every step of the way. Capital One also employs a security team that works in cooperation with developers and operations teams to identify and mitigate security vulnerabilities.

Aim at the Future

As the world advances, so too does the art of software development. The future is a canvas yet to be painted, a world yet to be explored.

Software development has come a long way since its inception, and the future promises even more innovation. Imagine a world where software not only understands what you want but anticipates your needs before you even know them. Where machines work in tandem with humans to create software that is not just functional, but intuitive and immersive.

The future of software development is not just about writing lines of code, but about creating experiences that transform the way we interact with technology. It’s about understanding the nuances of human behavior and incorporating that into software design. It’s about creating software that is accessible to all, regardless of ability or language.

Artificial intelligence and machine learning will play a critical role in the future of software development. With the ability to analyze vast amounts of data, machines will be able to identify patterns and trends that humans may miss, leading to faster and more efficient software development.

In the future, software development will also be more decentralized and collaborative. Teams will work together, sharing code and ideas in real-time, regardless of their location. The rise of open-source software will only accelerate this trend, leading to a more transparent and inclusive development process.

As we move forward, the future of software development is limited only by our imagination. The possibilities are endless, and the potential for innovation is limitless. Let us embrace this future, and create software that not only solves problems but inspires and delights us in ways we never thought possible.

Conclusion

DevSecOps pipeline is a groundbreaking methodology for software development that fuses security into the heart of the DevOps workflow. This forward-thinking approach allows organizations to identify and eliminate security vulnerabilities at the earliest stages of development, saving valuable time and resources.

By incorporating security into every facet of the development process, teams can reduce the need for costly security testing later on and establish a culture of collaboration. This ensures that security is a shared responsibility across the organization, fostering a sense of teamwork and cooperation.

To put the DevSecOps pipeline into practice, organizations should prioritize automation, cultivate a culture of collaboration, and implement security testing from the outset. Combined with secure coding practices, this lets organizations build top-tier software that meets the demands of both their clients and stakeholders.

Adopting the best practices of DevSecOps pipeline is the key to unlocking the full potential of software development, ensuring a streamlined, secure, and high-quality process. With this groundbreaking methodology, organizations can stay ahead of the curve and deliver exceptional software solutions.

Bias in AI Problem In Life and Technology

Bias in AI

What is Bias in AI and How to Avoid It?

Whenever algorithms weigh things, events, or people in different ways for different goals, they cannot be fully neutral. To develop impartial artificial intelligence systems, we first need to understand how algorithms become biased. The goal of this article is to explain what AI bias is, describe its types, give bias in AI examples, and show how to mitigate the risks associated with them.

First, let us define what AI Bias is.

What Are Bias Algorithms and Why Are They Important?

Algorithmic bias refers to repeatable, systematic errors in a computer system that produce unfair outcomes, such as privileging one arbitrary group of users over others.

Two broad types of bias in AI exist. The first is algorithmic bias, where an AI model is trained on a biased set of data. The second is societal bias in AI, where our social norms and assumptions create blind spots or fixed expectations in our thinking.

For instance, even a fair credit-scoring algorithm can deny you a loan if it consistently weighs the relevant financial indicators.

Why are bias algorithms so significant?

The explanation is simple: people write the algorithms, select the data those algorithms use, and decide how the algorithms’ outcomes are applied. Without careful and thorough review, people can embed subtle, unconscious biases, which AI can then automate and perpetuate.

Bias in Machine Learning Applications

Machine learning bias, sometimes simply called bias in AI, occurs when an algorithm systematically produces biased outcomes because of erroneous assumptions in the machine learning process.

The following types of AI bias are the most common:

Algorithmic bias

This occurs when there is a problem within the algorithm that performs the calculations powering the machine learning model.

Sample bias

This occurs when there is a problem with the data used to train the machine learning model: the data set is either too small or not representative enough to teach the system. For instance, if a system is trained on data that includes only female teachers, it will conclude that all teachers are women.
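A quick, hedged way to spot this kind of sample bias before training is to check how groups are represented in the training data; the column name and the 95/5 split below are made up purely for illustration.

import pandas as pd

# Hypothetical training set with a "gender" column, used only to
# illustrate checking group representation before training.
training_data = pd.DataFrame({"gender": ["female"] * 95 + ["male"] * 5})

counts = training_data["gender"].value_counts(normalize=True)
print(counts)  # a 95/5 split is a warning sign of sample bias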

Prejudicial bias

Here, the data used to train the system reflects actual prejudices, stereotypes, or faulty social assumptions, which carries those real-world biases into the machine learning model. For instance, using medical staff data that includes only female nurses and male doctors perpetuates an outdated stereotype of medical employees inside the system.

Measurement bias

As the name suggests, this bias in AI arises when data is measured or evaluated inaccurately. For example, a system meant to assess workplace satisfaction that is trained on photos of happy employees can be biased if those employees already knew the photos would be used to measure their happiness. Likewise, a model trained to estimate a quantity will be biased if the measurements in its training data were systematically rounded or skewed.

Exclusion bias

This occurs when an important data point is left out of the data set being used, typically because the developers fail to recognize it as significant.

The Most Common Bias in AI Examples

Bias is a belief that is not based on known facts about a person or a particular group of people. For example, there is a common belief that women are weak, even though many women around the world are renowned for their strength. Another belief holds that all Black people are dishonest, when in fact most of them are honest.

As noted above, bias algorithms produce repeatable, systematic mistakes that lead to unfair results. For instance, a loan-ranking algorithm can refuse to issue credit and still be fair if it consistently weighs the relevant financial indicators. But if the algorithm grants credit to one group of customers while refusing nearly identical customers from another group based on unrelated criteria, and this behavior repeats, we can call it AI algorithm bias. The bias can be intended or unintended; it may, for example, come from biased records produced by an employee whose job the algorithm is now taking over.

Consider a face recognition algorithm that can more easily be trained to detect a white person than a person with darker skin, because that type of data is used far more often in training. Minority groups suffer from this, because discrimination denies them equal opportunities and the resulting harm can persist indefinitely. The problem is that these biases are unintended and hard to discover until they are already programmed into the software.

Here are some common Bias in AI examples we can face in real life:

Racism in the medical system of the USA

Technology should help reduce health inequality, not make it worse for populations already fighting entrenched prejudice. Artificial intelligence systems trained on unrepresentative health data usually perform poorly for under-represented population groups.

In 2019, researchers in the USA discovered that an algorithm used in American hospitals to predict which patients would need medical care heavily favored white patients over Black ones. Because past healthcare spending was used as an indicator of a person’s medical needs, the algorithm relied on patients’ historical health expenses.

That figure is strongly correlated with race: Black patients with the same conditions spend less on medical care than white patients with the same problems. The researchers worked with the health services provider Optum to reduce the system’s bias by 80%. Had no one questioned the artificial intelligence, its built-in assumptions would have kept discriminating against Black people.

The assumption that CEOs can only be men

Women hold 27% of CEO positions. Yet, according to 2015 reports, only 11% of the people appearing in a Google image search for “CEO” were women. Later, Carnegie Mellon University conducted its own independent study and concluded that Google’s online advertising showed high-income positions to men far more often than to women.

Google responded by pointing out that advertisers can specify which people and web portals the search engine should show their advertising to, and gender is one of the attributes companies can set.

Nevertheless, there is an assumption that Google’s algorithm could have determined on its own that men are more suitable for leadership positions at companies. Researchers believe it could have learned this from user behavior: if, for example, men are the only people who see and click on ads for high-income vacancies, the algorithm will learn to show those ads only to men.

AI bias in Amazon’s hiring algorithm

Automation played a key role in Amazon’s dominance over other e-commerce companies. According to people who worked with the company, it used artificial intelligence in hiring to assign one-to-five-star ratings to job seekers, much like customers rate products on the Amazon platform. When the company noticed that its new system could not assess applicants for software developer and other technical positions in a gender-neutral way, largely because it was biased against women, it set out to adjust the tool and create an unbiased ranking system.

An analysis of Amazon’s computer models showed they had learned patterns from the applications the company received, most of which came from men, reflecting male dominance across the industry. The algorithm effectively concluded that male candidates were preferable: it penalized CVs indicating that an applicant was a woman and downgraded applicants who had attended either of two women’s educational establishments.

Amazon later changed the software to make it neutral with respect to these particular terms, but that does not prevent other AI biases from emerging as it works. Recruiters looked at the tool’s recommendations when searching for new staff but never relied fully on its ratings. After Amazon’s leadership lost faith in the initiative, the project was shut down in 2017.

How Can AI Bias Be Prevented?

Based on the issues described above, here are some ideas for preventing bias algorithms from creeping into our life and work.

Test machine learning algorithms the way they will be used in real life

Take job candidates as an example. An AI-based decision may not be trustworthy if your training data was supplied by one particular group of candidates. That may not be a problem while you apply the model to similar applicants, but issues arise when you apply it to a group of candidates your data set never included. In that case you are effectively asking the algorithm to apply what it learned about previous applicants to people it knows nothing about, which rests on a wrong assumption.

To prevent this kind of artificial intelligence bias, you need to test the algorithm in the same way you would use it in real-life practice.
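A minimal, hedged sketch of such a test, assuming you already have the model’s decisions and a sensitive attribute in a pandas DataFrame (the column names and values below are hypothetical):

import pandas as pd

# Hypothetical screening results: 1 means the candidate was shortlisted.
results = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B", "A"],
    "selected": [1,   1,   0,   0,   0,   1,   0,   1],
})

# Selection rate per group; a large gap suggests the model should be
# re-examined before it is applied to a new population of candidates.
rates = results.groupby("group")["selected"].mean()
print(rates)
print("ratio (disparate impact):", rates.min() / rates.max())

A gap between group selection rates does not by itself prove the model is biased, but it is a strong signal to investigate before deploying it on a population that differs from the training data.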

Account for fairness when preventing bias in AI

Moreover, we should understand that the term “fairness”, as well as the way it is measured, has to be discussed. It can change under the influence of external factors, which means the AI should account for such changes as well.

Researchers have already created many methods for making artificial intelligence systems meet fairness requirements, such as pre-processing the training data, adjusting the system’s decisions after the fact, or building a fairness constraint into the training procedure itself. Counterfactual fairness is one such method: it guarantees that the model’s decision would remain the same in a counterfactual world where sensitive attributes, such as gender, race, or sexual orientation, were different.
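Counterfactual fairness requires a causal model of the data, which is beyond a short snippet, but a simpler and widely used pre-processing idea is reweighing: give each (group, label) combination a sample weight so that group membership and outcome become statistically independent in the training set. The sketch below is a minimal illustration with hypothetical columns, not a definitive implementation.

import pandas as pd

# Hypothetical training data with a sensitive attribute and a label.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "A", "B", "B"],
    "label": [1,   0,   1,   0,   0,   1,   1,   0],
})

# Reweighing (a pre-processing method): weight each (group, label) cell by
# P(group) * P(label) / P(group, label), so group and outcome decouple.
p_group = df["group"].value_counts(normalize=True)
p_label = df["label"].value_counts(normalize=True)
p_joint = df.groupby(["group", "label"]).size() / len(df)

df["sample_weight"] = df.apply(
    lambda r: p_group[r["group"]] * p_label[r["label"]]
    / p_joint[(r["group"], r["label"])],
    axis=1,
)

Many estimators can then take these values through a sample_weight argument to their fit() method.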

Consider a human-in-the-loop system

The purpose of a human-in-the-loop system is to achieve what neither a human nor a computer can do alone. When the computer is unable to solve a problem, people step in and find a solution instead of the machine. This process creates a continuous feedback loop.

That continuous feedback teaches the system and improves its performance with every subsequent run. Human participation in the loop therefore leads to more accurate handling of rare data sets and to greater safety and accuracy overall.

Creating an unbiased system by changing technical education

In his article in the New York Times on fighting bias in technology, Craig Smith expressed the opinion that we need to make serious changes in the way people are educated in the technological sciences. He argues for reforming technical education: today it is built around a supposedly objective point of view, and it needs to become more interdisciplinary, with its curricula revised.

He also declares that some important issues need to be considered and agreed upon globally, while other problems should be discussed at the local level. We must create regulations and rules, and engage authorities and specialists who support oversight of such algorithms and events. Collecting more diverse data is only a single criterion and will not, on its own, solve the artificial intelligence bias problem.

Conclusion

Bias in every sphere of our social, private, and professional lives is a serious issue. It is very hard to overcome by relying only on standard AI-based computation and default assumptions. Bias can cause algorithms to misinterpret the data they collect, which leads to wrong results and poor performance in science, manufacturing, medicine, education, and other fields. We need to fight bias by testing systems the way they will be used, designing fair systems, letting the right humans intervene in automated processing, and changing how we educate.

Power BI Python with Instances – What is the Best Way to Visualize Python Code?

Python with Power BI

Python With Power BI

When you need interactive data analysis, you can use Power BI by Microsoft. It is not only interactive but can also visualize your information for business intelligence, which is why it is called BI, or Business Intelligence. It is especially useful because you can use Python with it.

With Python you can extend Power BI’s capabilities: build Python dashboards for data analysis, data ingestion, data transformation, data augmentation, and data visualization, and even tap into advanced functionality such as machine learning libraries. All of this can be done across numerous data dashboards thanks to the combination of these two tools, Power BI and Python.

Both experts and newcomers can use this guide to learn how to combine Python and Power BI. We will also use the pandas and Matplotlib libraries.

pandas is an open-source library for working with relational and labeled data in a straightforward way. It provides data structures and functions for manipulating numerical tables and time series, from which interactive Python dashboards can be built. pandas is built on top of the NumPy library, which makes it efficient and high-performing.

Matplotlib is an excellent library for visualizing arrays, including the plots behind drag-and-drop Power BI dashboards. It is useful when you need to visualize large amounts of numerical data, and it offers many options for your plots, such as line charts, bar charts, and histograms.
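As a tiny, hedged taste of how the two libraries work together (the monthly figures below are made up for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# A made-up monthly revenue series, just to show the pandas-to-Matplotlib flow.
sales = pd.DataFrame(
    {"month": ["Jan", "Feb", "Mar", "Apr"], "revenue": [120, 90, 150, 170]}
)

sales.plot(x="month", y="revenue", kind="bar", legend=False)
plt.ylabel("Revenue")
plt.show()

This same pattern, a DataFrame feeding a Matplotlib plot, is what you will later reproduce inside Power BI’s Python visuals.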

So the answer to the question “When can you use Power BI with Python?” is: whenever you need something like a sales data dashboard built with Python.

The Start in Using Power BI

Logically enough, Power BI is created by Microsoft, so the only operating system it runs on is Windows. But do not worry: you can still use it on macOS, a Linux distribution, or any other operating system that supports virtual machines. The versions of Windows that support Microsoft Power BI are Windows 8.1 and newer. If you are not using Windows as your main operating system, plan on about 30 gigabytes of disk space for the virtual operating system.

Installation of Power BI Desktop

Here we will set up all the tools, and after that we can write code using Python. Microsoft Power BI Desktop is a powerful collection of tools and services that can be obtained free of charge, without a Microsoft account and with no need for an Internet connection. You can easily install it on your computer by accessing the Microsoft Store from the Start menu or its web-based storefront. Like a traditional office suite, it works offline, giving you the convenience of having all the tools and services you need in one place. Installing Power BI Desktop from the Microsoft Store also ensures automatic, quick updates to the most recent versions of the tool without having to be logged in as the system’s administrator.

If the usual method of installing Power BI Desktop doesn’t work for you, you can always try downloading the installer from the Microsoft Download Center and running it manually. This executable file is about 400MB in size. Once you have the application installed, launch it and you’ll be met with a welcome screen. At first, the Power BI Desktop user interface may seem intimidating, but don’t worry. You’ll become accustomed to the basics as you progress through the tutorial.

What Python Code Editor to Choose?

To maximize your experience, why not install Microsoft Visual Studio Code? This free and modern code editor is immensely popular and can be easily found in the Microsoft Store. If you already use an IDE such as PyCharm or don’t require any of the advanced editing capabilities, feel free to skip this step. Otherwise, Visual Studio Code is a great choice for any coding enthusiast.

Microsoft Power BI Desktop offers only basic code editing capabilities, which is understandable given its primary purpose as a data analysis tool. Unfortunately, it lacks advanced features such as intelligent contextual suggestions, auto-completion, and syntax highlighting for Python, which are essential for writing anything but the simplest Python scripts. Consequently, it is highly recommended that you use an external code editor for writing more complex Python scripts for Power BI.

You can download Visual Studio Code for any operating system, without virtual machines, from the Microsoft website. Installation is simple: the installer guides you through it.

VS Code is a cutting-edge code editor that brings seamless support for a wide array of programming languages via its extensions. It does not support Python out of the box the way PyCharm does, but it will offer to install extensions when you start working. And VS Code doesn’t stop at Python for Power BI: with its intuitive interface, it becomes a Python powerhouse once you open an existing Python file or create a new one, automatically recognizing the language and prompting you to install the recommended set of extensions designed specifically for Python programming.

But all this works only if you have already installed Python itself on your computer. You can search online for instructions on how to do this.

Does BI Desktop Need Some Libraries?

The answer is “Yes, it does”. To unleash the full potential of Power BI Desktop, make sure your Python setup is equipped with pandas and Matplotlib. These libraries are not part of the standard installation, but they are easy to obtain, and they come preinstalled if you use Anaconda. It’s worth noting that installing third-party packages into the global Python interpreter is discouraged, as it can pose potential risks, and running the system interpreter from Power BI on a Windows machine may not be possible due to permission restrictions. The solution? A Python virtual environment: a secure and efficient way to manage your Python packages and dependencies.

What is a Virtual Environment?

A virtual environment is an isolated folder: a directory containing a copy of the main Python interpreter, which lets you experiment to your heart’s content. You can install any additional libraries inside this space without any risk of disrupting other programs that rely on Python. And, whenever you wish, you can simply delete the folder holding your virtual environment without any adverse effect on the Python installation on your machine.

To create a new virtual space you need to use a Windows terminal and write the command:

python -m venv python-virtual

Here we named our folder “python-virtual” but you can use any you want.

In just a few moments, a fresh folder containing a copy of the Python interpreter will appear on your desktop. You can then activate the virtual environment by executing its activation script and install the two libraries that Power BI requires. To do so, type the following commands while your desktop remains your current working directory:

.\python-virtual\Scripts\activate
python -m pip install pandas matplotlib

Upon activation, you should be able to identify your virtual environment with the name “python-virtual” in the command prompt. Failure to do so would result in the installation of additional third-party packages into the primary Python interpreter, which is precisely what we aimed to avoid. Congratulations, you’re almost there! You can repeat the activation and pip installation steps if you wish to incorporate additional libraries into your virtual environment. Finally, the next step is to inform Power BI of the location of Python in your virtual environment.

Let’s Run Python on It

Firstly, we need to set special options in Power BI Desktop. Once you access the configuration options, you’ll find various settings organized by categories. Locate the category named “Python scripting” in the left-hand column, and proceed to set the Python home directory by selecting the “Browse” button.

To ensure proper functionality, it’s important to specify the path to the “Scripts” subfolder in your virtual environment, which contains the “python.exe” executable. If your virtual environment is located in your Desktop folder, your path should resemble the following format:

\Desktop\python-virtual\Scripts

The part of the path before “\Desktop” must be your own user folder. If the designated path is incorrect and does not contain a virtual environment, you will receive an appropriate error message.

Great job! You have now successfully connected Python to Power BI. One key setting to verify is the path to the virtual environment that Power BI will use: it should include both the pandas and Matplotlib libraries. With this setup complete, you’re ready to start exploring the capabilities of Python with Power BI in the next section.

So next let’s talk about running the code and how it works.

How Can You Operate With It?

There are several ways to run Python in Power BI, each of which integrates seamlessly with a data analyst’s regular workflow. One method involves using Python as a data source to import or create datasets within your report. Another involves using Python to perform data cleaning and other transformations on any dataset directly in Power BI. Additionally, Python’s advanced plotting libraries can be used to create engaging and informative data visualizations. This article will explore all three of these applications in detail.

Using pandas.DataFrame

If you need to ingest data into Power BI from a proprietary or legacy system, you can use a Python script to connect to that system and load the data into a pandas DataFrame. This is a useful approach when the data is stored in an obsolete or less commonly used file format that Power BI does not support natively.

To get started, you can write a Python script that connects to the legacy system and loads the data into a pandas DataFrame. Once the data is in a DataFrame, you can manipulate it using the pandas library to clean and transform the data as needed.

Power BI can then access the DataFrame by connecting to the Python script and retrieving the DataFrame. This allows you to leverage the power of both tools – the data manipulation capabilities of Python and the visualization and reporting capabilities of Power BI.

In this tutorial, we’ll use Python to load fake sales data from SQLite, which is a popular file-based database engine. While it is technically possible to load SQLite data directly into Power BI Desktop using an appropriate driver and connector, using Python can be more convenient since it supports SQLite out of the box.

Before jumping into the code, it would help to explore your dataset to get a feel for what you’ll be dealing with. It’s going to be a single table consisting of used car dealership data stored in the “sales.db” file.

Let’s imagine this file contains a table with a thousand records and many columns representing the goods sold, their buyers, and the dates of sale. Remember what we mentioned about Anaconda? Anaconda ships with Jupyter Notebook, which can serve as a code editor. You can quickly inspect this sample database by loading it into a pandas DataFrame and sampling a few records in a Jupyter Notebook with the following code:

import sqlite3
import pandas as pand

# Load the whole sales table into a DataFrame.
with sqlite3.connect(r"C:\Users\User\Desktop\sales.db") as connection:
    df = pand.read_sql_query("SELECT * FROM sales", connection)

# Preview fifteen random records.
df.sample(15)

Note that the path to the sales.db file may be different on your computer. If you can’t use Jupyter Notebook, then try installing a tool like SQLite Browser and loading the file into it.
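If you would rather generate a file to follow along with, the hedged sketch below creates a small, entirely fabricated sales.db whose column names mirror the ones used later in this tutorial; real dealership data would of course look different.

import sqlite3

# Fabricated sample rows; the column names mirror those used later on.
rows = [
    ("JH4DA1850HS004528", "Red",  "2022-01-05", 8000.0, 300.0, 9500.0, "2022-02-11", "Ann Smith <ann@example.com>"),
    ("1FTFW1ET1EKE57182", "Blue", "2022-01-20", 6500.0, 150.0, 7200.0, "2022-03-02", "Bob Jones"),
]

with sqlite3.connect(r"C:\Users\User\Desktop\sales.db") as connection:
    connection.execute(
        """CREATE TABLE IF NOT EXISTS sales (
               vin TEXT, color TEXT, purchase_date TEXT, purchase_price REAL,
               investment REAL, sale_price REAL, sale_date TEXT, customer TEXT
           )"""
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)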

At a glance, you can tell that the table needs some cleaning because of several problems with the underlying data. However, you’ll deal with most of them later, in the Power Query editor, during the data transformation phase. Right now, focus on loading the data into Power BI.

As long as you haven’t dismissed the welcome screen in Power BI yet, then you’ll be able to click the link labeled Get data with a cylinder icon on the left. Alternatively, you can click Get data from another source on the main view of your report, as none of the few shortcut icons include Python. Finally, if that doesn’t help, then use the menu at the top by selecting Home › Get data › More… as depicted below.

Doing so will reveal a pop-up window with a selection of Power BI connectors for several data sources, including a Python script, which you can find by typing Python into the search box.

Select it and click the Connect button at the bottom to confirm. Afterward, you’ll see a blank editor window for your Python script, where you can type a brief code snippet to load records into a pandas DataFrame.

You can notice the lack of syntax highlighting or intelligent code suggestions in the editor built into Power BI. As you learned earlier, it’s much better to use an external code editor, such as VS Code, to test that everything works as expected and only then paste your Python code to Power BI.

Before moving forward, you can double-check if Power BI uses the right virtual environment, with pandas and Matplotlib installed, by reading the text just below the editor.

While there’s only one table in the attached SQLite database, it’s currently kept in a denormalized form, making the associated data redundant and susceptible to all kinds of anomalies. Extracting separate entities, such as cars, sales, and customers, into individual DataFrames would be a good first step in the right direction to rectify the situation.

Fortunately, your Python script may produce as many DataFrames as you like, and Power BI will let you choose which ones to include in the final report. You can extract those three entities with pandas, using column subsetting, in the following way:

import sqlite3
import pandas as pand

with sqlite3.connect(r"C:\Users\User\Desktop\sales.db") as connection:
    df = pand.read_sql_query("SELECT * FROM sales", connection)

# Split the denormalized table into separate entities.
# The exact column subsets depend on your dataset's schema.
cars = df[
    [
        "color",
        "purchase_date",
        "purchase_price",
        "investment",
    ]
]
sales = df[["sale_price", "sale_date"]]
customers = df[["customer"]]

First, you connect to the SQLite database by specifying a suitable path to the sales.db file, which may look different on your computer. Next, you run a SQL query that selects all the rows in the sales table and puts them into a new pandas DataFrame called df. Finally, you create three additional DataFrames by cherry-picking specific columns. It’s customary to abbreviate pandas as pd, and you’ll often see the variable name df used for generic, short-lived DataFrames. As a general rule, though, choose meaningful and descriptive names for your variables to make the code more readable.

When you click OK and wait for a few seconds, Power BI will present you with a visual representation of the four DataFrames produced by your Python script.

The resulting table names correspond to your Python variables. When you click on one, you’ll see a quick preview of the contained data. The screenshot above shows the customers table, which comprises only two columns.

Select cars, customers, and sales in the hierarchical tree on the left while leaving off df, as you won’t need that one. You could finish the data import now by loading the selected DataFrames into your report. However, you’ll want to click a button labeled Transform Data to perform data cleaning using pandas in Power BI.

In the next section, you’ll learn how to use Python to clean, transform, and augment the data that you’ve been working with in Power BI.

Using Python Query Editor

If you have followed the instructions in this guide, you should now be in the Power Query Editor, which displays the three DataFrames you selected earlier. These DataFrames are referred to as queries in this particular view. However, if you have already imported data into your Power BI report without applying any transformations, there’s no need to worry! You can access the same editor at any time.

To do so, navigate to the Data perspective by clicking on the table icon located in the center of the ribbon on the left-hand side, and then select Transform data from the Home menu. Alternatively, you can right-click on one of the fields in the Data view on the far right of the window and choose Edit query for the same result. Once you have accessed the Power Query Editor window again, you will be able to see your DataFrames, or Queries, on the left-hand side, while the Applied Steps for the currently selected DataFrame will be displayed on the right-hand side, with rows and columns in the center.

Each step in the Applied Steps represents a sequence of data transformations that are applied in a pipeline-like fashion against a query, from top to bottom. Each step is expressed as a Power Query M formula. The first step, named Source, involves invoking your Python script, which generates four DataFrames based on the SQLite database. The other two steps extract the relevant DataFrame and transform the column types.

By clicking the gear icon next to the Source step, you’ll reveal your data ingestion script’s original Python source code. This feature lets you access and edit the Python code baked into a Power BI report even after saving it as a .pbix file.

You can insert custom steps into the pipeline for more granular control over data transformations. Power BI Desktop offers plenty of built-in transformations that you’ll find in the top menu of Power Query Editor. But in this tutorial, you’ll explore the Run Python script transformation, which is the second mode of running Python code in Power BI:

Conceptually, it works almost identically to data ingestion, but there are a few differences. First of all, you may use this transformation with any data source that Power BI supports natively, so it could be the only use of Python in your report. Secondly, you get an implicit global variable called dataset in your script, which holds the current state of the data in the pipeline, represented as a pandas DataFrame.

Note: As before, your script can produce multiple DataFrames, but you’ll only be able to select one for further processing in the transformation pipeline. You can also decide to modify your dataset in place without creating any new DataFrames.

Pandas lets you extract values from an existing column into new columns using regular expressions. For example, some customers in your table have an email address enclosed in angle brackets (<>) next to their name, which should belong to a separate column.

Select the customers query, then select the last Changed Type step, and add a Run Python script transformation to the applied steps. When the pop-up window appears, type the following Python code:

# Split the "customer" column into a name and an email address,
# then drop the original column.
dataset = dataset.assign(
    full_name=dataset["customer"].str.extract(r"([^<]+)"),
    email=dataset["customer"].str.extract(r"<([^>]+)>"),
).drop(columns=["customer"])

When working with Power BI, you can utilize the implicit dataset variable in your script to reference the customer’s DataFrame, giving you access to its methods and allowing you to override it with your transformed data. Alternatively, you have the option to define a new variable for the resulting DataFrame. During the transformation process, you can add two new columns, full_name and email, and then remove the original customer column containing both information pieces.

Once you’ve finished your transformation, clicking OK and waiting a few seconds will display a table showing the DataFrames your script produced. In this case, there is only one DataFrame named dataset, as you reused the implicit global variable provided by Power BI for your new DataFrame. To choose your desired DataFrame, simply click the yellow Table link in the Value column.

Your customers table now has two new columns, allowing you to quickly identify customers who have not provided their email addresses. If you desire further transformations, you can add additional steps. For example, you could split the full_name column into separate columns for first_name and last_name, assuming that there are no instances of customers with more than two names.

Be sure to select the final transformation step and insert another Run Python script. The corresponding Python code for this step should appear as follows:

# Split the full name into first and last name columns,
# then drop the combined column in place.
dataset[["first_name", "last_name"]] = dataset["full_name"].str.split(
    n=1, expand=True
)
dataset.drop(columns=["full_name"], inplace=True)

Unlike in the previous step, the dataset variable now refers to a DataFrame that already contains the full_name and email columns, because you’re further down the pipeline. Also, notice the inplace=True parameter, which drops the full_name column from the existing DataFrame rather than returning a new object.
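If you prefer to avoid mutating the DataFrame in place, a hedged equivalent of the two statements above reassigns the result instead:

# Non-mutating equivalent: build the new columns, then assign the
# resulting DataFrame back to the implicit dataset variable.
name_parts = dataset["full_name"].str.split(n=1, expand=True)
dataset = dataset.assign(
    first_name=name_parts[0],
    last_name=name_parts[1],
).drop(columns=["full_name"])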

You’ll notice that Power BI gives generic names to the applied steps and appends consecutive numbers to them in case of many instances of the same step. Fortunately, you can give the steps more descriptive names by right-clicking on a step and choosing Rename from the context menu:

By editing Properties…, you may also describe in a few sentences what the given step is trying to accomplish.

When you’re finished transforming your datasets, you can close the Power Query Editor by choosing Close & Apply from the Home ribbon or its alias in the File menu:

This will apply all transformation steps across your datasets and return to the main window of Power BI Desktop.

Next up, you’ll learn how to use Python to produce custom data visualizations.

Power BI Python Data Visualization

So far, we’ve covered importing and transforming data using Python in Power BI Desktop. Python’s third and final application is creating visual representations of your data. When it comes to visualizations, you have the flexibility to use any of the supported Python libraries, provided you’ve installed them in the virtual environment that Power BI utilizes. However, Matplotlib serves as the foundation for plotting, which other libraries delegate to in any case.

If Power BI hasn’t already directed you to the Report perspective following your data transformations, you can now navigate by clicking on the chart icon on the left ribbon. This will bring up a blank report canvas where you can add your graphs and other interactive components, collectively referred to as visuals.

Over on the right in the Visualizations palette, you’ll see several icons corresponding to the available visuals. Find the icon of the Python visual and click it to add the visual to the report canvas. The first time you add a Python or R visual to a Power BI report, it’ll ask you to enable script visuals.

In fact, it’ll keep asking you the same question in each Power BI session because there’s no global setting for this. When you open a file with your saved report that uses script visuals, you’ll have the option to review the embedded Python code before enabling it. Why? The short answer is that Power BI cares for your privacy, as any script could leak or damage your data if it’s from an untrusted source.

However, if you’ve configured Power BI to use an external code editor, then clicking on the little skewed arrow icon (↗) will launch it and open the entire scaffolding of the script. You can ignore its content for the moment, as you’ll explore it in an upcoming section. Unfortunately, you have to manually copy and paste the script’s part between the auto-generated # Prolog and # Epilog comments back to Power BI when you’re done editing.

Note: Don’t ignore the yellow warning bar in the Python script editor, which reminds you that rows with duplicate values will be removed. If you only dragged the color column, then you’d end up with just a handful of records corresponding to the few unique colors. However, adding the vin column prevents this by letting colors repeat throughout the table, which can be useful when performing aggregations.

To demonstrate an elementary use of a Python visual in Power BI, you can plot a bar chart showing the number of cars painted in a given color. Here is an example of the Python code for such a visual:

import matplotlib.pyplot as plt

# Use a seaborn-like theme (on newer Matplotlib versions this style
# is named "seaborn-v0_8" instead).
plt.style.use("seaborn")

# Count cars per color, ignoring records with a missing color.
series = dataset[dataset["color"] != ""]["color"].value_counts()
series.plot(kind="bar", color=series.index, edgecolor="black")

plt.show()

To get started with creating visualizations, you can begin by enabling Matplotlib’s theme that mimics the seaborn library. This will provide a more visually appealing look and feel compared to the default theme.

Next, you can remove any records with missing color data and count the number of remaining records in each unique color group. This results in a pandas.Series object that can be plotted and color-coded using its index, which consists of the color names. Finally, you render the plot by calling plt.show().

With these steps, you can easily create a basic visualization of your data using Python in Power BI. Of course, the possibilities for visualizing your data are endless, and you can explore and experiment with other Python libraries and techniques to create even more engaging and informative visualizations.
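As one hedged idea for further experimentation, the sketch below aggregates the same dataset before plotting and shows the average sale price per color. It assumes the color and sale_price fields have been dragged into the visual, so they appear in the implicit dataset DataFrame.

import matplotlib.pyplot as plt

# Assumes the color and sale_price fields are part of this visual's dataset.
avg_price = (
    dataset[dataset["color"] != ""]
    .groupby("color")["sale_price"]
    .mean()
    .sort_values()
)
avg_price.plot(kind="barh", edgecolor="black")
plt.xlabel("Average sale price")
plt.show()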

Additional Settings For Power BI Desktop

With the power of pandas and Python, there are countless possibilities for transforming your datasets in Power BI. Some examples include:

  • Anonymizing sensitive personal information, such as credit card numbers
  • Identifying and extracting new entities from your data
  • Rejecting sales with missing transaction details
  • Removing duplicate sales records
  • Unifying inconsistent purchase and sale date formats

These are just a few ideas to get you started, but the possibilities are endless. While we can’t cover everything in this article, don’t hesitate to experiment on your own. Keep in mind that your success in using Python to transform data in Power BI will depend on your understanding of pandas, which is the library that Power BI uses under the hood. The more you learn about pandas and its capabilities, the more you can achieve with your data in Power BI.
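As a hedged sketch of a few of the ideas above, meant to run inside a Run Python script step (so the implicit dataset variable is available); the card_number column is hypothetical, while the other fields come from the sales dataset:

import pandas as pd

dataset = dataset.drop_duplicates()                            # remove duplicate sales records
dataset = dataset.dropna(subset=["sale_price", "sale_date"])   # reject sales with missing details

# Unify purchase and sale date formats; values that cannot be parsed become NaT.
for column in ("purchase_date", "sale_date"):
    dataset[column] = pd.to_datetime(dataset[column], errors="coerce")

# Anonymize sensitive values, keeping only the last four digits.
if "card_number" in dataset.columns:
    dataset["card_number"] = "****" + dataset["card_number"].astype(str).str[-4:]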

Special Code Editor

Within the Python scripting options in Power BI, a useful setting allows you to specify the default Python integrated development environment (IDE) or code editor you prefer to use when working on a code snippet. You can stick with the operating system’s default program associated with the .py file extension, or you can select a specific Python IDE of your choice to launch from Power BI. This flexibility can make it easier and more efficient for you to write and debug Python code directly in Power BI.

To indicate your preferred Python Integrated Development Environment (IDE), opt for “Other” from the initial dropdown menu, and navigate to the executable file of your preferred code editor. For instance, you may browse to this one:

\Desktop\Programs\Microsoft VS Code\Code.exe

As before, the path to your app can be different and contain different folders.

What are the Cons of Using Python Power BI?

Python integration in Power BI Desktop has some limitations you should be aware of.

Timeouts

The most notable limitations are related to timeouts, data size, and non-interactive visuals. Data ingestion and transformation scripts defined in the Power Query Editor can’t run longer than thirty minutes. Python scripts in Power BI visuals are limited to five minutes of execution, and there are additional data size limits: only the top 150,000 rows of a dataset can be plotted, and the input dataset can’t be larger than 250 megabytes.

Marshaling

In Power BI Desktop, communication between Power BI and Python happens by exchanging CSV files. When you use Python to manipulate data, the script must therefore load the dataset from a text file created by Power BI on each run, and then save the results to another text or image file for Power BI to read. This redundant data marshaling can become a significant performance bottleneck when working with larger datasets, and it is the biggest drawback of Python integration in Power BI Desktop.

If you encounter poor performance, you may want to consider using Power BI’s built-in transformations or the Data Analysis Expressions (DAX) formula language instead of Python. Another approach is to reduce the number of data serializations by collapsing multiple steps into a single Python script that does the heavy lifting in bulk. For example, instead of creating multiple steps in the Power Query Editor for a very large dataset, you can combine them into the first loading script.
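A hedged sketch of that idea, folding the earlier cleaning steps into the initial loading script so the data only crosses the Python boundary once (paths and columns as in the examples above):

import sqlite3
import pandas as pand

with sqlite3.connect(r"C:\Users\User\Desktop\sales.db") as connection:
    df = pand.read_sql_query("SELECT * FROM sales", connection)

# Do the heavy lifting in one pass instead of several applied steps.
df = df.drop_duplicates()
df = df.assign(
    full_name=df["customer"].str.extract(r"([^<]+)"),
    email=df["customer"].str.extract(r"<([^>]+)>"),
).drop(columns=["customer"])

customers = df[["full_name", "email"]]
sales = df[["sale_price", "sale_date"]]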

Python Visualization

Data visualizations created using Python code are static images, which means you can’t interact with them to filter your dataset. However, Power BI will update the Python visuals in response to interacting with other visuals. It’s worth noting that Python visuals take slightly longer to display due to the data marshaling overhead and the need to run Python code to render them.

And Others

Using Python in Power BI has some other minor limitations. For instance, it can be difficult to share Power BI reports that rely on Python code with others, since the recipients would need to install and configure Python. Additionally, all datasets in a report must be set to a public privacy level for Python scripts to work properly in the Power BI service. Furthermore, only a limited set of Python libraries is supported. You can find additional minor limitations in Microsoft’s documentation on preparing a Python script and on the known limitations of Python visuals in Power BI.