Welcome to this comprehensive tutorial on advanced data structures in Python, tailored specifically for data science.
Let's get started!
Introduction
Data scientists are often faced with the need to count items, manipulate complex data structures, or create dictionary-like objects even when some information might be missing. Python's standard library provides a handful of tools that make these tasks quite straightforward. In this tutorial, we will delve into some of these tools.
Collections Module
Python's collections module offers specialized container types that extend Python's built-in dict, list, set, and tuple. You can think of the collections module as a "toolbox" that contains special versions of those built-in data structures, each designed to solve a different type of problem.
Counter
The Counter object is a dictionary subclass designed for counting. Rather than storing arbitrary key-value pairs, it takes an iterable (such as a list) and maps each distinct item to the number of times it appears.
To understand it better, let's use an example. Imagine you have a basket full of different types of fruit. Using a Counter, you can quickly know how many apples, bananas, oranges, etc., are in your basket. Let's see how:
from collections import Counter
# Let's say this is our basket
basket = ['apple', 'banana', 'orange', 'apple', 'banana', 'apple']
# Create a Counter object
fruit_count = Counter(basket)
print(fruit_count)
This will output:
Counter({'apple': 3, 'banana': 2, 'orange': 1})
As you can see, the Counter object has counted the frequency of each type of fruit in our basket.
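Counter is not limited to lists. The short sketch below shows three behaviors that are easy to miss: it counts any iterable, it reports 0 for items it has never seen, and update() adds counts from more data. The names letter_count and second_basket are made up for this illustration.

```python
from collections import Counter

# Counter accepts any iterable; a string is counted character by character
letter_count = Counter("banana")
print(letter_count)  # Counter({'a': 3, 'n': 2, 'b': 1})

fruit_count = Counter(['apple', 'banana', 'orange', 'apple', 'banana', 'apple'])

# A missing key reports a count of 0 instead of raising KeyError
print(fruit_count['kiwi'])  # 0

# update() adds the counts from another iterable (hypothetical second basket)
second_basket = ['orange', 'orange', 'kiwi']
fruit_count.update(second_basket)
print(fruit_count['orange'])  # 3
```

Looking up 'kiwi' before the update does not add it to the Counter; it only appears once update() has actually counted it.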
Most Common Elements
The Counter object also has a very handy method called most_common(). It returns a list of (element, count) tuples sorted by count in descending order. Called with no argument, it returns every element, so the least common items sit at the end of that list.
Let's use the previous example and find out the most common fruit in our basket:
# Print the most common fruit
print(fruit_count.most_common(1))
This will output:
[('apple', 3)]
Our most_common(1) method call tells us that 'apple' is the most common fruit in our basket, appearing 3 times. You can replace 1 with any number n to get the top n most common fruits.
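There is no dedicated method for the least common items, but slicing the full ranking from the end does the job. Here is a minimal sketch using the same basket; the variable names ranked and least_common are only illustrative.

```python
from collections import Counter

fruit_count = Counter(['apple', 'banana', 'orange', 'apple', 'banana', 'apple'])

# With no argument, most_common() returns every item, most frequent first
ranked = fruit_count.most_common()
print(ranked)  # [('apple', 3), ('banana', 2), ('orange', 1)]

# Slicing backwards from the end gives the n least common items
n = 2
least_common = ranked[:-n - 1:-1]
print(least_common)  # [('orange', 1), ('banana', 2)]
```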
Defaultdict for Unknown Structures
Python's defaultdict is another powerful tool from the collections module. It is a dictionary subclass that supports every regular dictionary method, but its first argument is a factory function (such as list or int) that is called to create a default value whenever a missing key is accessed.
This can be very useful when working with data where we don't know all the keys that will be used. It's like having an empty notebook for each new topic that comes up while you're studying - you don't need to know all the topics ahead of time, and each one gets its own space as soon as it comes up.
Here's an example to illustrate:
from collections import defaultdict
# List of tuples with park's id and the name of the eatery
park_eateries = [('M010', 'Tavern on the Green'), ('M020', 'Loeb Boathouse'), ('M010', 'Le Pain Quotidien')]
# Create a defaultdict of type list
eateries_by_park = defaultdict(list)
# Loop over the list and populate the defaultdict
for park_id, eatery_name in park_eateries:
    eateries_by_park[park_id].append(eatery_name)
print(eateries_by_park)
This will output:
defaultdict(<class 'list'>, {'M010': ['Tavern on the Green', 'Le Pain Quotidien'], 'M020': ['Loeb Boathouse']})
As you can see, our defaultdict has conveniently grouped eateries by the park ID.
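One side effect is worth knowing: simply indexing a missing key is enough to create it. The sketch below, which uses a made-up park ID 'M999', contrasts indexing with a membership test and .get(), neither of which adds anything.

```python
from collections import defaultdict

eateries_by_park = defaultdict(list)
eateries_by_park['M010'].append('Tavern on the Green')

# Membership tests and .get() do not create new keys
has_park = 'M999' in eateries_by_park   # False
missing = eateries_by_park.get('M999')  # None
keys_before = list(eateries_by_park)    # ['M010']

# Indexing a missing key calls list() and stores the empty result
_ = eateries_by_park['M999']
keys_after = list(eateries_by_park)     # ['M010', 'M999']

print(keys_before, keys_after)
```

When you only want to check whether a key exists, prefer `in` or .get() so that the dictionary does not fill up with empty entries.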
Defaultdict as a Counter
defaultdict is versatile. It can also be used as a counter, just like the Counter object we explored before. This comes in handy when working with more complex structures, such as a list of dictionaries.
Imagine you're a detective, and you're trying to find out which phone numbers and websites appear most often in your list of eateries. You can think of defaultdict as your sidekick who keeps track of all the details, no matter how many eateries or contact details there are.
Here's how you can use defaultdict to count the occurrences of phone numbers and websites in a list of dictionaries:
from collections import defaultdict
# This is our list of eateries with contact details
nyc_park_eateries = [
    {"name": "Tavern on the Green", "phone": "1234567890", "website": "www.tavernonthegreen.com"},
    {"name": "Loeb Boathouse", "phone": "1234567890", "website": "www.loebboathouse.com"},
    {"name": "Le Pain Quotidien", "phone": "0987654321", "website": "www.tavernonthegreen.com"},
]
# Create defaultdicts of type int
phone_count = defaultdict(int)
website_count = defaultdict(int)
# Loop over the list and populate the defaultdicts
for eatery in nyc_park_eateries:
    phone_count[eatery['phone']] += 1
    website_count[eatery['website']] += 1
print("Phone counts:", dict(phone_count))
print("Website counts:", dict(website_count))
This will output:
Phone counts: {'1234567890': 2, '0987654321': 1}
Website counts: {'www.tavernonthegreen.com': 2, 'www.loebboathouse.com': 1}
Now, you can easily see which phone numbers and websites appear most often in your data.
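For comparison, the same tallies can be produced by feeding a generator expression to Counter. This is one possible alternative rather than a replacement for the loop above; it assumes every dictionary has both a 'phone' and a 'website' key.

```python
from collections import Counter

nyc_park_eateries = [
    {"name": "Tavern on the Green", "phone": "1234567890", "website": "www.tavernonthegreen.com"},
    {"name": "Loeb Boathouse", "phone": "1234567890", "website": "www.loebboathouse.com"},
    {"name": "Le Pain Quotidien", "phone": "0987654321", "website": "www.tavernonthegreen.com"},
]

# Pull one field out of each dictionary and let Counter do the tallying
phone_count = Counter(eatery['phone'] for eatery in nyc_park_eateries)
website_count = Counter(eatery['website'] for eatery in nyc_park_eateries)

print(phone_count.most_common(1))    # [('1234567890', 2)]
print(website_count.most_common(1))  # [('www.tavernonthegreen.com', 2)]
```

Counter gives you most_common() for free, while defaultdict(int) fits better when the counting happens alongside other work in the same loop.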
Namedtuple
Next up in our toolbox is namedtuple. Named tuples are just like regular tuples, but they have named fields. You can think of named tuples like passports: they hold the same kind of values (like name, nationality, birth date) for many individuals. This makes our data more structured and our code more readable.
Let's see how we can use namedtuple to create a more structured representation of an eatery:
from collections import namedtuple
# Define a namedtuple type for an Eatery
Eatery = namedtuple('Eatery', ['name', 'phone', 'website'])
# Create an instance of our Eatery namedtuple
tavern_on_the_green = Eatery("Tavern on the Green", "1234567890", "www.tavernonthegreen.com")
print(tavern_on_the_green)
This will output:
Eatery(name='Tavern on the Green', phone='1234567890', website='www.tavernonthegreen.com')
Now, we can easily access the fields of our Eatery namedtuple:
print(tavern_on_the_green.name)
print(tavern_on_the_green.phone)
print(tavern_on_the_green.website)
These lines will output:
Tavern on the Green
1234567890
www.tavernonthegreen.com
Just like that, namedtuple has made our data more structured and our code easier to understand!
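Because a namedtuple is still a tuple, it keeps tuple behavior such as indexing, unpacking, and immutability. The sketch below walks through these, plus the _replace() and _asdict() helper methods; the variable names are only illustrative.

```python
from collections import namedtuple

Eatery = namedtuple('Eatery', ['name', 'phone', 'website'])
tavern = Eatery("Tavern on the Green", "1234567890", "www.tavernonthegreen.com")

# Still a tuple: index access and unpacking both work
first_field = tavern[0]
name, phone, website = tavern

# Immutable: assigning to a field raises AttributeError
try:
    tavern.phone = "0000000000"
except AttributeError:
    assignment_failed = True
else:
    assignment_failed = False

# _replace() returns a new instance and leaves the original untouched
updated = tavern._replace(phone="0987654321")

# _asdict() converts the fields to a dictionary
as_dict = tavern._asdict()

print(first_field, assignment_failed, updated.phone, as_dict['website'])
```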
Dataclasses
Imagine if you could have a data structure that behaves like a namedtuple but comes with even more useful features. Enter dataclasses - a new kind of data structure introduced in Python 3.7 that allows for more customization and control. You can think of a dataclass as an upgraded passport - it still holds important details, but now also includes a photo, electronic data, and even biometrics!
Introduction to Dataclasses
from dataclasses import dataclass
@dataclass
class Eatery:
    name: str
    website: str
    phone: str
Here, we've created a new dataclass Eatery with three fields - name, website, and phone.
Let's create an instance of our Eatery dataclass:
tavern_on_the_green = Eatery("Tavern on the Green", "www.tavernonthegreen.com", "1234567890")
print(tavern_on_the_green)
This will output:
Eatery(name='Tavern on the Green', website='www.tavernonthegreen.com', phone='1234567890')
Just like namedtuple, dataclasses offer readable string representations right out of the box!
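Two further differences from namedtuple show up quickly in practice: a dataclass gets a field-by-field equality check generated for it, and by default its instances can be modified. A small sketch, with a and b as illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Eatery:
    name: str
    website: str
    phone: str

a = Eatery("Tavern on the Green", "www.tavernonthegreen.com", "1234567890")
b = Eatery("Tavern on the Green", "www.tavernonthegreen.com", "1234567890")

# The generated __eq__ compares instances field by field
same = (a == b)
print(same)  # True

# Unlike a namedtuple, a regular dataclass instance can be modified
b.phone = "0987654321"
still_same = (a == b)
print(still_same)  # False
```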
Advantages of Dataclasses
Setting Default Values
Imagine a situation where most eateries have a website, but a few don't. In such a case, we can set a default value for the website field in our Eatery dataclass:
from dataclasses import dataclass
@dataclass
class Eatery:
    name: str
    phone: str
    # Fields with defaults must come after fields without defaults,
    # otherwise Python raises a TypeError when the class is defined
    website: str = 'No website'
Now, if we create an Eatery without a website:
some_eatery = Eatery("Some Eatery", phone="0987654321")
print(some_eatery)
This will output:
Eatery(name='Some Eatery', phone='0987654321', website='No website')
Here, 'No website' is used as the default value for the website field.
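Mutable defaults such as lists need a little extra care. Writing a bare `[]` as a field default makes the dataclass decorator raise a ValueError, so the usual approach is field(default_factory=...). The sketch below assumes a hypothetical tags field that is not part of the original Eatery.

```python
from dataclasses import dataclass, field

@dataclass
class Eatery:
    name: str
    website: str = 'No website'
    # Hypothetical field: each instance gets its own fresh list
    tags: list = field(default_factory=list)

first = Eatery("Some Eatery")
second = Eatery("Another Eatery")

# Changing one instance's list does not affect the other
first.tags.append("outdoor seating")
print(first.tags)   # ['outdoor seating']
print(second.tags)  # []
```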
Conversion to Dictionaries and Tuples
Dataclasses can easily be converted to dictionaries or tuples, which can be very useful for compatibility with other Python functions:
from dataclasses import asdict, astuple
# Convert to dictionary
eatery_dict = asdict(tavern_on_the_green)
print(eatery_dict)
# Convert to tuple
eatery_tuple = astuple(tavern_on_the_green)
print(eatery_tuple)
These lines will output:
{'name': 'Tavern on the Green', 'website': 'www.tavernonthegreen.com', 'phone': '1234567890'}
('Tavern on the Green', 'www.tavernonthegreen.com', '1234567890')
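One place where asdict() tends to help is serialization. As a rough sketch, a list of dataclass instances can be turned into a list of dictionaries and then written out as JSON with the standard json module; the names eateries, records, and payload are only illustrative.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Eatery:
    name: str
    website: str
    phone: str

eateries = [
    Eatery("Tavern on the Green", "www.tavernonthegreen.com", "1234567890"),
    Eatery("Loeb Boathouse", "www.loebboathouse.com", "1234567890"),
]

# One dictionary per instance, keyed by field name
records = [asdict(eatery) for eatery in eateries]

# Plain dictionaries of strings serialize directly to JSON
payload = json.dumps(records, indent=2)
print(payload)
```

The same list of dictionaries is also a common input format for tabular libraries, which is often why the conversion is useful in data science work.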
Custom Properties in Dataclasses
While dataclasses provide a very straightforward way to define data containers, sometimes you need a little more control over how data is accessed or set. For this, we can use properties.
A property in Python is like a custom gateway to a private variable. It allows you to control how that variable is accessed or modified.
For example, let's say we want to ensure that every time someone accesses the phone attribute of our Eatery class, it's returned in a formatted way:
from dataclasses import dataclass
@dataclass
class Eatery:
    name: str
    website: str = 'No website'
    _phone: str = None  # We'll store the raw phone data here
    @property
    def phone(self):
        return f"({self._phone[:3]}) {self._phone[3:6]}-{self._phone[6:]}"
Now, when we create a new Eatery and access its phone attribute:
some_eatery = Eatery("Some Eatery", _phone="0987654321")
print(some_eatery.phone)
This will output:
(098) 765-4321
The phone number is formatted in a more readable form!
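A property can also guard against missing data and control how values are set. The following is a sketch under a stated assumption: phone numbers are taken to be exactly 10 digits, which is a simplification chosen for this example. The setter and the 'No phone' fallback are additions for illustration, not part of the class above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Eatery:
    name: str
    website: str = 'No website'
    _phone: Optional[str] = None  # raw digits, or None when unknown

    @property
    def phone(self):
        # Fall back to a placeholder instead of failing on None
        if self._phone is None:
            return 'No phone'
        return f"({self._phone[:3]}) {self._phone[3:6]}-{self._phone[6:]}"

    @phone.setter
    def phone(self, value):
        # Keep only the digits, then check the assumed 10-digit length
        digits = ''.join(ch for ch in value if ch.isdigit())
        if len(digits) != 10:
            raise ValueError("expected a 10-digit phone number")
        self._phone = digits

some_eatery = Eatery("Some Eatery")
before = some_eatery.phone        # 'No phone'
some_eatery.phone = "098-765-4321"
after = some_eatery.phone         # '(098) 765-4321'
print(before, after)
```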
Frozen Instances
Finally, we can make our dataclass "frozen", which makes its instances immutable: you cannot assign to a field once an instance has been created. This can be handy when you want to ensure that data remains constant and safe from accidental modifications.
We just have to set frozen=True in our dataclass decorator:
@dataclass(frozen=True)
class Eatery:
    name: str
    website: str = 'No website'
    _phone: str = None
    @property
    def phone(self):
        return f"({self._phone[:3]}) {self._phone[3:6]}-{self._phone[6:]}"
Let's see what happens when we try to update a frozen instance:
tavern_on_the_green = Eatery("Tavern on the Green", "www.tavernonthegreen.com", "1234567890")
tavern_on_the_green.name = "New Name"
This will output:
---------------------------------------------------------------------------
FrozenInstanceError Traceback (most recent call last)
<ipython-input-xx-xxxxxxxxx> in <module>
----> 1 tavern_on_the_green.name = "New Name"
FrozenInstanceError: cannot assign to field 'name'
As expected, we get a FrozenInstanceError because we're trying to update a frozen instance!
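When a frozen instance does need to change, one option is dataclasses.replace(), which builds a new instance with selected fields swapped out. Frozen dataclasses are also hashable by default, so they can go into sets or serve as dictionary keys. A minimal sketch with a simplified two-field Eatery:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Eatery:
    name: str
    website: str = 'No website'

tavern = Eatery("Tavern on the Green", "www.tavernonthegreen.com")

# replace() returns a new instance; the original stays unchanged
renamed = replace(tavern, name="New Name")
print(tavern.name)   # Tavern on the Green
print(renamed.name)  # New Name

# Frozen instances are hashable, so they work in sets and as dict keys
visited = {tavern, renamed}
print(len(visited))  # 2
```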
Conclusion
Congratulations! You've made it to the end of this deep dive into some of the Python standard library's tools for managing complex data structures. From counting items and handling unknown structures to structuring your data with namedtuple and dataclasses, we hope this guide has given you the knowledge and confidence to handle the data structure problems that come up in your data science work.
As a Data Scientist, understanding and using these tools effectively can help you to improve the efficiency and readability of your code, making it easier to work with complex data and boosting your productivity. Happy coding!