Data never sleeps. In 2021, the world’s five-billion-plus internet users consumed an estimated 79 zettabytes of data, from Amazon to Zoom.* As new data sources become more widely accessible to human and machine analytics, the main challenge is not so much data management as data discernment. That is:
You’ve certainly heard it before, but it bears repeating: labeled datasets are the backbone of most machine learning systems. However, the scarcity of high-quality, permissively licensed training data remains a barrier to entry for DIYers interested in building and deploying smart applications and systems.
Since its inception, IQT Labs has been deeply involved in improving Open Source access to high-quality data. We began by painstakingly building and releasing hand-curated datasets, and by exploring the utility of synthetic data generation. At the moment we are slightly obsessed with the trend toward automatic curation of labeled data and the transformative role these methods may play in the future.
While data creation remains a focus, so too are the tools and methods emerging in the Open Source community that help us better understand the world we live in. There is so much to explore, just in the data that surrounds us.
* That’s 79 × 10²¹, or 79,000,000,000,000,000,000,000 bytes, a figure expected to more than double to 180 zettabytes by 2025.
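The footnote’s figures are easy to sanity-check. A minimal sketch, treating a zettabyte as the decimal SI unit (10²¹ bytes):

```python
# Back-of-the-envelope check of the footnote's figures.
# Assumes the decimal SI definition: 1 zettabyte = 10**21 bytes.
ZETTABYTE = 10**21  # bytes

consumed_2021 = 79 * ZETTABYTE    # estimated data consumed in 2021
projected_2025 = 180 * ZETTABYTE  # projected for 2025

print(f"2021 consumption: {consumed_2021:,} bytes")
print(f"Growth by 2025:   {projected_2025 / consumed_2021:.2f}x")
```

The 180 / 79 ratio works out to roughly 2.3×, consistent with the “more than double” projection.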