About Alluxio
One of the technologies born from the big data revolution is Alluxio, created by Haoyuan “HY” Li, one of the BigDATAwire People to Watch for 2024. Alluxio is a virtual distributed file system designed to be used with frameworks like Apache Hadoop and Apache Spark.
Li founded a company called Alluxio, where he serves as chairman and CEO. BigDATAwire recently caught up with Li to talk about his work.
Inspiration
BigDATAwire: You created Alluxio while working in the AMPLab at UC Berkeley. What was the source of the inspiration for the project?
HY Li: When I was doing research at Google as an undergraduate, I saw the power of data as the foundation of many aspects of our world in the future. With that belief, I was very fortunate to have the opportunity to pursue my Ph.D. at Berkeley AMPLab under the tutelage of Professor Ion Stoica and Professor Scott Shenker.
At the time, there was an explosion in innovation at the compute layer and storage layer, which created a unique problem associated with data orchestration (including data access, management, etc.). While the introduction of new technologies enabled many new applications, every new storage system became yet another data silo. The rise of cloud storage only exacerbated these challenges.
What is Missing from the Big Data Stack Today?
BigDATAwire: What is missing from the big data stack today?
Li: Companies are racing to leverage AI and machine learning in their businesses, and what they are realizing is that machine learning applications create a new set of challenges for their data platforms. Traditional data infrastructures often struggle to cope with these demands, leading to cost inefficiencies, slower innovation, and complex data engineering.
With the rise of machine learning workloads such as computer vision and LLMs, the need for a high-performance data layer that serves all critical data-driven applications is even greater. Alluxio provides an efficient offline model training cache capable of serving datasets of any size directly to training nodes without impacting the training performance.
Relationship Between Distributed File Systems and Streaming Data Platforms
BigDATAwire: You had a role in developing Spark Streaming. What’s the relationship between distributed file systems and streaming data platforms?
Li: We see streaming data applications as one type of data-driven application that a data platform such as Alluxio serves.
Outside Interests
BigDATAwire: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?
Li: Outside of work, I enjoy exploring the great outdoors through hiking and scuba diving. I love what I do, but it can be difficult to find the space to step back and appreciate the world. I’ve found scuba diving to be the perfect activity as it requires focus to ensure safety, which allows me to be fully present and appreciate the wonders of the sea world.
I also have a keen interest in world history and cultural exchange. I enjoy learning about different cultures and traditions from around the world. This curiosity has led me to travel extensively and engage with people from diverse backgrounds, enriching my understanding of the world and fostering meaningful connections.
Conclusion
Alluxio is an innovative solution that bridges the gap between compute and storage, providing high-performance data access for all data-driven workloads. With its ability to serve datasets of any size directly to training nodes without impacting training performance, Alluxio accelerates model updates from experimentation to production, facilitating a better user experience and deeper user engagement.
Frequently Asked Questions
Q: What inspired Haoyuan “HY” Li to create Alluxio?
A: Li was inspired by the power of data and the need for a new type of data platform that could bridge the gap between compute and storage.
Q: What is the relationship between Alluxio and big data?
A: Alluxio is a virtual distributed file system designed to be used with big data frameworks like Apache Hadoop and Apache Spark.
Q: What is Alluxio used for?
A: Alluxio is used for high-performance data access and as an offline model training cache, which enables data teams to achieve orders of magnitude higher training performance without the need for costly specialized storage.

