Easily Query ORC Data in Python with PySpark

Holly Emblem
Towards Data Science
3 min read · Aug 12, 2019

Photo by Eric Han on Unsplash

Optimized Row Columnar, or ORC, is a column-oriented data storage format that is part of the Apache Hadoop family. While processing ORC files might not typically be within the wheelhouse of a data scientist, there are occasions where you’ll need to pull these files out and handle them using the data munging libraries of your choice.
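If "column-oriented" is unfamiliar, a toy sketch in plain Python may help. It is not how ORC is implemented (real ORC files add stripes, compression and indexes), and the `rows` data and the `to_columns` helper are made up purely for illustration. The idea is only that the values of a single field are stored together, so a query touching one field does not need to read whole records.

```python
# Toy illustration only: real ORC files add stripes, compression, and indexes.
# The records below are hypothetical sample data.
rows = [
    {"player": "ann", "level": 3, "score": 120},
    {"player": "bob", "level": 5, "score": 340},
    {"player": "cy", "level": 2, "score": 90},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited, even though only one field is needed.
total_from_rows = sum(record["score"] for record in rows)

# Column layout: the values of one field sit together and are read directly.
total_from_columns = sum(columns["score"])
```

Both totals are the same. What differs is how much data has to be touched to compute them, which is roughly why columnar formats suit analytical queries over a few fields.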


Head of Insights at Rare, an Xbox Game Studio. Previous experience as a data scientist and lead. Interested in deep learning, quantum computing and statistics.