Easily Query ORC Data in Python with PySpark
3 min read · Aug 12, 2019
Optimized Row Columnar, or ORC, is a column-oriented data storage format that is part of the Apache Hadoop family. While processing ORC files might not typically be within the wheelhouse of a data scientist, there are occasions where you’ll need to pull these files out and handle them using the data munging libraries of your choice.