#schema-on-read (Hadoop)

1. Load the data
```Hadoop
hdfs dfs -copyFromLocal bleh/name.txt /user/hadoop/customer
```
2. Query the data
```Hadoop
hadoop jar hadoop-streaming.jar \
    -file customer-mapper.py -file customer-reducer.py \
    -mapper customer-mapper.py \
    -reducer customer-reducer.py \
    -input /user/hadoop/customer \
    -output /user/hadoop/output/query1
```

In Hadoop (non-SQL), the data's structure is interpreted as it is read, in this case by a #python script (a sketch of such scripts is at the end of this note).

#schema-on-write (SQL)

1. Create schema
```SQL
CREATE TABLE Customers (
    Key int,
    Name varchar(40),
    ...
);
```
2. Add data
```SQL
BULK INSERT Customers
FROM '.../name.txt'
WITH (FIELDTERMINATOR = '","');
```
3. Query data
```SQL
SELECT Key, Name FROM Customers;
```

In SQL you can't add data until the table's schema has been declared. If the data's structure changes and the schema has to be redefined, what are the implications of dropping and re-loading 500 TB of data?
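For reference, a minimal sketch of what customer-mapper.py and customer-reducer.py might contain. The actual scripts aren't part of this note, and the field layout (Key and Name as the first two quoted CSV fields) is an assumption borrowed from the SQL example above.

```python
#!/usr/bin/env python
# customer-mapper.py -- hypothetical sketch, not the actual script.
# Assumes each input line is a quoted CSV record like "1","Alice",...
# (field layout borrowed from the SQL example above).
import csv
import sys

for row in csv.reader(sys.stdin):
    if len(row) >= 2:
        key, name = row[0], row[1]
        # This is schema-on-read in action: the script, not the storage
        # layer, decides at read time which fields exist and what they mean.
        print('%s\t%s' % (key, name))
```

```python
#!/usr/bin/env python
# customer-reducer.py -- hypothetical identity reducer: the streaming
# framework has already sorted mapper output by key, so pass it through.
import sys

for line in sys.stdin:
    sys.stdout.write(line)
```

This is the point of the contrast: if the file layout changes, only the mapper has to change and the data on HDFS stays where it is, whereas the SQL route may force a schema migration or a drop-and-reload.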