automatminer crashes when using a 'saved' pipeline for prediction

Arnab_Kabiraj · October 4, 2019, 12:40pm

Dear developers,

I’m trying to save the best automatminer pipeline after optimization for a particular dataset using the MatPipe.save() function, since the optimization takes quite some time. It dumps the pipeline fine. However, when I load it with MatPipe.load() and use it to predict on some unseen data, it throws the error ‘Pipeline’ object has no attribute ‘fitted_pipeline_’. I understand this has something to do with the backend being removed and replaced with the best pipeline on saving, which the predict() function can’t handle, but I was unable to solve the problem myself. The commands and the screen output are pasted below.

pipe = MatPipe.load('pipe.pickle')
2019-10-04 17:44:28 INFO Loaded MatPipe from file pipe.pickle.
2019-10-04 17:44:28 WARNING Only use this model to make predictions (do not retrain!). Backend was serialzed as only the top model, not the full automl backend.

pipe.predict(df)
2019-10-04 17:44:38 INFO Beginning MatPipe prediction using fitted pipeline.
2019-10-04 17:44:38 INFO AutoFeaturizer: Starting transforming.
2019-10-04 17:44:38 INFO AutoFeaturizer: composition column already exists, overwriting with composition from structure.
2019-10-04 17:44:38 INFO AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
StructureToOxidStructure: 100%|███████████████████| 2/2 [00:00<00:00, 11.49it/s]
StructureToComposition: 100%|█████████████████████| 2/2 [00:00<00:00, 12.21it/s]
2019-10-04 17:44:39 INFO AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.
CompositionToOxidComposition: 100%|███████████████| 2/2 [00:00<00:00, 12.08it/s]
2019-10-04 17:44:39 INFO AutoFeaturizer: Featurizing with ElementProperty.
ElementProperty: 100%|████████████████████████████| 2/2 [00:00<00:00, 11.96it/s]
2019-10-04 17:44:39 INFO AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
StructureToOxidStructure: 100%|███████████████████| 2/2 [00:00<00:00, 13.03it/s]
2019-10-04 17:44:40 INFO AutoFeaturizer: Featurizing with SineCoulombMatrix.
SineCoulombMatrix: 100%|██████████████████████████| 2/2 [00:00<00:00, 12.05it/s]
2019-10-04 17:44:40 INFO AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping…
2019-10-04 17:44:40 INFO AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping…
2019-10-04 17:44:40 INFO AutoFeaturizer: Finished transforming.
2019-10-04 17:44:40 INFO DataCleaner: Starting transforming.
2019-10-04 17:44:40 INFO DataCleaner: Cleaning with respect to samples with sample na_method 'fill'
2019-10-04 17:44:40 INFO DataCleaner: Replacing infinite values with nan for easier screening.
2019-10-04 17:44:40 INFO DataCleaner: One-hot encoding used for columns ['material', 'dir', 'XY', 'E']
2019-10-04 17:44:40 INFO DataCleaner: Before handling na: 2 samples, 162 features
2019-10-04 17:44:40 INFO DataCleaner: 0 samples did not have target values. They were dropped.
2019-10-04 17:44:40 WARNING DataCleaner: Mismatched columns found in dataframe used for fitting and argument dataframe.
2019-10-04 17:44:40 WARNING DataCleaner: Coercing mismatched columns…
2019-10-04 17:44:40 INFO DataCleaner: After handling na: 2 samples, 143 features
2019-10-04 17:44:40 INFO DataCleaner: Reordering columns…
2019-10-04 17:44:40 INFO DataCleaner: Finished transforming.
2019-10-04 17:44:40 INFO FeatureReducer: Starting transforming.
2019-10-04 17:44:40 INFO FeatureReducer: Finished transforming.
2019-10-04 17:44:40 INFO TPOTAdaptor: Starting predicting.
Traceback (most recent call last):
  File "", line 1, in
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/utils/pkg.py", line 65, in wrapper
    return func(*args, **kwargs)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/pipeline.py", line 170, in predict
    predictions = self.learner.predict(df, self.target)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/utils/pkg.py", line 65, in wrapper
    return func(*args, **kwargs)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/utils/log.py", line 94, in wrapper
    result = meth(*args, **kwargs)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/automl/base.py", line 115, in predict
    y_pred = self.best_pipeline.predict(X)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/automl/adaptors.py", line 197, in best_pipeline
    return self.backend.fitted_pipeline_
AttributeError: 'Pipeline' object has no attribute 'fitted_pipeline_'

Please have a look and let me know how I can save the best pipeline to a file and use it later to make predictions on unseen data.

Regards,

Arnab

ardunn · October 4, 2019, 6:07pm

Hey Arnab,

Debugging save/load

Thanks for pointing this out. Upon running, I am also getting this issue. I’ve opened an issue on GitHub and am working on fixing it presently. I’ll update this thread when it is fixed.

Just to get some more info though, which version of automatminer and matminer are you using?

Other issues

I noticed in your log that DataCleaner is one-hot encoding some suspicious columns: [‘material’, ‘dir’, ‘XY’, ‘E’]. By default, automatminer presets include “extra” columns in the learning process, so that if you have features you want to use for learning, you don’t have to do any extra work.

I am guessing the first two columns are some material id string and the directory the files are in? If so, it’s unlikely you want to keep these as features. When one-hot encoded, the features the ML algorithms see will include some troublesome data:

If your original df is something like

“material” | “dir”
“mat-1”    | “my_dir_1”
“mat-2”    | “my_dir_2”
…

Then what the ML algorithm sees is:

“mat-1” | “mat-2” | … | “my_dir_1” | “my_dir_2” | …
1       | 0       | … | 1          | 0          | …
0       | 1       | … | 0          | 1          | …

This can add thousands of extra features that are not relevant to your problem, which will (1) add considerable noise to the learning problem and (2) make automatminer pipelines much slower and larger in size.
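
To make the blow-up concrete, here is a minimal pure-Python sketch of one-hot encoding. It is not automatminer's actual DataCleaner code; the one_hot helper and the toy rows are made up for illustration. Each distinct string in an id-like column becomes its own 0/1 feature, so the feature count grows with the number of unique values:

```python
def one_hot(rows, columns):
    """One-hot encode the given string columns of a list of row dicts.

    Illustrative helper only; this is not how automatminer implements it.
    """
    # Collect the distinct values of each column, in first-seen order.
    categories = {
        col: list(dict.fromkeys(row[col] for row in rows)) for col in columns
    }
    encoded = []
    for row in rows:
        new_row = {}
        for col in columns:
            # One 0/1 feature per distinct value of this column.
            for value in categories[col]:
                new_row[value] = 1 if row[col] == value else 0
        encoded.append(new_row)
    return encoded


# Toy data mirroring the table above: every row has a unique id and directory.
rows = [
    {"material": "mat-1", "dir": "my_dir_1"},
    {"material": "mat-2", "dir": "my_dir_2"},
]
encoded = one_hot(rows, ["material", "dir"])
print(encoded[0])

# With n unique ids and n unique directories you end up with 2 * n features.
many_rows = [{"material": f"mat-{i}", "dir": f"my_dir_{i}"} for i in range(1000)]
n_features = len(one_hot(many_rows, ["material", "dir"])[0])
print(n_features)
```

For the two toy rows this gives four features; for 1000 samples with unique ids and directories it gives 2000, none of which tell the model anything about the material itself.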

If you didn’t mean to include them, the easiest way to remove them is just by dropping them from the training data frames. I think the pipeline will drop them on prediction automatically. If you can, this is the way I’d recommend.
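
As a rough sketch of this first option, assuming your training data is a pandas DataFrame: the “structure” and “target” columns and their values below are placeholders I made up, and only ‘material’ and ‘dir’ come from your log. The idea is simply to drop the bookkeeping columns before training:

```python
import pandas as pd

# Placeholder training frame; "structure" and "target" are hypothetical
# stand-ins for your real input and target columns.
train_df = pd.DataFrame(
    {
        "material": ["mat-1", "mat-2"],
        "dir": ["my_dir_1", "my_dir_2"],
        "structure": ["<structure 1>", "<structure 2>"],
        "target": [0.1, 0.2],
    }
)

# Columns that only identify a sample and carry no physical information.
id_columns = ["material", "dir"]

# errors="ignore" skips any listed column that is not present in the frame.
clean_df = train_df.drop(columns=id_columns, errors="ignore")
print(list(clean_df.columns))
```

You would then train the pipeline on clean_df instead of train_df. Note that drop returns a new frame here, so the original train_df is left untouched.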

Alternatively, if you are familiar with defining your own pipelines, you can ignore columns in each automatminer class: AutoFeaturizer, DataCleaner, FeatureReducer, and AutoMLAdaptor. We are currently working on a way to easily ignore columns throughout the entire pipeline (https://github.com/hackingmaterials/automatminer/issues/228), but this is not quite done yet.

Thanks,

Alex

ardunn · October 4, 2019, 6:47pm

Hey Arnab,

The issue has been fixed as of commit db8e940b328dd1e29a2a9206788caaa99b130a96

Pull the latest commits from the GitHub repo for the fix. I’ll be releasing a new version soon (within the next 2 weeks or so) but if you need it quicker than that go ahead and do a git pull.

Thanks,

Alex

Arnab_Kabiraj · October 5, 2019, 8:16am

Thanks a lot, Alex, for the swift response. I can confirm that the problem has been resolved.

ardunn · October 5, 2019, 12:43pm

Sure thing! And if you are still having issues with the columns (see my previous response; that is, if they are indeed unintended), I can help troubleshoot.