Mirror of https://github.com/nomic-ai/gpt4all.git (synced 2025-09-02 17:15:18 +00:00)
vulkan support for typescript bindings, gguf support (#1390)
* adding some native methods to cpp wrapper
* gpu seems to work
* typings and add availibleGpus method
* fix spelling
* fix syntax
* more
* normalize methods to conform to py
* remove extra dynamic linker deps when building with vulkan
* bump python version (library linking fix)
* Don't link against libvulkan.
* vulkan python bindings on windows fixes
* Bring the vulkan backend to the GUI.
* When device is Auto (the default) then we will only consider discrete GPUs, otherwise fall back to CPU.
* Show the device we're currently using.
* Fix up the name and formatting.
* init at most one vulkan device, submodule update fixes issues w/ multiple of the same gpu
* Update the submodule.
* Add version 2.4.15 and bump the version number.
* Fix a bug where we're not properly falling back to CPU.
* Sync to a newer version of llama.cpp with bugfix for vulkan.
* Report the actual device we're using.
* Only show GPU when we're actually using it.
* Bump to new llama with new bugfix.
* Release notes for v2.4.16 and bump the version.
* Fall back to CPU more robustly.
* Release notes for v2.4.17 and bump the version.
* Bump the Python version to python-v1.0.12 to restrict the quants that vulkan recognizes.
* Link against ggml in bin so we can get the available devices without loading a model.
* Send actual and requested device info for those who have opted in.
* Actually bump the version.
* Release notes for v2.4.18 and bump the version.
* Fix for crashes on systems where vulkan is not installed properly.
* Release notes for v2.4.19 and bump the version.
* fix typings and vulkan build works on win
* Add flatpak manifest
* Remove unnecessary stuff from manifest
* Update to 2.4.19
* appdata: update software description
* Latest rebase on llama.cpp with gguf support.
* macos build fixes
* llamamodel: metal supports all quantization types now
* gpt4all.py: GGUF
* pyllmodel: print specific error message
* backend: port BERT to GGUF
* backend: port MPT to GGUF
* backend: port Replit to GGUF
* backend: use gguf branch of llama.cpp-mainline
* backend: use llamamodel.cpp for StarCoder
* conversion scripts: cleanup
* convert scripts: load model as late as possible
* convert_mpt_hf_to_gguf.py: better tokenizer decoding
* backend: use llamamodel.cpp for Falcon
* convert scripts: make them directly executable
* fix references to removed model types
* modellist: fix the system prompt
* backend: port GPT-J to GGUF
* gpt-j: update inference to match latest llama.cpp insights
  - Use F16 KV cache
  - Store transposed V in the cache
  - Avoid unnecessary Q copy
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
  ggml upstream commit 0265f0813492602fec0e1159fe61de1bf0ccaf78
* chatllm: grammar fix
* convert scripts: use bytes_to_unicode from transformers
* convert scripts: make gptj script executable
* convert scripts: add feed-forward length for better compatibility. This GGUF key is used by all llama.cpp models with upstream support.
* gptj: remove unused variables
* Refactor for subgroups on mat * vec kernel.
* Add q6_k kernels for vulkan.
* python binding: print debug message to stderr
* Fix regenerate button to be deterministic and bump the llama version to latest we have for gguf.
* Bump to the latest fixes for vulkan in llama.
* llamamodel: fix static vector in LLamaModel::endTokens
* Switch to new models2.json for new gguf release and bump our version to 2.5.0.
* Bump to latest llama/gguf branch.
* chat: report reason for fallback to CPU
* chat: make sure to clear fallback reason on success
* more accurate fallback descriptions
* differentiate between init failure and unsupported models
* backend: do not use Vulkan with non-LLaMA models
* Add q8_0 kernels to kompute shaders and bump to latest llama/gguf.
* backend: fix build with Visual Studio generator. Use the $<CONFIG> generator expression instead of CMAKE_BUILD_TYPE. This is needed because Visual Studio is a multi-configuration generator, so we do not know what the build type will be until `cmake --build` is called. Fixes #1470
* remove old llama.cpp submodules
* Reorder and refresh our models2.json.
* rebase on newer llama.cpp
* python/embed4all: use gguf model, allow passing kwargs/overriding model
* Add starcoder, rift and sbert to our models2.json.
* Push a new version number for llmodel backend now that it is based on gguf.
* fix stray comma in models2.json
  Signed-off-by: Aaron Miller <apage43@ninjawhale.com>
* Speculative fix for build on mac.
* chat: clearer CPU fallback messages
* Fix crasher with an empty string for prompt template.
* Update the language here to avoid misunderstanding.
* added EM German Mistral Model
* make codespell happy
* issue template: remove "Related Components" section
* cmake: install the GPT-J plugin (#1487)
* Do not delete saved chats if we fail to serialize properly.
* Restore state from text if necessary.
* Another codespell attempted fix.
* llmodel: do not call magic_match unless build variant is correct (#1488)
* chatllm: do not write uninitialized data to stream (#1486)
* mat*mat for q4_0, q8_0
* do not process prompts on gpu yet
* python: support Path in GPT4All.__init__ (#1462)
* llmodel: print an error if the CPU does not support AVX (#1499)
* python bindings should be quiet by default
* disable llama.cpp logging unless GPT4ALL_VERBOSE_LLAMACPP envvar is nonempty
* make verbose flag for retrieve_model default false (but also be overridable via gpt4all constructor); should be able to run a basic test:

  ```python
  import gpt4all
  model = gpt4all.GPT4All('/Users/aaron/Downloads/rift-coder-v0-7b-q4_0.gguf')
  print(model.generate('def fib(n):'))
  ```

  and see no non-model output when successful
* python: always check status code of HTTP responses (#1502)
* Always save chats to disk, but save them as text by default. This also changes the UI behavior to always open a 'New Chat' and setting it as current instead of setting a restored chat as current. This improves usability by not requiring the user to wait if they want to immediately start chatting.
* Update README.md
  Signed-off-by: umarmnaq <102142660+umarmnaq@users.noreply.github.com>
* fix embed4all filename
  https://discordapp.com/channels/1076964370942267462/1093558720690143283/1161778216462192692
  Signed-off-by: Aaron Miller <apage43@ninjawhale.com>
* Improves Java API signatures maintaining back compatibility
* python: replace deprecated pkg_resources with importlib (#1505)
* Updated chat wishlist (#1351)
* q6k, q4_1 mat*mat
* update mini-orca 3b to gguf2, license
  Signed-off-by: Aaron Miller <apage43@ninjawhale.com>
* convert scripts: fix AutoConfig typo (#1512)
* publish config https://docs.npmjs.com/cli/v9/configuring-npm/package-json#publishconfig (#1375); merge into my branch
* fix appendBin
* fix gpu not initializing first
* sync up
* progress, still wip on destructor
* some detection work
* untested dispose method
* add js side of dispose
* Update gpt4all-bindings/typescript/index.cc (Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>; Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>)
* Update gpt4all-bindings/typescript/index.cc (Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>; Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>)
* Update gpt4all-bindings/typescript/index.cc (Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>; Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>)
* Update gpt4all-bindings/typescript/src/gpt4all.d.ts (Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>; Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>)
* Update gpt4all-bindings/typescript/src/gpt4all.js (Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>; Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>)
* Update gpt4all-bindings/typescript/src/util.js (Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>; Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>)
* fix tests
* fix circleci for nodejs
* bump version

---------
Signed-off-by: Aaron Miller <apage43@ninjawhale.com>
Signed-off-by: umarmnaq <102142660+umarmnaq@users.noreply.github.com>
Signed-off-by: Jacob Nguyen <76754747+jacoobes@users.noreply.github.com>
Co-authored-by: Aaron Miller <apage43@ninjawhale.com>
Co-authored-by: Adam Treat <treat.adam@gmail.com>
Co-authored-by: Akarshan Biswas <akarshan.biswas@gmail.com>
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jan Philipp Harries <jpdus@users.noreply.github.com>
Co-authored-by: umarmnaq <102142660+umarmnaq@users.noreply.github.com>
Co-authored-by: Alex Soto <asotobu@gmail.com>
Co-authored-by: niansa/tuxifan <tuxifan@posteo.de>
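For orientation, the headline changes for the TypeScript bindings (GPU device selection at load time and explicit disposal of the native model) can be exercised roughly as follows. This is a minimal sketch based on the spec files touched in this PR; the model filename is only an example, and `loadModel`/`createCompletion` are the package's existing public exports.

```js
import { loadModel, createCompletion } from 'gpt4all'

// Ask for the best available GPU; 'cpu', a vendor name, or a specific GPU name also work.
const model = await loadModel('mistral-7b-openorca.Q4_0.gguf', {
    verbose: true,
    device: 'gpu',
})

const completion = await createCompletion(model, [
    { role: 'user', content: 'What is 1 + 1?' },
])
console.log(completion.choices[0].message)

// New in this release: free the underlying native model when done.
model.dispose()
```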
@@ -856,6 +856,7 @@ jobs:
       - node/install-packages:
           app-dir: gpt4all-bindings/typescript
           pkg-manager: yarn
+          override-ci-command: yarn install
       - run:
           command: |
             cd gpt4all-bindings/typescript
@@ -885,6 +886,7 @@ jobs:
       - node/install-packages:
           app-dir: gpt4all-bindings/typescript
           pkg-manager: yarn
+          override-ci-command: yarn install
       - run:
           command: |
             cd gpt4all-bindings/typescript
@@ -994,7 +996,7 @@ jobs:
           command: |
             cd gpt4all-bindings/typescript
             npm set //registry.npmjs.org/:_authToken=$NPM_TOKEN
-            npm publish --access public --tag alpha
+            npm publish
 
 workflows:
   version: 2
gpt4all-bindings/typescript/.yarnrc.yml (new file, +1)
@@ -0,0 +1 @@
+nodeLinker: node-modules
@@ -75,15 +75,12 @@ cd gpt4all-bindings/typescript
 ```sh
 yarn
 ```
 
 * llama.cpp git submodule for gpt4all can be possibly absent. If this is the case, make sure to run in llama.cpp parent directory
 
 ```sh
 git submodule update --init --depth 1 --recursive
 ```
-
-**AS OF NEW BACKEND** to build the backend,
-
 ```sh
 yarn build:backend
 ```
@@ -1,6 +1,5 @@
 #include "index.h"
 
-Napi::FunctionReference NodeModelWrapper::constructor;
 
 Napi::Function NodeModelWrapper::GetClass(Napi::Env env) {
     Napi::Function self = DefineClass(env, "LLModel", {
@@ -13,14 +12,64 @@ Napi::Function NodeModelWrapper::GetClass(Napi::Env env) {
        InstanceMethod("embed", &NodeModelWrapper::GenerateEmbedding),
        InstanceMethod("threadCount", &NodeModelWrapper::ThreadCount),
        InstanceMethod("getLibraryPath", &NodeModelWrapper::GetLibraryPath),
+       InstanceMethod("initGpuByString", &NodeModelWrapper::InitGpuByString),
+       InstanceMethod("hasGpuDevice", &NodeModelWrapper::HasGpuDevice),
+       InstanceMethod("listGpu", &NodeModelWrapper::GetGpuDevices),
+       InstanceMethod("memoryNeeded", &NodeModelWrapper::GetRequiredMemory),
+       InstanceMethod("dispose", &NodeModelWrapper::Dispose)
     });
     // Keep a static reference to the constructor
     //
-    constructor = Napi::Persistent(self);
-    constructor.SuppressDestruct();
+    Napi::FunctionReference* constructor = new Napi::FunctionReference();
+    *constructor = Napi::Persistent(self);
+    env.SetInstanceData(constructor);
     return self;
+  }
+
+  Napi::Value NodeModelWrapper::GetRequiredMemory(const Napi::CallbackInfo& info)
+  {
+    auto env = info.Env();
+    return Napi::Number::New(env, static_cast<uint32_t>( llmodel_required_mem(GetInference(), full_model_path.c_str()) ));
+
+  }
+  Napi::Value NodeModelWrapper::GetGpuDevices(const Napi::CallbackInfo& info)
+  {
+    auto env = info.Env();
+    int num_devices = 0;
+    auto mem_size = llmodel_required_mem(GetInference(), full_model_path.c_str());
+    llmodel_gpu_device* all_devices = llmodel_available_gpu_devices(GetInference(), mem_size, &num_devices);
+    if(all_devices == nullptr) {
+      Napi::Error::New(
+        env,
+        "Unable to retrieve list of all GPU devices"
+      ).ThrowAsJavaScriptException();
+      return env.Undefined();
+    }
+    auto js_array = Napi::Array::New(env, num_devices);
+    for(int i = 0; i < num_devices; ++i) {
+      auto gpu_device = all_devices[i];
+      /*
+       *
+       * struct llmodel_gpu_device {
+       *   int index = 0;
+       *   int type = 0;           // same as VkPhysicalDeviceType
+       *   size_t heapSize = 0;
+       *   const char * name;
+       *   const char * vendor;
+       * };
+       *
+       */
+      Napi::Object js_gpu_device = Napi::Object::New(env);
+      js_gpu_device["index"] = uint32_t(gpu_device.index);
+      js_gpu_device["type"] = uint32_t(gpu_device.type);
+      js_gpu_device["heapSize"] = static_cast<uint32_t>( gpu_device.heapSize );
+      js_gpu_device["name"]= gpu_device.name;
+      js_gpu_device["vendor"] = gpu_device.vendor;
+
+      js_array[i] = js_gpu_device;
+    }
+    return js_array;
 }
 
 Napi::Value NodeModelWrapper::getType(const Napi::CallbackInfo& info)
 {
     if(type.empty()) {
@@ -29,15 +78,41 @@ Napi::Function NodeModelWrapper::GetClass(Napi::Env env) {
     return Napi::String::New(info.Env(), type);
 }
 
+Napi::Value NodeModelWrapper::InitGpuByString(const Napi::CallbackInfo& info)
+{
+    auto env = info.Env();
+    uint32_t memory_required = info[0].As<Napi::Number>();
+
+    std::string gpu_device_identifier = info[1].As<Napi::String>();
+
+    size_t converted_value;
+    if(memory_required <= std::numeric_limits<size_t>::max()) {
+        converted_value = static_cast<size_t>(memory_required);
+    } else {
+        Napi::Error::New(
+            env,
+            "invalid number for memory size. Exceeded bounds for memory."
+        ).ThrowAsJavaScriptException();
+        return env.Undefined();
+    }
+
+    auto result = llmodel_gpu_init_gpu_device_by_string(GetInference(), converted_value, gpu_device_identifier.c_str());
+    return Napi::Boolean::New(env, result);
+}
+Napi::Value NodeModelWrapper::HasGpuDevice(const Napi::CallbackInfo& info)
+{
+    return Napi::Boolean::New(info.Env(), llmodel_has_gpu_device(GetInference()));
+}
+
 NodeModelWrapper::NodeModelWrapper(const Napi::CallbackInfo& info) : Napi::ObjectWrap<NodeModelWrapper>(info)
 {
     auto env = info.Env();
     fs::path model_path;
 
-    std::string full_weight_path;
-    //todo
-    std::string library_path = ".";
-    std::string model_name;
+    std::string full_weight_path,
+                library_path = ".",
+                model_name,
+                device;
     if(info[0].IsString()) {
         model_path = info[0].As<Napi::String>().Utf8Value();
         full_weight_path = model_path.string();
@@ -56,13 +131,14 @@ Napi::Function NodeModelWrapper::GetClass(Napi::Env env) {
         } else {
             library_path = ".";
         }
+        device = config_object.Get("device").As<Napi::String>();
     }
     llmodel_set_implementation_search_path(library_path.c_str());
     llmodel_error e = {
         .message="looks good to me",
         .code=0,
     };
-    inference_ = std::make_shared<llmodel_model>(llmodel_model_create2(full_weight_path.c_str(), "auto", &e));
+    inference_ = llmodel_model_create2(full_weight_path.c_str(), "auto", &e);
     if(e.code != 0) {
         Napi::Error::New(env, e.message).ThrowAsJavaScriptException();
         return;
@@ -74,18 +150,45 @@ Napi::Function NodeModelWrapper::GetClass(Napi::Env env) {
         Napi::Error::New(env, "Had an issue creating llmodel object, inference is null").ThrowAsJavaScriptException();
         return;
     }
+    if(device != "cpu") {
+        size_t mem = llmodel_required_mem(GetInference(), full_weight_path.c_str());
+        if(mem == 0) {
+            std::cout << "WARNING: no memory needed. does this model support gpu?\n";
+        }
+        std::cout << "Initiating GPU\n";
+        std::cout << "Memory required estimation: " << mem << "\n";
+
+        auto success = llmodel_gpu_init_gpu_device_by_string(GetInference(), mem, device.c_str());
+        if(success) {
+            std::cout << "GPU init successfully\n";
+        } else {
+            std::cout << "WARNING: Failed to init GPU\n";
+        }
+    }
+
     auto success = llmodel_loadModel(GetInference(), full_weight_path.c_str());
     if(!success) {
         Napi::Error::New(env, "Failed to load model at given path").ThrowAsJavaScriptException();
         return;
     }
-    name = model_name.empty() ? model_path.filename().string() : model_name;
-};
-//NodeModelWrapper::~NodeModelWrapper() {
-    //GetInference().reset();
-//}
 
+    name = model_name.empty() ? model_path.filename().string() : model_name;
+    full_model_path = full_weight_path;
+};
+
+// NodeModelWrapper::~NodeModelWrapper() {
+//     if(GetInference() != nullptr) {
+//         std::cout << "Debug: deleting model\n";
+//         llmodel_model_destroy(inference_);
+//         std::cout << (inference_ == nullptr);
+//     }
+// }
+// void NodeModelWrapper::Finalize(Napi::Env env) {
+//     if(inference_ != nullptr) {
+//         std::cout << "Debug: deleting model\n";
+//
+//     }
+// }
 Napi::Value NodeModelWrapper::IsModelLoaded(const Napi::CallbackInfo& info) {
     return Napi::Boolean::New(info.Env(), llmodel_isModelLoaded(GetInference()));
 }
@@ -193,8 +296,9 @@ Napi::Function NodeModelWrapper::GetClass(Napi::Env env) {
     std::string copiedQuestion = question;
     PromptWorkContext pc = {
         copiedQuestion,
-        std::ref(inference_),
+        inference_,
         copiedPrompt,
+        ""
     };
     auto threadSafeContext = new TsfnContext(env, pc);
     threadSafeContext->tsfn = Napi::ThreadSafeFunction::New(
@@ -210,7 +314,9 @@ Napi::Function NodeModelWrapper::GetClass(Napi::Env env) {
     threadSafeContext->nativeThread = std::thread(threadEntry, threadSafeContext);
     return threadSafeContext->deferred_.Promise();
 }
-
+void NodeModelWrapper::Dispose(const Napi::CallbackInfo& info) {
+    llmodel_model_destroy(inference_);
+}
 void NodeModelWrapper::SetThreadCount(const Napi::CallbackInfo& info) {
     if(info[0].IsNumber()) {
         llmodel_setThreadCount(GetInference(), info[0].As<Napi::Number>().Int64Value());
@@ -233,7 +339,7 @@ Napi::Function NodeModelWrapper::GetClass(Napi::Env env) {
 }
 
 llmodel_model NodeModelWrapper::GetInference() {
-    return *inference_;
+    return inference_;
 }
 
 //Exports Bindings
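The `InstanceMethod` registrations above are what surface the new native calls to JavaScript. A hedged sketch of poking the raw wrapper directly, mirroring the spec script later in this diff (`model.llm` is the native `LLModel` instance created by this addon; the model filename is only an example):

```js
import { loadModel } from 'gpt4all'

const model = await loadModel('mistral-7b-openorca.Q4_0.gguf', { verbose: true, device: 'gpu' })
const ll = model.llm // native NodeModelWrapper instance

console.log('Required Mem in bytes', ll.memoryNeeded()) // backed by llmodel_required_mem
console.log('Has GPU', ll.hasGpuDevice())               // backed by llmodel_has_gpu_device
console.log('gpu devices', ll.listGpu())                // objects shaped like llmodel_gpu_device

ll.dispose() // backed by llmodel_model_destroy; the wrapper must not be used afterwards
```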
@@ -6,24 +6,33 @@
 #include <atomic>
 #include <memory>
 #include <filesystem>
+#include <set>
 namespace fs = std::filesystem;
 
 
 class NodeModelWrapper: public Napi::ObjectWrap<NodeModelWrapper> {
 public:
   NodeModelWrapper(const Napi::CallbackInfo &);
-  //~NodeModelWrapper();
+  //virtual ~NodeModelWrapper();
   Napi::Value getType(const Napi::CallbackInfo& info);
   Napi::Value IsModelLoaded(const Napi::CallbackInfo& info);
   Napi::Value StateSize(const Napi::CallbackInfo& info);
+  //void Finalize(Napi::Env env) override;
   /**
    * Prompting the model. This entails spawning a new thread and adding the response tokens
    * into a thread local string variable.
    */
   Napi::Value Prompt(const Napi::CallbackInfo& info);
   void SetThreadCount(const Napi::CallbackInfo& info);
+  void Dispose(const Napi::CallbackInfo& info);
   Napi::Value getName(const Napi::CallbackInfo& info);
   Napi::Value ThreadCount(const Napi::CallbackInfo& info);
   Napi::Value GenerateEmbedding(const Napi::CallbackInfo& info);
+  Napi::Value HasGpuDevice(const Napi::CallbackInfo& info);
+  Napi::Value ListGpus(const Napi::CallbackInfo& info);
+  Napi::Value InitGpuByString(const Napi::CallbackInfo& info);
+  Napi::Value GetRequiredMemory(const Napi::CallbackInfo& info);
+  Napi::Value GetGpuDevices(const Napi::CallbackInfo& info);
   /*
    * The path that is used to search for the dynamic libraries
    */
@@ -37,10 +46,10 @@ private:
   /**
    * The underlying inference that interfaces with the C interface
    */
-  std::shared_ptr<llmodel_model> inference_;
+  llmodel_model inference_;
 
   std::string type;
   // corresponds to LLModel::name() in typescript
   std::string name;
-  static Napi::FunctionReference constructor;
+  std::string full_model_path;
 };
@@ -1,6 +1,6 @@
 {
   "name": "gpt4all",
-  "version": "2.2.0",
+  "version": "3.0.0",
   "packageManager": "yarn@3.6.1",
   "main": "src/gpt4all.js",
   "repository": "nomic-ai/gpt4all",
@@ -47,5 +47,10 @@
   },
   "jest": {
     "verbose": true
+  },
+  "publishConfig": {
+    "registry": "https://registry.npmjs.org/",
+    "access": "public",
+    "tag": "latest"
   }
 }
@@ -30,7 +30,7 @@ void threadEntry(TsfnContext* context) {
     context->tsfn.BlockingCall(&context->pc,
         [](Napi::Env env, Napi::Function jsCallback, PromptWorkContext* pc) {
             llmodel_prompt(
-                *pc->inference_,
+                pc->inference_,
                 pc->question.c_str(),
                 &prompt_callback,
                 &response_callback,
@@ -10,7 +10,7 @@
 #include <memory>
 struct PromptWorkContext {
     std::string question;
-    std::shared_ptr<llmodel_model>& inference_;
+    llmodel_model inference_;
     llmodel_prompt_context prompt_params;
     std::string res;
 
@@ -1,8 +1,8 @@
 import { LLModel, createCompletion, DEFAULT_DIRECTORY, DEFAULT_LIBRARIES_DIRECTORY, loadModel } from '../src/gpt4all.js'
 
 const model = await loadModel(
-    'orca-mini-3b-gguf2-q4_0.gguf',
-    { verbose: true }
+    'mistral-7b-openorca.Q4_0.gguf',
+    { verbose: true, device: 'gpu' }
 );
 const ll = model.llm;
 
@@ -26,7 +26,9 @@ console.log("name " + ll.name());
 console.log("type: " + ll.type());
 console.log("Default directory for models", DEFAULT_DIRECTORY);
 console.log("Default directory for libraries", DEFAULT_LIBRARIES_DIRECTORY);
-
+console.log("Has GPU", ll.hasGpuDevice());
+console.log("gpu devices", ll.listGpu())
+console.log("Required Mem in bytes", ll.memoryNeeded())
 const completion1 = await createCompletion(model, [
     { role : 'system', content: 'You are an advanced mathematician.' },
     { role : 'user', content: 'What is 1 + 1?' },
@@ -40,6 +42,8 @@ const completion2 = await createCompletion(model, [
 
 console.log(completion2.choices[0].message)
 
+//CALLING DISPOSE WILL INVALID THE NATIVE MODEL. USE THIS TO CLEANUP
+model.dispose()
 // At the moment, from testing this code, concurrent model prompting is not possible.
 // Behavior: The last prompt gets answered, but the rest are cancelled
 // my experience with threading is not the best, so if anyone who is good is willing to give this a shot,
@@ -47,16 +51,16 @@ console.log(completion2.choices[0].message)
 // INFO: threading with llama.cpp is not the best maybe not even possible, so this will be left here as reference
 
 //const responses = await Promise.all([
-//   createCompletion(ll, [
+//   createCompletion(model, [
 //     { role : 'system', content: 'You are an advanced mathematician.' },
 //     { role : 'user', content: 'What is 1 + 1?' },
 //   ], { verbose: true }),
-//   createCompletion(ll, [
+//   createCompletion(model, [
 //     { role : 'system', content: 'You are an advanced mathematician.' },
 //     { role : 'user', content: 'What is 1 + 1?' },
 //   ], { verbose: true }),
 //
-//createCompletion(ll, [
+//createCompletion(model, [
 //  { role : 'system', content: 'You are an advanced mathematician.' },
 //  { role : 'user', content: 'What is 1 + 1?' },
 //], { verbose: true })
@@ -1,8 +1,6 @@
 import { loadModel, createEmbedding } from '../src/gpt4all.js'
 
-const embedder = await loadModel("ggml-all-MiniLM-L6-v2-f16.bin", { verbose: true })
+const embedder = await loadModel("ggml-all-MiniLM-L6-v2-f16.bin", { verbose: true, type: 'embedding'})
 
-console.log(
-    createEmbedding(embedder, "Accept your current situation")
-)
+console.log(createEmbedding(embedder, "Accept your current situation"))
 
gpt4all-bindings/typescript/src/gpt4all.d.ts (vendored, 68 lines changed)
@@ -61,6 +61,11 @@ declare class InferenceModel {
         prompt: string,
         options?: Partial<LLModelPromptContext>
     ): Promise<string>;
+
+    /**
+     * delete and cleanup the native model
+     */
+    dispose(): void
 }
 
 declare class EmbeddingModel {
@@ -69,6 +74,12 @@ declare class EmbeddingModel {
     config: ModelConfig;
 
     embed(text: string): Float32Array;
+
+    /**
+     * delete and cleanup the native model
+     */
+    dispose(): void
+
 }
 
 /**
@@ -146,6 +157,41 @@ declare class LLModel {
      * Where to get the pluggable backend libraries
      */
     getLibraryPath(): string;
+    /**
+     * Initiate a GPU by a string identifier.
+     * @param {number} memory_required Should be in the range size_t or will throw
+     * @param {string} device_name 'amd' | 'nvidia' | 'intel' | 'gpu' | gpu name.
+     * read LoadModelOptions.device for more information
+     */
+    initGpuByString(memory_required: number, device_name: string): boolean
+    /**
+     * From C documentation
+     * @returns True if a GPU device is successfully initialized, false otherwise.
+     */
+    hasGpuDevice(): boolean
+    /**
+     * GPUs that are usable for this LLModel
+     * @returns
+     */
+    listGpu() : GpuDevice[]
+
+    /**
+     * delete and cleanup the native model
+     */
+    dispose(): void
+}
+/**
+ * an object that contains gpu data on this machine.
+ */
+interface GpuDevice {
+    index: number;
+    /**
+     * same as VkPhysicalDeviceType
+     */
+    type: number;
+    heapSize : number;
+    name: string;
+    vendor: string;
 }
 
 interface LoadModelOptions {
@@ -154,6 +200,21 @@ interface LoadModelOptions {
     modelConfigFile?: string;
     allowDownload?: boolean;
     verbose?: boolean;
+    /* The processing unit on which the model will run. It can be set to
+     * - "cpu": Model will run on the central processing unit.
+     * - "gpu": Model will run on the best available graphics processing unit, irrespective of its vendor.
+     * - "amd", "nvidia", "intel": Model will run on the best available GPU from the specified vendor.
+
+       Alternatively, a specific GPU name can also be provided, and the model will run on the GPU that matches the name
+       if it's available.
+
+       Default is "cpu".
+
+       Note: If a GPU device lacks sufficient RAM to accommodate the model, an error will be thrown, and the GPT4All
+       instance will be rendered invalid. It's advised to ensure the device has enough memory before initiating the
+       model.
+    */
+    device?: string;
 }
 
 interface InferenceModelOptions extends LoadModelOptions {
@@ -184,7 +245,7 @@ declare function loadModel(
 
 declare function loadModel(
     modelName: string,
-    options?: EmbeddingOptions | InferenceOptions
+    options?: EmbeddingModelOptions | InferenceModelOptions
 ): Promise<InferenceModel | EmbeddingModel>;
 
 /**
@@ -401,7 +462,7 @@ declare const DEFAULT_MODEL_CONFIG: ModelConfig;
 /**
  * Default prompt context.
  */
-declare const DEFAULT_PROMT_CONTEXT: LLModelPromptContext;
+declare const DEFAULT_PROMPT_CONTEXT: LLModelPromptContext;
 
 /**
  * Default model list url.
@@ -502,7 +563,7 @@ export {
     DEFAULT_DIRECTORY,
     DEFAULT_LIBRARIES_DIRECTORY,
     DEFAULT_MODEL_CONFIG,
-    DEFAULT_PROMT_CONTEXT,
+    DEFAULT_PROMPT_CONTEXT,
     DEFAULT_MODEL_LIST_URL,
     downloadModel,
     retrieveModel,
@@ -510,4 +571,5 @@ export {
     DownloadController,
     RetrieveModelOptions,
     DownloadModelOptions,
+    GpuDevice
 };
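Putting the new typings together, one plausible pattern (an assumption of this sketch, not something the bindings prescribe) is to enumerate `GpuDevice` entries via `listGpu()` and feed a chosen device name back through `LoadModelOptions.device`:

```js
import { loadModel } from 'gpt4all'

// Probe pass on the CPU purely to enumerate devices (illustrative only).
const probe = await loadModel('mistral-7b-openorca.Q4_0.gguf', { device: 'cpu' })
const gpus = probe.llm.listGpu() // GpuDevice[]: { index, type, heapSize, name, vendor }
probe.dispose()

let device = 'cpu'
if (gpus.length > 0) {
    // Prefer the device with the largest reported heapSize.
    const best = gpus.reduce((a, b) => (a.heapSize >= b.heapSize ? a : b))
    device = best.name
}
const model = await loadModel('mistral-7b-openorca.Q4_0.gguf', { device })
```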
@@ -34,6 +34,7 @@ async function loadModel(modelName, options = {}) {
         type: "inference",
         allowDownload: true,
         verbose: true,
+        device: 'cpu',
         ...options,
     };
 
@@ -61,13 +62,13 @@ async function loadModel(modelName, options = {}) {
         model_name: appendBinSuffixIfMissing(modelName),
         model_path: loadOptions.modelPath,
         library_path: libPath,
+        device: loadOptions.device,
     };
-
     if (loadOptions.verbose) {
         console.debug("Creating LLModel with options:", llmOptions);
     }
     const llmodel = new LLModel(llmOptions);
 
     if (loadOptions.type === "embedding") {
         return new EmbeddingModel(llmodel, modelConfig);
     } else if (loadOptions.type === "inference") {
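The new `device: 'cpu'` default above sits before `...options` in the object literal, so a caller-supplied value wins. A tiny self-contained sketch of that merge semantics (the `defaults`/`options` names here are illustrative; `loadOptions` mirrors the local in `loadModel`):

```js
const defaults = { type: 'inference', allowDownload: true, verbose: true, device: 'cpu' }
const options = { device: 'gpu', verbose: false }

// Later spreads override earlier ones, so the caller's device survives.
const loadOptions = { ...defaults, ...options }
console.log(loadOptions.device)  // 'gpu'
console.log(loadOptions.verbose) // false
```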
@@ -15,6 +15,10 @@ class InferenceModel {
         const result = this.llm.raw_prompt(prompt, normalizedPromptContext, () => {});
         return result;
     }
+
+    dispose() {
+        this.llm.dispose();
+    }
 }
 
 class EmbeddingModel {
@@ -29,6 +33,10 @@ class EmbeddingModel {
     embed(text) {
         return this.llm.embed(text)
     }
+
+    dispose() {
+        this.llm.dispose();
+    }
 }
 
 
@@ -43,8 +43,9 @@ async function listModels(
 }
 
 function appendBinSuffixIfMissing(name) {
-    if (!name.endsWith(".bin")) {
-        return name + ".bin";
+    const ext = path.extname(name);
+    if (![".bin", ".gguf"].includes(ext)) {
+        return name + ".gguf";
     }
     return name;
 }
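For clarity, here is the updated helper as a standalone sketch showing the behavior the change implies: names without a `.bin` or `.gguf` extension now get `.gguf` appended instead of `.bin`.

```js
import path from 'node:path'

// Standalone copy of the updated helper, for illustration only.
function appendBinSuffixIfMissing(name) {
    const ext = path.extname(name)
    if (!['.bin', '.gguf'].includes(ext)) {
        return name + '.gguf'
    }
    return name
}

console.log(appendBinSuffixIfMissing('filename'))             // 'filename.gguf'
console.log(appendBinSuffixIfMissing('filename.bin'))         // 'filename.bin'
console.log(appendBinSuffixIfMissing('mistral-7b.Q4_0.gguf')) // 'mistral-7b.Q4_0.gguf'
```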
@@ -92,7 +92,7 @@ describe("listModels", () => {
 
 describe("appendBinSuffixIfMissing", () => {
     it("should make sure the suffix is there", () => {
-        expect(appendBinSuffixIfMissing("filename")).toBe("filename.bin");
+        expect(appendBinSuffixIfMissing("filename")).toBe("filename.gguf");
         expect(appendBinSuffixIfMissing("filename.bin")).toBe("filename.bin");
     });
 });
@@ -156,11 +156,11 @@ describe("downloadModel", () => {
     test("should successfully download a model file", async () => {
         const downloadController = downloadModel(fakeModelName);
         const modelFilePath = await downloadController.promise;
-        expect(modelFilePath).toBe(path.resolve(DEFAULT_DIRECTORY, `${fakeModelName}.bin`));
+        expect(modelFilePath).toBe(path.resolve(DEFAULT_DIRECTORY, `${fakeModelName}.gguf`));
 
         expect(global.fetch).toHaveBeenCalledTimes(1);
         expect(global.fetch).toHaveBeenCalledWith(
-            "https://gpt4all.io/models/fake-model.bin",
+            "https://gpt4all.io/models/gguf/fake-model.gguf",
             {
                 signal: "signal",
                 headers: {
@@ -189,7 +189,7 @@ describe("downloadModel", () => {
         expect(global.fetch).toHaveBeenCalledTimes(1);
         // the file should be missing
         await expect(
-            fsp.access(path.resolve(DEFAULT_DIRECTORY, `${fakeModelName}.bin`))
+            fsp.access(path.resolve(DEFAULT_DIRECTORY, `${fakeModelName}.gguf`))
         ).rejects.toThrow();
         // partial file should also be missing
         await expect(
@@ -3,8 +3,8 @@
         "order": "a",
         "md5sum": "08d6c05a21512a79a1dfeb9d2a8f262f",
         "name": "Not a real model",
-        "filename": "fake-model.bin",
+        "filename": "fake-model.gguf",
         "filesize": "4",
         "systemPrompt": " "
     }
 ]
File diff suppressed because it is too large