Phi-3-Sprachmodelle mit der ONNX Runtime generate() API ausführen

Einleitung

Phi-3 und Phi 3.5 ONNX-Modelle werden auf HuggingFace gehostet und können mit der ONNX Runtime generate() API ausgeführt werden.

Die Versionen Mini (3,3B) und Medium (14B) sind jetzt mit Unterstützung verfügbar. Sowohl Mini als auch Medium verfügen über eine Kurzkontext-Version (4k) und eine Langkontext-Version (128k). Die Langkontext-Version kann deutlich längere Prompts akzeptieren und längere Ausgabetexte erzeugen, verbraucht aber mehr Speicher.

Verfügbare Modelle sind

Dieses Tutorial demonstriert den Download und die Ausführung der Mini-Variante (3B) des Phi-3-Modells mit kurzem Kontext (4k). Informationen zu den Download-Befehlen für andere Varianten finden Sie in der Modellreferenz.

Dieses Tutorial lädt die Mini-Variante (3B) des Phi-3-Modells mit kurzem Kontext (4k) herunter und führt sie aus. Informationen zu den Download-Befehlen für andere Varianten finden Sie in der Modellreferenz.

Einrichtung
Wählen Sie Ihre Plattform
Ausführung mit DirectML
Ausführung mit NVIDIA CUDA
Ausführung auf der CPU
Phi-3 ONNX Modellreferenz

Setup

Installieren Sie die Git Large File Storage (LFS) Erweiterung

HuggingFace verwendet git zur Versionskontrolle. Um die ONNX-Modelle herunterzuladen, benötigen Sie git lfs, falls Sie es noch nicht installiert haben.
- Windows: winget install -e --id GitHub.GitLFS (Wenn Sie winget nicht haben, laden Sie die exe von der offiziellen Quelle herunter und führen Sie sie aus)
- Linux: apt-get install git-lfs
- MacOS: brew install git-lfs
Führen Sie dann git lfs install aus
Installieren Sie die HuggingFace CLI
```
pip install huggingface-hub[cli]
```

Wählen Sie Ihre Plattform

Haben Sie einen Windows-Computer mit GPU?

Ich weiß es nicht → Lesen Sie diesen Leitfaden, um zu sehen, ob Sie eine GPU auf Ihrem Windows-Computer haben und bestätigen Sie, dass Ihre GPU DirectML-kompatibel ist.
Ja → Folgen Sie den Anweisungen für DirectML.
Nein → Haben Sie eine NVIDIA GPU?
- Ich weiß es nicht → Lesen Sie diesen Leitfaden, um zu sehen, ob Sie eine CUDA-fähige GPU haben.
- Ja → Folgen Sie den Anweisungen für NVIDIA CUDA GPU.
- Nein → Folgen Sie den Anweisungen für CPU.

Hinweis: Es wird nur ein Paket und Modell basierend auf Ihrer Hardware benötigt. Das heißt, führen Sie die Schritte nur für einen der folgenden Abschnitte aus.

Ausführung mit DirectML

Modell herunterladen

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .

Dieser Befehl lädt das Modell in einen Ordner namens directml.

Die generate() API installieren
```
pip install --pre onnxruntime-genai-directml
```
Sie sollten nun onnxruntime-genai-directml in Ihrer pip list sehen.

Führen Sie das Modell aus

Führen Sie das Modell mit phi3-qa.py aus.

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

Sobald das Skript das Modell geladen hat, wird es Sie in einer Schleife zur Eingabe auffordern und die Ausgabe streamen, sobald sie vom Modell produziert wird. Zum Beispiel

Input: Tell me a joke about GPUs

Certainly! Here\'s a light-hearted joke about GPUs:

Why did the GPU go to school? Because it wanted to improve its "processing power"!

This joke plays on the double meaning of "processing power," referring both to the computational abilities of a GPU and the idea of a student wanting to improve their academic skills.

Ausführung mit NVIDIA CUDA

Modell herunterladen

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .

Dieser Befehl lädt das Modell in einen Ordner namens cuda.

Die generate() API installieren

pip install --pre onnxruntime-genai-cuda

Führen Sie das Modell aus

Führen Sie das Modell mit phi3-qa.py aus.

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32  -e cuda

Sobald das Skript das Modell geladen hat, wird es Sie in einer Schleife zur Eingabe auffordern und die Ausgabe streamen, sobald sie vom Modell produziert wird. Zum Beispiel

Input: Tell me a joke about creative writing
 
Output:  Why don't writers ever get lost? Because they always follow the plot! 

Ausführung auf der CPU

Modell herunterladen

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .

Dieser Befehl lädt das Modell in einen Ordner namens cpu_and_mobile

Installieren Sie die generate() API für CPU
```
pip install --pre onnxruntime-genai
```

Führen Sie das Modell aus

Führen Sie das Modell mit phi3-qa.py aus.

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Sobald das Skript das Modell geladen hat, wird es Sie in einer Schleife zur Eingabe auffordern und die Ausgabe streamen, sobald sie vom Modell produziert wird. Zum Beispiel

Input: Tell me a joke about generative AI

Output:  Why did the generative AI go to school?

To improve its "creativity" algorithm!

Phi-3 ONNX Modellreferenz

Phi-3 mini 4k Kontext CPU

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 mini 4k Kontext CUDA

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 mini 4k Kontext DirectML

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

Phi-3 mini 128k Kontext CPU

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 mini 128k Kontext CUDA

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 mini 128k Kontext DirectML

huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include directml/* --local-dir .
python phi3-qa.py -m directml\directml-int4-awq-block-128 -e dml

Phi-3 medium 4k Kontext CPU

git clone https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-cpu
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-cpu/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 medium 4k Kontext CUDA

git clone https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-cuda
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 medium 4k Kontext DirectML

git clone https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-directml
python phi3-qa.py -m Phi-3-medium-4k-instruct-onnx-directml/directml-int4-awq-block-128 -e dml

Phi-3 medium 128k Kontext CPU

git clone https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cpu
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-cpu/cpu-int4-rtn-block-32-acc-level-4 -e cpu

Phi-3 medium 128k Kontext CUDA

git clone https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cuda
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-cuda/cuda-int4-rtn-block-32 -e cuda

Phi-3 medium 128k Kontext DirectML

git clone https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-directml/directml-int4-awq-block-128 -e dml

Phi-3.5 mini 128k Kontext CUDA

huggingface-cli download microsoft/Phi-3.5-mini-instruct-onnx --include cuda/cuda-int4-awq-block-128/* --local-dir .
python phi3-qa.py -m cuda/cuda-int4-awq-block-128 -e cuda

Phi-3.5 mini 128k Kontext CPU

huggingface-cli download microsoft/Phi-3.5-mini-instruct-onnx --include cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4/* --local-dir .
python phi3-qa.py -m cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4 -e cpu